Problem

We need a way to evaluate the accuracy of the base language model on newer data, so that we know whether the model should be retrained. The required dataset can be easily constructed from scraped data, so we also need a way to create datasets from sets of ScrapedData instances.

Solution

Added LanguageModel.eval_accuracy, a function to check the accuracy of the base language model
Added the functions:
- LanguageModel.create_dataset_from_scraped_data, which creates a tokenized dataset from an iterable of scraped data
- LanguageModel.to_pt_dataset (a utility function), which creates a PyTorch dataset from a tokenized batch of data (like the one returned by the previous function)
Added a function to the microscope.Manager object to check whether a language model should be retrained
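
As a rough illustration of the flow described above (build a tokenized dataset from scraped text, measure accuracy, decide whether to retrain), here is a minimal, self-contained sketch. It does not use the c4v API: the names `create_dataset`, `eval_accuracy`, `should_retrain`, the whitespace tokenizer, the next-token accuracy metric, and the `min_accuracy` threshold are all assumptions made for illustration, and the real signatures in this PR may differ.

```python
from typing import Callable, Iterable, List, Sequence

Tokens = List[str]


def create_dataset(texts: Iterable[str]) -> List[Tokens]:
    """Tokenize each non-empty text by whitespace (stand-in for a real tokenizer)."""
    return [text.split() for text in texts if text.strip()]


def eval_accuracy(
    predict_next: Callable[[Sequence[str]], str], dataset: List[Tokens]
) -> float:
    """Fraction of tokens predicted correctly from the tokens that precede them."""
    correct = 0
    total = 0
    for tokens in dataset:
        for i in range(1, len(tokens)):
            total += 1
            if predict_next(tokens[:i]) == tokens[i]:
                correct += 1
    # An empty dataset yields 0.0 rather than a division by zero.
    return correct / total if total else 0.0


def should_retrain(accuracy: float, min_accuracy: float = 0.6) -> bool:
    """Hypothetical rule: retrain when accuracy drops below the threshold."""
    return accuracy < min_accuracy


def toy_model(prefix: Sequence[str]) -> str:
    """Toy predictor that always alternates between 'a' and 'b'."""
    return "b" if prefix[-1] == "a" else "a"


dataset = create_dataset(["a b a b", "", "a a"])
accuracy = eval_accuracy(toy_model, dataset)
print(accuracy, should_retrain(accuracy))
```

The threshold is kept as a parameter because the acceptable accuracy depends on the model and the data; the default used here is arbitrary.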

Additional changes

Added the base class BaseModel to provide common operations for language models, such as the classifier or the base language model.
General improvements and refactors to:
- Classifier class
- ClassifierExperiment class and its related classes
- LanguageModel class
- LanguageModelExperiment class and its related classes
Added a sort_by function for the persistency manager, useful for requesting recently scraped data
Added the experiments_samples folder, which contains sample experiments you can copy and modify to run your own local experiments
Added new datasets: a classifier confirmation dataset and a new raw dataset from the client
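
The sort_by addition listed above is described as useful for requesting recently scraped data. A minimal sketch of that idea follows; the `Record` class, its `last_scraped` field, and the `sort_by` signature shown here are hypothetical stand-ins, not the persistency manager's actual interface.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, Iterable, List, TypeVar

T = TypeVar("T")


@dataclass
class Record:
    """Hypothetical stand-in for a stored scraped-data entry."""
    url: str
    last_scraped: datetime


def sort_by(
    items: Iterable[T],
    key: Callable[[T], Any],
    descending: bool = False,
    limit: int = -1,
) -> List[T]:
    """Return items ordered by key; a negative limit means no limit."""
    ordered = sorted(items, key=key, reverse=descending)
    return ordered if limit < 0 else ordered[:limit]


records = [
    Record("a", datetime(2021, 1, 1)),
    Record("b", datetime(2021, 6, 1)),
    Record("c", datetime(2021, 9, 1)),
]

# The two most recently scraped records, newest first.
recent = sort_by(records, key=lambda r: r.last_scraped, descending=True, limit=2)
print([r.url for r in recent])
```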

Relevant files:

src/c4v/microscope/manager.py : Added a function to check whether a language model should be retrained
src/c4v/classifier/language_model/language_model.py : Added eval_accuracy and the dataset creation functions

code-for-venezuela / c4v-py

Luis/language model eval #92
