Problem
We need a way to evaluate the accuracy of the base language model on newer data, so we can tell whether the model should be retrained. The required dataset can be built easily from scraped data, so we also need a way to create datasets from sets of ScrapedData instances.
Solution
Added a function to check the accuracy of the base language model: LanguageModel.eval_accuracy
Added the following functions:
LanguageModel.create_dataset_from_scraped_data, which creates a tokenized dataset from an iterable of scraped data
LanguageModel.to_pt_dataset (a utility function), which creates a PyTorch dataset from a tokenized batch of data (like the one returned by the previous function)
Added a function to the microscope.Manager object to check whether a language model should be retrained
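The retraining check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: accuracy, should_retrain, and the 0.8 threshold are hypothetical names and values, and the real signatures of LanguageModel.eval_accuracy and the microscope.Manager function may differ.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty dataset")
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def should_retrain(current_accuracy: float, threshold: float = 0.8) -> bool:
    """Hypothetical rule: retrain when accuracy on newer data drops below the threshold."""
    return current_accuracy < threshold


# Toy data standing in for model outputs on a dataset built from newer scraped data.
predicted = ["agua", "luz", "gas", "agua"]
expected = ["agua", "luz", "agua", "agua"]

acc = accuracy(predicted, expected)
retrain = should_retrain(acc)
```

A fixed threshold is the simplest possible rule; comparing against the accuracy recorded at training time would be another reasonable choice.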
Additional changes
Added the base class BaseModel to provide common operations for language models, such as the classifier or the base language model.
General improvements and refactors to:
Classifier class
ClassifierExperiment class and its related classes
LanguageModel class
LanguageModelExperiment class and its related classes
Added a sort_by function for the persistency manager, useful for requesting recently scraped data
Added the experiments_samples folder, which contains sample experiments you can copy and modify to run your own local experiments
Added new datasets: a classifier confirmation dataset and a new raw dataset from the client
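To illustrate why sorting is useful for requesting recently scraped data, here is a small standalone sketch. It does not use the persistency manager's actual sort_by interface, which is not shown in this description: ScrapedRecord, its last_scraped field, and most_recent are hypothetical stand-ins.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass
class ScrapedRecord:
    """Hypothetical stand-in for a stored record; not the project's ScrapedData."""
    url: str
    last_scraped: datetime


def most_recent(records: Iterable[ScrapedRecord], limit: int) -> List[ScrapedRecord]:
    """Return up to `limit` records, ordered from newest to oldest."""
    ordered = sorted(records, key=lambda r: r.last_scraped, reverse=True)
    return ordered[:limit]


records = [
    ScrapedRecord("https://example.com/a", datetime(2021, 5, 1)),
    ScrapedRecord("https://example.com/b", datetime(2021, 7, 15)),
    ScrapedRecord("https://example.com/c", datetime(2021, 6, 10)),
]

# Keep only the two most recently scraped records.
latest = most_recent(records, limit=2)
```

Records selected this way could then be used to build the evaluation dataset from newer data.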
Relevant files:
src/c4v/microscope/manager.py: added a function to check whether a language model should be retrained
src/c4v/classifier/language_model/language_model.py: added eval_accuracy and the dataset creation functions