Problem
Since many fields in the data we're gathering can take more than two labels, we need a classifier that can handle more than two labels.
Solution
1. Refactor the current classifier class to support an arbitrary number of labels
2. Provide a way to specify sets of labels
   a. Create a LabelSet base class that supports common label set operations, such as mapping from id to label and vice versa, providing the number of labels, and holding the string value of each label.
   b. Add LabelSet as a configuration parameter in the classifier class
   c. Add labelset as an argument to the ClassifierExperimentArgs class, so that it's easy to set up an experiment that runs with a specified label set
3. Create a training dataset for classification with more than two labels, using a column of our current dataset with multiple values as the test subject
   a. I chose the "servicio resumido" column because it is simple to understand and easy for me to check and classify manually.
   b. The confirmation dataset should be provided too
   c. The ScrapedData class, which works as the canonical schema for scraped data, was also updated, which meant the SQLiteManager class had to be updated as well
4. Provide a sample experiment with multiple labels as a test
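To make the label set idea concrete, here is a minimal sketch of what a LabelSet-style base class could look like. It is not the code in this PR: the `Enum`-based design, the method names (`num_labels`, `id2label`, `label2id`) and the `ServiceLabels` members are all assumptions made for illustration.

```python
from enum import Enum
from typing import Dict


class LabelSet(Enum):
    """Hypothetical base class: each member's value is the label's string form."""

    @classmethod
    def num_labels(cls) -> int:
        # Number of labels defined by the concrete label set
        return len(cls)

    @classmethod
    def id2label(cls) -> Dict[int, str]:
        # Ids are assigned by definition order, starting at 0
        return {i: member.value for i, member in enumerate(cls)}

    @classmethod
    def label2id(cls) -> Dict[str, int]:
        # Inverse mapping of id2label
        return {member.value: i for i, member in enumerate(cls)}


class ServiceLabels(LabelSet):
    """Illustrative label set; these label values are invented, not taken from the dataset."""

    WATER = "water"
    ELECTRICITY = "electricity"
    GAS = "gas"


num_labels = ServiceLabels.num_labels()
id2label = ServiceLabels.id2label()
label2id = ServiceLabels.label2id()
```

With a design along these lines, a classifier would only need `num_labels` to size its output layer and the two mappings to translate between predictions and readable labels, so switching tasks comes down to passing a different label set.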
Relevant files
src/c4v/classifier/classifier.py : Model refactor to allow arbitrary-sized label sets
src/c4v/classifier/classifier_experiment.py : Experiment refactor to add label_column and labelset as configuration parameters for the experiment
src/c4v/scraper/scraped_data_classes/scraped_data.py : Added a labelset object and reformatted the class in schema format
src/c4v/scraper/persistency_manager/sqlite_storage_manager.py : Updated to include the new data schema
data : Changes to the datasets:
raw/huggingface/primicia_only_bdd_ovsp_octubre.csv : added this dataset, which contains only relevant news from Primicia
relevance_confirmation_dataset.csv and relevance_training_dataset.csv : updated the old binary classification datasets to the new data format required by the general classification model
service_confirmation_dataset.csv and service_training_dataset.csv : added these datasets for service classification
experiment_samples : new experiment sample for service classification
Further work
Create a model for multi-label classification (Text -> [Label]), where a single text can receive several labels at once
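As a rough illustration of how the Text -> [Label] case differs from the one-label-per-text case, the sketch below encodes a list of labels as a multi-hot vector. The `multi_hot` helper and the label names are hypothetical and only show the target encoding, not a model.

```python
from typing import Dict, Iterable, List


def multi_hot(labels: Iterable[str], label2id: Dict[str, int]) -> List[int]:
    """Encode a collection of labels as a 0/1 vector with one slot per known label."""
    vector = [0] * len(label2id)
    for label in labels:
        vector[label2id[label]] = 1
    return vector


# Invented label names, for illustration only
label2id = {"water": 0, "electricity": 1, "gas": 2}

# One label per text (Text -> Label): the target is a single id
single_target = label2id["electricity"]

# Several labels per text (Text -> [Label]): the target is a multi-hot vector
multi_target = multi_hot(["water", "gas"], label2id)
```

The single-label target is one integer, while the multi-label target has one independent 0/1 entry per label, which is why a separate model is needed.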