# Problem

We need a starting architecture for our classifier so that we can run experiments easily. Such a classifier is already implemented in a POC by @dieko95 that you can find here; this PR only integrates that implementation into our current project.

# Solution

* Write a class that automates the entire pipeline for this classifier. The main goal is for the class to implement the process as multiple steps that can easily be changed or configured through parameters.
* Use a scheme for managing multiple experiments and their results: each experiment is identified by a branch name and an experiment name, and is stored in its own folder under the `.c4v` folder, which is located in `$HOME` by default.
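
To make the experiment layout concrete, here is a minimal sketch of how a branch name and an experiment name could be resolved to a folder. The helper `experiment_folder`, the constant `DEFAULT_BASE_FOLDER`, and the intermediate `experiments` folder are assumptions for illustration (the latter is only suggested by the paths in the training log under "How to test it"); the actual logic in `classifier.py` may differ.

```python
from pathlib import Path
from typing import Optional

# Assumed default: a ".c4v" folder inside the user's home directory.
DEFAULT_BASE_FOLDER = Path.home() / ".c4v"


def experiment_folder(
    branch_name: str,
    experiment_name: str,
    base_folder: Optional[Path] = None,
) -> Path:
    """Return the folder holding the files of one experiment (hypothetical helper)."""
    base = Path(base_folder) if base_folder is not None else DEFAULT_BASE_FOLDER
    # "experiments" is an assumed intermediate folder name.
    return base / "experiments" / branch_name / experiment_name


if __name__ == "__main__":
    # Resolves relative to the current user's home directory.
    print(experiment_folder("testing", "first_one"))
```

Grouping by branch first keeps related runs together, so several experiments of the same line of work can be compared side by side.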

# Relevant files

`src/c4v/classifier/classifier.py`: defines the `ClassifierExperiment` class, which automates the entire classification training process. Its main method is `run_experiment`; you can pass it a dict of training arguments (`train_args`) to override the default settings.
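
The override behaviour can be pictured as a plain dictionary merge. The sketch below is not the code from `classifier.py`: the name `resolve_train_args` and the default values are placeholders, and only `num_train_epochs` is taken from this PR.

```python
from typing import Any, Dict, Optional

# Placeholder defaults for illustration; the real defaults live in classifier.py.
DEFAULT_TRAIN_ARGS: Dict[str, Any] = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 10,
    "learning_rate": 5e-5,
}


def resolve_train_args(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge caller overrides on top of the defaults without mutating either dict."""
    return {**DEFAULT_TRAIN_ARGS, **(overrides or {})}


if __name__ == "__main__":
    # Only the overridden key changes; the other defaults are kept.
    print(resolve_train_args({"num_train_epochs": 3}))
```

Returning a fresh dict means one experiment's overrides cannot leak into the next one.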

# How to test it

Create a script, say `test.py`, in the project root folder

Write the following code:


    from c4v.classifier.classifier import ClassifierExperiment

    # branch name, experiment name
    experiment = ClassifierExperiment("testing", "first_one")

    print(experiment.run_experiment(train_args={'num_train_epochs': 3}))

* Run the script with Python
* The following results were obtained on Google Colab for the experiment above:

      Training completed. Do not forget to share your model on huggingface.co/models =)
      {'train_runtime': 1051.2505, 'train_samples_per_second': 5.479, 'train_steps_per_second': 0.548, 'train_loss': 0.6006555491023593, 'epoch': 3.0}
      100% 576/576 [17:31<00:00, 1.83s/it]
      Configuration saved in /experiments/testing/first_one/config.json
      Model weights saved in /experiments/testing/first_one/pytorch_model.bin
      Running Evaluation
        Num examples = 480
        Batch size = 10
      100% 48/48 [00:30<00:00, 1.58it/s]
                              metrics_value
      eval_loss                    0.429582
      eval_accuracy                0.797917
      eval_precision               0.775701
      eval_recall                  0.772093
      eval_f1                      0.773893
      eval_runtime                31.039800
      eval_samples_per_second     15.464000
      eval_steps_per_second        1.546000
      epoch                        3.000000
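
As a quick sanity check on the evaluation metrics above, F1 is the harmonic mean of precision and recall, so the reported `eval_f1` should be recoverable from `eval_precision` and `eval_recall`. The snippet below only reuses the numbers from the log; the variable names are illustrative.

```python
# Values copied from the evaluation output of this experiment.
precision = 0.775701
recall = 0.772093
reported_f1 = 0.773893

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# The recomputed value should agree with the reported one up to rounding.
assert abs(f1 - reported_f1) < 1e-5
print(f"recomputed F1: {f1:.6f}")
```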



# Further work
* Run more experiments to improve classification results
* Add a class to instantiate a classifier from an experiment
* Add a configuration manager to handle configuration variables such as the `BASE_C4V_FOLDER` variable in `classifier.py`
* Integrate this class into our architecture:
  *   Write a mapping from `[ScrapedData]` to a DataFrame
  *   Add a function to get data from `PersistencyManager` as a DataFrame
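
One possible shape for the configuration manager mentioned above is sketched here. It is only an illustration: the `Settings` class, the `from_env` constructor, and the `C4V_FOLDER` environment variable are hypothetical names that do not exist in the project yet.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for project-wide configuration values."""

    base_c4v_folder: Path

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        """Build settings from environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        # "C4V_FOLDER" is an assumed variable name, not an existing project setting.
        folder = env.get("C4V_FOLDER")
        base = Path(folder) if folder else Path.home() / ".c4v"
        return cls(base_c4v_folder=base)


if __name__ == "__main__":
    print(Settings.from_env().base_c4v_folder)
```

Passing the environment as a mapping keeps the class easy to test, since a test can supply a plain dict instead of modifying `os.environ`.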

code-for-venezuela / c4v-py

Luis/nlp classifier module #85
