Problem

To debug our classifier models after training, it is useful to be able to check their accuracy by hand against known articles. More generally, we would like a development suite that makes common operations easy when running classifier experiments.

Proposed Solution:

A few commands were added to improve the workflow for experimenting:

show : Displays the known information about a specific article, identified by its URL. The article does not need to be in the local database; the command can scrape it if possible. Example:

$ c4v show https://primicia.com.ve/placeres/mick-jagger-y-will-smith-recaudaran-fondos-en-india-con-un-concierto-virtual/
+==========================================================================================================================================================================================================+
    Url        : https://primicia.com.ve/placeres/mick-jagger-y-will-smith-recaudaran-fondos-en-india-con-un-concierto-virtual/
    Title      : Mick Jagger y Will Smith recaudarán fondos en India con un concierto virtual
    Author     : AFP
    Date       : sábado, 02 mayo 2020
    Categories : Bollywood, Concierto, Covid-19, Fondos, India, Recaudación
    Scraped    : 2021-08-26 17:01:53.058615+0000
============================================================================================================================================================================================================
El actor estadounidense Will Smith y la leyenda del rock Mick Jagger forman parte de las estrellas internacionales y de Bollywood que participarán en un concierto online de cuatro horas el domingo.
La intención es recaudar fondos para ayudar a luchar contra la epidemia de nuevo coronavirus en India.
El capitán del equipo nacional de críquet, Virat Kohli, los actores Priyanka Chopra y Shah Rukh Khan se encuentran entre las celebridades indias que actuarán o leerán mensajes desde su domicilio.
Organizado por Karan Johar y Zoya Akhtar, directores de Bollywood, la industria india del cine, el espectáculo será transmitido en directo en Facebook.
El objetivo es reunir millones de dólares para un centenar de organizaciones que proveen servicios esenciales y comida durante la epidemia.
Este dinero es necesario para ayudar a «todos aquellos que no tienen trabajo ni domicilio, y que no saben de dónde sacarán su próxima comida», explicaron los organizadores.
El confinamiento impuesto el 25 de marzo, y al menos hasta el 17 de mayo, a los 1.300 millones de indios dejó en riesgo a millones de trabajadores y dio un gran mazazo a la tercera economía de Asia.
Millones de trabajadores rurales están bloqueados en las ciudades con casi nada para vivir y alimentarse.
El sábado, se establecieron trenes para llevarles de vuelta a sus pueblos y ciudades de origen.
Las estrictas restricciones debían permitir mantener los casos de contagio de nuevo coronavirus a un nivel relativamente bajo en el segundo país más poblado del mundo: 37.335 casos y 1.218 fallecidos, con 2.000 nuevos contagios en las últimas 24 horas.
El balance podría en cambio ser mucho mayor, según los expertos, debido a la falta de tests de detección y de recogida de datos.
Ten la información al instante en tu celular. Únete al grupo de Diario Primicia en WhatsApp a través del siguiente link:
https://chat.whatsapp.com/H3jktHpqn4cKVS4NZdKEuj
También estamos en Telegram como @DiarioPrimicia, únete aquí
https://t.me/diarioprimicia
+==========================================================================================================================================================================================================+

classify : Shows the classification output for the given article, identified by its URL. It requires the branch and experiment name to find the corresponding trained model. For example:

$ c4v classify my_branch/my_experiment https://primicia.com.ve/placeres/mick-jagger-y-will-smith-recaudaran-fondos-en-india-con-un-concierto-virtual/
2021-08-26 13:42:05.228785: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
    https://primicia.com.ve/placeres/mick-jagger-y-will-smith-recaudaran-fondos-en-india-con-un-concierto-virtual/
            * label : IRRELEVANTE
            * scores : [[0.5141750574111938, 0.48582497239112854]]
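
The two scores in this output are close to each other, which is worth noticing when checking a model by hand. The following is a minimal sketch of how one might read them, assuming the scores are per-class probabilities and that the reported label is the class with the highest score. The label order and the `OTHER` placeholder name are assumptions for illustration; they do not come from the library.

```python
# Scores copied from the `classify` output above.
scores = [[0.5141750574111938, 0.48582497239112854]]

# ASSUMPTION: index 0 maps to IRRELEVANTE; "OTHER" is a placeholder
# for the second class, whose real name is not shown in the output.
labels = ["IRRELEVANTE", "OTHER"]


def read_scores(row, labels):
    """Return the top label and its margin over the runner-up score."""
    ranked = sorted(zip(row, labels), reverse=True)
    (best_score, best_label), (second_score, _) = ranked[0], ranked[1]
    return best_label, best_score - second_score


label, margin = read_scores(scores[0], labels)
print(f"{label} (margin over runner-up: {margin:.3f})")
```

A small margin like this one suggests the prediction is a close call and a good candidate for manual review.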

explain : Shows the attention given by the model to each token in a given string. It requires the branch name and experiment name to find the model to use when classifying. For example:

$ c4v explain my_branch/my_test "protestan por falta de agua en acarigua"
2021-08-26 13:49:09.686068: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
<IPython.core.display.HTML object>
Predicted Label: IRRELEVANTE
Scores:
    * <s> : 0.0
    * pro : 0.5418567947146942
    * tes : 0.13859015604668976
    * tan : 0.24077516061392953
    * por : 0.10796004886428462
    * falta : -0.630992190433149
    * de : 0.17505328357635386
    * agua : 0.2577032301677877
    * en : -0.013478356890464087
    * acar : -0.32186553527790523
    * igua : -0.1362757950900742
    * </s> : 0.0
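
To see which tokens carry the most weight in a prediction, it can help to rank the scores by magnitude. The sketch below parses lines in the format printed above and sorts them by absolute value. The helper `parse_token_scores` is hypothetical and not part of `c4v`; it only assumes the `* token : score` line format shown in the example output.

```python
# Token scores copied from the `explain` output above.
output = """\
    * <s> : 0.0
    * pro : 0.5418567947146942
    * tes : 0.13859015604668976
    * tan : 0.24077516061392953
    * por : 0.10796004886428462
    * falta : -0.630992190433149
    * de : 0.17505328357635386
    * agua : 0.2577032301677877
    * en : -0.013478356890464087
    * acar : -0.32186553527790523
    * igua : -0.1362757950900742
    * </s> : 0.0
"""


def parse_token_scores(text):
    """Parse '* token : score' lines into (token, score) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("* "):
            continue  # skip anything that is not a score line
        token, _, score = line[2:].rpartition(" : ")
        pairs.append((token, float(score)))
    return pairs


# Rank tokens by the magnitude of their score, strongest first.
ranked = sorted(parse_token_scores(output), key=lambda p: abs(p[1]), reverse=True)
for token, score in ranked[:3]:
    print(f"{token:>6} {score:+.3f}")
```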

Using these commands together with the list command, you can pick news articles with list, inspect them with show, run a classification with classify, and inspect how the model weighs pieces of the text with explain. This gives us a fairly ergonomic workflow for testing a trained model. Note: the previous examples assume that a model named my_experiment was trained at some point in the branch my_branch.
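
When checking several articles, the same two invocations get repeated for each URL. As a rough sketch, a small helper can build them; `build_commands` is a hypothetical name, and only the `show` and `classify` argument order shown in the examples above is assumed.

```python
import shlex


def build_commands(url, branch, experiment):
    """Build the `show` and `classify` invocations for one article URL."""
    model = f"{branch}/{experiment}"
    return [
        ["c4v", "show", url],
        ["c4v", "classify", model, url],
    ]


url = "https://primicia.com.ve/placeres/mick-jagger-y-will-smith-recaudaran-fondos-en-india-con-un-concierto-virtual/"
for cmd in build_commands(url, "my_branch", "my_experiment"):
    # Print shell-ready commands; pass `cmd` to subprocess.run to execute them.
    print(shlex.join(cmd))
```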

Additional Changes

Classifier API

Additionally, I made major changes to the classifier API, which was too monolithic:

Created ExperimentFSManager class: it controls branching and experiment naming, so the classifier does not need to care about them as long as it has a folder configured for its files.
Created BaseExperiment, BaseExperimentArguments and BaseExperimentSummary classes: they control how experiments are run and logged. This way every experiment is invoked in more or less the same way, so you won't have to write much ad-hoc code for each one. The BaseExperiment class uses an instance of ExperimentFSManager to write experiment files.
Created ExperimentManager
Created Classifier class: this class is a rewrite of the ClassifierExperiment class with the experiment file system logic extracted. Also, a few of its constructor arguments were moved to run_training.
Implemented ClassifierExperiment, ClassifierArguments and ClassifierSummary classes to provide a classifier experiment.

Example experiment with the new API:
```
from c4v.classifier.classifier_experiment import ClassifierExperiment, ClassifierArgs

# Training settings, the columns to use, and a description of the experiment
args = ClassifierArgs(
    {
        "per_device_train_batch_size": 10,
        "per_device_eval_batch_size": 1,
        "num_train_epochs": 1,
        "warmup_steps": 10,
        "load_best_model_at_end": True,
        "metric_for_best_model": "f1",
        "save_strategy": "epoch",
        "evaluation_strategy": "epoch",
        "eval_accumulation_steps": 1,
        "learning_rate": 5e-7,
    },
    columns=["title"],
    description="Testing my new API",
)

# Experiment "test" in branch "new_api"
exp = ClassifierExperiment.from_branch_and_experiment("new_api", "test")

exp.run_experiment(args)
```
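
The separation described in this section can be illustrated with a minimal sketch. It is not the code in this PR: the class names `FSManagerSketch` and `ExperimentSketch`, their methods, and the `<base>/<branch>/<experiment>` folder layout are all assumptions made for illustration. It only shows the idea that one object owns branch and experiment naming on disk, while the experiment base class handles running and logging in a uniform way.

```python
import json
import tempfile
from pathlib import Path


class FSManagerSketch:
    """Hypothetical stand-in: owns branch/experiment naming on disk."""

    def __init__(self, base_dir, branch, experiment):
        # ASSUMPTION: files live under <base_dir>/<branch>/<experiment>.
        self.folder = Path(base_dir) / branch / experiment
        self.folder.mkdir(parents=True, exist_ok=True)

    def write_summary(self, summary):
        path = self.folder / "summary.json"
        path.write_text(json.dumps(summary, indent=2))
        return path


class ExperimentSketch:
    """Hypothetical base class: every experiment runs and logs the same way."""

    def __init__(self, fs_manager):
        self.fs_manager = fs_manager

    def experiment_to_run(self, args):
        raise NotImplementedError  # subclasses put the real work here

    def run_experiment(self, args):
        summary = self.experiment_to_run(args)
        return self.fs_manager.write_summary(summary)


class DummyExperiment(ExperimentSketch):
    def experiment_to_run(self, args):
        # A real experiment would train and evaluate a model here.
        return {"description": args["description"], "f1": None}


with tempfile.TemporaryDirectory() as base:
    exp = DummyExperiment(FSManagerSketch(base, "new_api", "test"))
    written = exp.run_experiment({"description": "Testing my new API"})
    print(written.relative_to(base))
```

With this shape, a new kind of experiment only has to override one method, and it never touches paths directly.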


## Microscope Manager
`Manager` is a high-level API that automates common operations with our library, so you can easily compose scrapers, crawlers and the classifier.

## CLIClient
A simple object used by the CLI commands to encapsulate common logic and operations, so that the command-level logic in the CLI stays thin.

# Relevant files:
* `src/c4v/c4v_cli.py`   
    * Added `show`, `classify` and `explain` commands
    * Added `CLIClient` class
* `src/c4v/microscope/manager.py` : High level manager class composing every component
* `src/c4v/classifier/classifier.py` :     
    * Refactored to remove experiment-management logic from the `ClassifierExperiment` class
    * Replaced the `ClassifierExperiment` class with the simpler `Classifier` class
* `src/c4v/classifier/experiment.py`:     
    * File created
    * Added classes for experiments: `BaseExperiment`, `BaseExperimentArguments`, `BaseExperimentSummary`
    * Added class for experiment's files management: `ExperimentFSManager`
* `src/c4v/classifier/classifier_experiment` : implements the three base classes mentioned above as `ClassifierExperiment`, `ClassifierArguments` and `ClassifierSummary`

# Additional comments
I know this is a large PR. I wanted to have a full testing suite in order to easily perform experiments. I also wanted to do some architecture work to make the code as easily extensible as possible, so I did a lot of refactoring and abstraction for both existing code and new code.

code-for-venezuela / c4v-py

Luis/classifier command #89
