Problem
Right now, we only have a dataset with positive labels. To train our model, we also need data with negative labels (articles that are not about a public service problem).
Solution
We already have a scraper for multiple sites, so we can use some heuristics to find articles that we are confident are not about public service problems. I implemented a whitelisting system for URL patterns in the crawler class; this is useful because Primicia articles are partitioned by subject through the URL path. After writing this feature, I wrote a few scripts to scrape and clean articles from subjects we know have nothing to do with public services (see the sketch after the subject list).
Selected subjects:
mundo (world)
deportes (sports)
placeres (lifestyle/leisure)
sucesos (crime and incidents)
politica (politics)
virales (viral content)
ciencia y tecnología (science and technology)
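To make the idea concrete, here is a minimal sketch of what the URL-pattern whitelist could look like. The names (`SUBJECT_WHITELIST`, `should_crawl`) and the exact path slugs are illustrative assumptions, not the actual API in base_crawler.py:

```python
from urllib.parse import urlparse

# Illustrative sketch only: the names below are hypothetical, not the real
# base_crawler.py API. The idea is to keep only URLs whose path starts with
# one of the whitelisted subject segments.
SUBJECT_WHITELIST = [
    "/mundo/",
    "/deportes/",
    "/placeres/",
    "/sucesos/",
    "/politica/",
    "/virales/",
    "/ciencia-y-tecnologia/",  # assumed slug for "ciencia y tecnología"
]

def should_crawl(url: str, whitelist=SUBJECT_WHITELIST) -> bool:
    """Return True if the URL path falls under a whitelisted subject."""
    path = urlparse(url).path
    return any(path.startswith(prefix) for prefix in whitelist)

# Primicia partitions articles by subject in the URL path, so a sports
# article is accepted and everything outside the whitelist is skipped.
assert should_crawl("https://primicia.com.ve/deportes/some-article/")
assert not should_crawl("https://primicia.com.ve/servicios/no-hay-agua/")
```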
Relevant Files:
src/c4v/scraper/crawlers/base_crawler.py = added the URL-pattern whitelisting system
src/classifier/data_gathering = added this folder and several scripts in it for the scraping tasks; we could consider turning them into command-line commands if we end up using them often
Further work
The scraping & data-cleaning scripts might be useful as CLI commands; we could add them to the CLI tool (a rough sketch follows).
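As a starting point, here is a hedged, argparse-based sketch of what a data-gathering subcommand could look like. The actual CLI tool may use a different framework, and `scrape_subjects` / `clean_articles` are placeholders standing in for the scripts under src/classifier/data_gathering, not existing functions:

```python
import argparse

# Hypothetical sketch of exposing the data-gathering scripts as a CLI command.
def scrape_subjects(subjects):
    print(f"scraping subjects: {subjects}")  # placeholder for the scraping script

def clean_articles(output_csv):
    print(f"writing cleaned dataset to {output_csv}")  # placeholder for the cleaning script

def main():
    parser = argparse.ArgumentParser(description="Gather negative-label articles")
    parser.add_argument("--subjects", nargs="+", default=["deportes", "mundo"],
                        help="URL subjects assumed unrelated to public services")
    parser.add_argument("--output", default="negative_articles.csv",
                        help="where to write the cleaned dataset")
    args = parser.parse_args()
    scrape_subjects(args.subjects)
    clean_articles(args.output)

if __name__ == "__main__":
    main()
```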
Additional comments
I ran a training experiment to see if the classifier improves its accuracy; the results:
Date: 2021-08-31 20:10:06.272265+0000
Description: Training with new dataset of +10k rows, using binary classification, trying to tell if an article is irrelevant or not
EVAL METRICS:
* eval_loss = 0.04423031955957413
* eval_accuracy = 0.9932218707636692
* eval_precision = 1.0
* eval_recall = 0.9226804123711341
* eval_f1 = 0.9597855227882037
* eval_runtime = 112.417
* eval_samples_per_second = 19.686
* eval_steps_per_second = 19.686
* epoch = 3.0
USER ARGUMENTS:
Columns:
* content
Test Dataset: training_dataset.csv
Training Arguments:
* per_device_train_batch_size = 3
* per_device_eval_batch_size = 1
* num_train_epochs = 3
* warmup_steps = 10
* load_best_model_at_end = True
* save_strategy = epoch
* evaluation_strategy = epoch
* eval_accumulation_steps = 1
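For reference, here is a hedged sketch of how these arguments map onto a Hugging Face `TrainingArguments` object, and how the reported accuracy/precision/recall/F1 could be computed in a `compute_metrics` callback. The output path and the overall wiring are placeholders, not the project's actual training code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import TrainingArguments

# Sketch only: mirrors the arguments listed above; the project's experiment
# runner may wire things differently.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

training_args = TrainingArguments(
    output_dir="./results",          # placeholder output path
    per_device_train_batch_size=3,
    per_device_eval_batch_size=1,
    num_train_epochs=3,
    warmup_steps=10,
    load_best_model_at_end=True,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    eval_accumulation_steps=1,
)

# These would then be passed to transformers.Trainer together with the
# fine-tuned model and the tokenized train/eval datasets.
```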
But when I test it by hand, I don't get consistent results; it even feels a bit random 🤔
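One thing worth double-checking when testing by hand (this is an assumption about the cause, not something I've confirmed): make sure the model is in eval mode and the input goes through the same tokenization as training, otherwise dropout and preprocessing differences can make single predictions look random. A minimal sketch, assuming a Transformers sequence-classification checkpoint saved at a placeholder path:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "./results"  # placeholder: wherever the fine-tuned checkpoint was saved
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()  # disable dropout so repeated runs give the same logits

# Example article headline about a service problem ("three days without water")
text = "Vecinos denuncian que llevan tres días sin servicio de agua"
inputs = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities, same on every run
```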