Problem
Right now, we only have a dataset with positive labels. To train our model, we also need data with negative labels (articles that are not about a public service problem).
Solution
We already have a scraper for multiple sites, so we can use some heuristics to find articles that we are confident are not about public service problems. I implemented a whitelisting system for URL patterns in the crawler class; this is useful because Primicia articles are partitioned by subject through the URL path. After writing this feature, I wrote a few scripts to scrape and clean articles from subjects we know have nothing to do with public services (see the sketch after the subject list).
Selected subjects:
mundo (world)
deportes (sports)
placeres (lifestyle/leisure)
sucesos (crime and incidents)
politica (politics)
virales (viral content)
ciencia y tecnología (science and technology)
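To make the idea concrete, here is a minimal sketch of what the URL-pattern whitelist could look like. The names (`SUBJECT_WHITELIST`, `should_crawl`) and the exact path slugs are illustrative assumptions, not the actual API in base_crawler.py:

```python
from urllib.parse import urlparse

# Illustrative sketch only: the names below are hypothetical, not the real
# base_crawler.py API. The idea is to keep only URLs whose path starts with
# one of the whitelisted subject segments.
SUBJECT_WHITELIST = [
    "/mundo/",
    "/deportes/",
    "/placeres/",
    "/sucesos/",
    "/politica/",
    "/virales/",
    "/ciencia-y-tecnologia/",  # assumed slug for "ciencia y tecnología"
]

def should_crawl(url: str, whitelist=SUBJECT_WHITELIST) -> bool:
    """Return True if the URL path falls under a whitelisted subject."""
    path = urlparse(url).path
    return any(path.startswith(prefix) for prefix in whitelist)

# Primicia partitions articles by subject in the URL path, so a sports
# article is accepted and everything outside the whitelist is skipped.
assert should_crawl("https://primicia.com.ve/deportes/some-article/")
assert not should_crawl("https://primicia.com.ve/servicios/no-hay-agua/")
```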
Relevant Files:
src/c4v/scraper/crawlers/base_crawler.py = added the URL-pattern whitelisting system
src/classifier/data_gathering = added this folder and several scripts in it for the scraping tasks; we could consider turning them into command-line commands if we end up using them often
Further work
The scraping & data-cleaning scripts might be useful as CLI commands; we could add them to the CLI tool (a rough sketch follows).
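As a starting point, here is a hedged, argparse-based sketch of what a data-gathering subcommand could look like. The actual CLI tool may use a different framework, and `scrape_subjects` / `clean_articles` are placeholders standing in for the scripts under src/classifier/data_gathering, not existing functions:

```python
import argparse

# Hypothetical sketch of exposing the data-gathering scripts as a CLI command.
def scrape_subjects(subjects):
    print(f"scraping subjects: {subjects}")  # placeholder for the scraping script

def clean_articles(output_csv):
    print(f"writing cleaned dataset to {output_csv}")  # placeholder for the cleaning script

def main():
    parser = argparse.ArgumentParser(description="Gather negative-label articles")
    parser.add_argument("--subjects", nargs="+", default=["deportes", "mundo"],
                        help="URL subjects assumed unrelated to public services")
    parser.add_argument("--output", default="negative_articles.csv",
                        help="where to write the cleaned dataset")
    args = parser.parse_args()
    scrape_subjects(args.subjects)
    clean_articles(args.output)

if __name__ == "__main__":
    main()
```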
Additional comments
I ran a training experiment to see if the classifier improves its accuracy; the results:
Date: 2021-08-31 20:10:06.272265+0000
Description: Training with new dataset of +10k rows, using binary classification, trying to tell if an article is irrelevant or not
EVAL METRICS:
* eval_loss = 0.04423031955957413
* eval_accuracy = 0.9932218707636692
* eval_precision = 1.0
* eval_recall = 0.9226804123711341
* eval_f1 = 0.9597855227882037
* eval_runtime = 112.417
* eval_samples_per_second = 19.686
* eval_steps_per_second = 19.686
* epoch = 3.0
USER ARGUMENTS:
Columns:
* content
Test Dataset: training_dataset.csv
Training Arguments:
* per_device_train_batch_size = 3
* per_device_eval_batch_size = 1
* num_train_epochs = 3
* warmup_steps = 10
* load_best_model_at_end = True
* save_strategy = epoch
* evaluation_strategy = epoch
* eval_accumulation_steps = 1
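For reference, here is a hedged sketch of how these arguments map onto a Hugging Face `TrainingArguments` object, and how the reported accuracy/precision/recall/F1 could be computed in a `compute_metrics` callback. The output path and the overall wiring are placeholders, not the project's actual training code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import TrainingArguments

# Sketch only: mirrors the arguments listed above; the project's experiment
# runner may wire things differently.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

training_args = TrainingArguments(
    output_dir="./results",          # placeholder output path
    per_device_train_batch_size=3,
    per_device_eval_batch_size=1,
    num_train_epochs=3,
    warmup_steps=10,
    load_best_model_at_end=True,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    eval_accumulation_steps=1,
)

# These would then be passed to transformers.Trainer together with the
# fine-tuned model and the tokenized train/eval datasets.
```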
But when I test it by hand, I don't get consistent results; it even feels a bit random 🤔
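One thing worth double-checking when testing by hand (this is an assumption about the cause, not something I've confirmed): make sure the model is in eval mode and the input goes through the same tokenization as training, otherwise dropout and preprocessing differences can make single predictions look random. A minimal sketch, assuming a Transformers sequence-classification checkpoint saved at a placeholder path:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "./results"  # placeholder: wherever the fine-tuned checkpoint was saved
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()  # disable dropout so repeated runs give the same logits

# Example article headline about a service problem ("three days without water")
text = "Vecinos denuncian que llevan tres días sin servicio de agua"
inputs = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities, same on every run
```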