code-for-venezuela / c4v-py


Luis/non relevant data gathering #90

Closed LDiazN closed 3 years ago

LDiazN commented 3 years ago


Right now, we only have the positive labels dataset. In order to train our model, we also need data with negative labels (not a service problem).


We already have a scraper for multiple sites, so we can use some heuristics to pick articles that we are sure are not about public service problems. To do that, I implemented a whitelisting system for URL patterns in the crawler class. This is useful because Primicia articles are partitioned by subject through the URL path. After writing this feature, I wrote a few scripts to scrape and clean articles from subjects that we know have nothing to do with public services.
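
To illustrate the idea, here is a minimal sketch of URL-pattern whitelisting. It is not the crawler code from this PR: the class name `UrlWhitelist`, its methods, and the example URLs and subject paths are all hypothetical, chosen only to show how a crawler could keep links whose path matches a known subject.

```python
import re
from typing import Iterable, List, Pattern


class UrlWhitelist:
    """Hypothetical helper: keeps only URLs that match at least one pattern."""

    def __init__(self, patterns: Iterable[str]) -> None:
        # Compile once so every URL check reuses the same regex objects.
        self._patterns: List[Pattern[str]] = [re.compile(p) for p in patterns]

    def allows(self, url: str) -> bool:
        # A URL passes if any whitelisted pattern is found in it.
        return any(p.search(url) for p in self._patterns)

    def filter(self, urls: Iterable[str]) -> List[str]:
        # Preserve input order, dropping URLs that match no pattern.
        return [u for u in urls if self.allows(u)]


# Example subject paths (assumed, not taken from the real site layout).
whitelist = UrlWhitelist([r"/deportes/", r"/farandula/"])

candidates = [
    "https://example.com/deportes/some-article",
    "https://example.com/servicios/another-article",
    "https://example.com/farandula/third-article",
]
kept = whitelist.filter(candidates)
print(kept)
```

Matching on the path keeps the heuristic cheap: the crawler can discard a link before downloading the page, which matters when scraping thousands of articles.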
Selected subjects:

Relevant Files:

Further work

Additional comments

I ran a training experiment to check whether the new data improves the classifier's accuracy. Results:

Date: 2021-08-31 20:10:06.272265+0000
Description: Training with a new dataset of +10k rows, using binary classification, trying to tell if an article is irrelevant or not
        * eval_loss = 0.04423031955957413
        * eval_accuracy = 0.9932218707636692
        * eval_precision = 1.0
        * eval_recall = 0.9226804123711341
        * eval_f1 = 0.9597855227882037
        * eval_runtime = 112.417
        * eval_samples_per_second = 19.686
        * eval_steps_per_second = 19.686
        * epoch = 3.0
                * content
        Test Dataset: training_dataset.csv
        Training Arguments:
                * per_device_train_batch_size = 3
                * per_device_eval_batch_size = 1
                * num_train_epochs = 3
                * warmup_steps = 10
                * load_best_model_at_end = True
                * save_strategy = epoch
                * evaluation_strategy = epoch
                * eval_accumulation_steps = 1

But when I test it by hand, the results are not consistent, they even look a bit random 🤔
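
As a sanity check on the report above, F1 should be the harmonic mean of precision and recall, and the runtime fields imply the approximate size of the evaluation set. A minimal sketch, with the values copied from the log; the variable names are only illustrative:

```python
# Values copied from the evaluation report.
eval_precision = 1.0
eval_recall = 0.9226804123711341
reported_f1 = 0.9597855227882037
eval_runtime = 112.417             # seconds
eval_samples_per_second = 19.686

# F1 is the harmonic mean of precision and recall.
f1 = 2 * eval_precision * eval_recall / (eval_precision + eval_recall)

# Rows evaluated is roughly runtime times throughput (an estimate, since
# both fields are rounded in the log).
approx_eval_rows = round(eval_runtime * eval_samples_per_second)

print(f"recomputed F1:    {f1:.6f}")
print(f"reported F1:      {reported_f1:.6f}")
print(f"approx eval rows: {approx_eval_rows}")
```

A precision of 1.0 with a recall near 0.92 means the model produced no false positives on this evaluation set but missed about 8% of the positive rows.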