NyanNyanovich / nyan

Automatic news aggregator in Telegram
https://t.me/nyannews
Apache License 2.0
184 stars 27 forks

How to add new category? #19

Open b0tm1nd opened 8 months ago

b0tm1nd commented 8 months ago

What is the procedure for adding a new category?

b0tm1nd commented 8 months ago

From what I understand, a new category needs a new dataset in .jsonl format with text and labels. Could you share the datasets the model was trained on, especially for not_news? From reading about the Telegram contest, I see that for Russian content they mostly used the lenta.ru archive. But what about Ukrainian?

NyanNyanovich commented 8 months ago

Here you go: https://github.com/NyanNyanovich/nyan/releases/download/can_annot/cat_markup.tar.gz

I used Lenta and GPT-4 annotations. Here is the script to query GPT-4: https://github.com/NyanNyanovich/nyan/blob/master/scripts/annotate_categories.py

And the training script: https://github.com/NyanNyanovich/nyan/blob/master/scripts/train_clf.py

b0tm1nd commented 8 months ago

@NyanNyanovich Thanks. I had already found train_clf.py and tried training it with a single category, but then the classifier failed in send.sh, probably because "not_news" was missing.

I took a dataset from a Ukrainian news website that tags its articles, kept only the entries related to corruption, and got about 700 entries, which I merged with categories_train.jsonl.
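The merge step described above could look roughly like the following. This is a minimal sketch, not the repository's own tooling: the `text` and `labels` key names, the helper names, and the miniature in-memory records are all assumptions made for illustration, and whether train_clf.py needs the data shuffled beforehand is not confirmed here.

```python
import json
import random
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, records):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def merge_datasets(base, extra, seed=0):
    """Concatenate two record lists, drop exact duplicate texts, and shuffle."""
    seen = set()
    merged = []
    for record in base + extra:
        if record["text"] not in seen:
            seen.add(record["text"])
            merged.append(record)
    random.Random(seed).shuffle(merged)
    return merged


# Hypothetical miniature datasets standing in for the real files.
base = [
    {"text": "Parliament passed the budget.", "labels": ["politics"]},
    {"text": "Buy our product today!", "labels": ["not_news"]},
]
extra = [
    {"text": "Official detained over bribery.", "labels": ["corruption"]},
    {"text": "Parliament passed the budget.", "labels": ["politics"]},  # duplicate
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "merged.jsonl"
    write_jsonl(out_path, merge_datasets(base, extra))
    merged = read_jsonl(out_path)

print(len(merged), sorted({label for r in merged for label in r["labels"]}))
```

Keeping the existing categories (including not_news) in the merged file matters here, since training on a single category is what broke the classifier earlier.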

After training, I started getting much worse results: many war/politics posts now trigger corruption and end up as "unknown". I found that the median text length in the added dataset is 1000+ characters, whereas in yours it is about 450.
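The length mismatch above can be measured, and optionally evened out, with a few lines of Python. This is a minimal sketch under assumptions: records are dicts with a `text` key, the helper names are hypothetical, and the sample records are synthetic. Truncating the longer texts is only one thing to try; nothing in this thread confirms that it restores accuracy.

```python
from statistics import median


def median_text_length(records, text_key="text"):
    """Median character length of the text field across records."""
    return median(len(r[text_key]) for r in records)


def truncate_texts(records, max_chars, text_key="text"):
    """Return copies of the records with the text cut to max_chars characters."""
    return [{**r, text_key: r[text_key][:max_chars]} for r in records]


# Synthetic stand-ins: short texts for the original data, long ones for the added data.
base = [
    {"text": "a" * 400, "labels": ["war"]},
    {"text": "b" * 500, "labels": ["politics"]},
]
added = [
    {"text": "c" * 1200, "labels": ["corruption"]},
    {"text": "d" * 900, "labels": ["corruption"]},
]

base_median = median_text_length(base)
added_median = median_text_length(added)

# Cut the added texts down to the median length of the original dataset.
added_trimmed = truncate_texts(added, int(base_median))
trimmed_median = median_text_length(added_trimmed)

print(base_median, added_median, trimmed_median)
```

If length is a factor, the new category may be learned partly as "long text" rather than by topic, which would be consistent with longer war/politics posts drifting toward corruption.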

So I have a few questions about preparing a dataset for the new category:

  1. Does a smaller article size improve accuracy?
  2. Will multiple labels for the new category (like ["corruption", "war"] or ["corruption", "politics"]) increase accuracy?
  3. What was your strategy (or was it random?) for selecting news for your training dataset? Labels sorted by count:

     - politics: 1200 occurrences
     - war: 1062 occurrences
     - economy: 760 occurrences
     - incident: 699 occurrences
     - not_news: 451 occurrences
     - entertainment: 426 occurrences
     - tech: 418 occurrences
     - sports: 324 occurrences
     - science: 138 occurrences
     - other: 37 occurrences

  4. What other hints would you suggest?
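Label counts like the ones listed above can be produced with `collections.Counter`. This is a minimal sketch, assuming each record carries a `labels` list (the key name, the helper name, and the sample records are assumptions for illustration); a record with several labels contributes one count to each of them.

```python
from collections import Counter


def count_labels(records, labels_key="labels"):
    """Count how often each label occurs across all records."""
    counts = Counter()
    for record in records:
        counts.update(record[labels_key])
    return counts


# Synthetic records; the second one is multi-label.
records = [
    {"text": "Official detained over bribery.", "labels": ["corruption"]},
    {"text": "Minister accused of embezzling army funds.", "labels": ["corruption", "war"]},
    {"text": "Front line update.", "labels": ["war"]},
    {"text": "New phone released.", "labels": ["tech"]},
    {"text": "Shelling reported overnight.", "labels": ["war"]},
]

label_counts = count_labels(records)

# Print labels from most to least frequent.
for label, count in label_counts.most_common():
    print(f"{label}: {count} occurrences")
```

Running this before and after merging a new category makes it easy to see how the added entries shift the class balance.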