Open HikkaV opened 2 years ago
Types of data:
1) News (short texts): parse The New York Times.
2) Travel: https://www.tripadvisor.ru/; https://www.booking.com/index.uk.html?aid=376445;label=bdot-KHnYBqoTN21xDbVSl4NA3QS502774686649:pl:ta:p1:p22,563,000:ac:ap:neg:fi:tikwd-334108349:lp1012835:li:dec:dm:ppccp=UmFuZG9tSVYkc2RlIyh9YTQUGSsRwx9_piJbnTYecvA;ws=&gclid=Cj0KCQjw94WZBhDtARIsAKxWG--XFVWuzrHqN8iOn9-jxKu4OnTUmwPL8bGqIs1BMu-3SCJoLCwWxV0aAhh4EALw_wcB; Airbnb.
3) Social networks: https://github.com/alexdrk14/RussoUkrainianWar_Dataset; https://github.com/alexdrk14/RussiaUkraineWar/blob/main/analysis/sentiment.py (sample); https://alt.qcri.org/semeval2017/task4/?id=download-the-full-training-data-for-semeval-2017-task-4 (translate); https://github.com/asivokon/awesome-ukrainian-nlp; https://github.com/saganoren/ukr-twi-corpus.
4) Product reviews: https://github.com/Russkiy-Voyennyy-Korabl-Idi-Nakhuy/sensus/tree/main/data; Rozetka; Metro; https://data.world/datasets/reviews.
Steps to do:
1) Parsing: Rozetka, Metro, Airbnb, Tripadvisor, Booking, https://price.ua/ua.
2) Analysis of the parsed data and data standardization.
3) Enrichment of the data (adding other labeled data in Ukrainian).
4) Resolve the server question.
Note: For each dataset and each star rating, randomly sample 300 reviews, translate them into English, and predict class probabilities. Use charts of the predicted probabilities to choose thresholds for the classes.
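A minimal sketch of the sampling and charting step, assuming each review is a dict and that reviews are grouped by dataset and star rating. The field names `dataset` and `stars`, and both helper names, are hypothetical and not taken from any of the linked repositories.

```python
import random
from collections import defaultdict


def sample_per_group(reviews, per_group=300, seed=0):
    """Randomly sample up to `per_group` reviews for each (dataset, stars) pair.

    `reviews` is a list of dicts; the keys "dataset" and "stars" are assumed.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for review in reviews:
        groups[(review["dataset"], review["stars"])].append(review)

    sampled = {}
    for key, items in groups.items():
        # Groups smaller than `per_group` are kept whole.
        sampled[key] = rng.sample(items, min(per_group, len(items)))
    return sampled


def probability_histogram(probabilities, bins=10):
    """Count probabilities per equal-width bin on [0, 1], ready for a bar chart."""
    counts = [0] * bins
    for p in probabilities:
        index = min(int(p * bins), bins - 1)  # p == 1.0 falls into the last bin
        counts[index] += 1
    return counts
```

The histogram counts can be plotted per class with any charting library; the thresholds themselves are still chosen by eye from those charts.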
Note: 1) for fast translation: https://huggingface.co/Helsinki-NLP/opus-mt-ru-uk 2) for spell checking: https://github.com/filyp/autocorrect
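A rough sketch of batch translation with the model linked above, assuming it is loaded through the Hugging Face `transformers` translation pipeline. The helper names (`batched`, `translate_all`, `make_opus_mt_translator`) are hypothetical; the model is only downloaded when `make_opus_mt_translator` is called.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_all(texts, translate_batch, batch_size=16):
    """Translate `texts` batch by batch using the callable `translate_batch`."""
    translated = []
    for batch in batched(texts, batch_size):
        translated.extend(translate_batch(batch))
    return translated


def make_opus_mt_translator(model_name="Helsinki-NLP/opus-mt-ru-uk"):
    """Build a batch translator backed by a transformers translation pipeline."""
    from transformers import pipeline  # imported lazily: heavy optional dependency

    translator = pipeline("translation", model=model_name)

    def translate_batch(batch):
        # The pipeline returns one dict per input with a "translation_text" key.
        return [result["translation_text"] for result in translator(batch)]

    return translate_batch
```

Usage would look like `translate_all(russian_reviews, make_opus_mt_translator())`. Very long reviews may exceed the model's input length, so splitting them into sentences first is probably safer.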
1) Translate datasets from Russian and English into Ukrainian.
2) Clean data.
3) Add features to data (tokenized text, lemmatized text, stemmed text).
4) Normalize and merge collected data.
Note: Use a custom tokenizer (an nltk regex tokenizer that excludes symbols such as - and ') or a Ukrainian-specific tokenizer.
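One possible reading of the tokenizer note, sketched with the standard `re` module: hyphens and apostrophes inside a word do not split it, so forms like "м'ясо" and "по-київськи" stay whole, while standalone punctuation is dropped. This interpretation and the pattern are assumptions; the same pattern could presumably be passed to nltk's `RegexpTokenizer` instead.

```python
import re

# A word is a run of letters/digits, optionally joined by a hyphen or an
# apostrophe (ASCII ', typographic ’ or modifier letter ʼ) to further runs.
TOKEN_RE = re.compile(r"\w+(?:[-'’ʼ]\w+)*")


def tokenize(text):
    """Split text into word tokens, keeping inner hyphens and apostrophes."""
    return TOKEN_RE.findall(text)


print(tokenize("Це м'ясо по-київськи, смачно!"))
```

Lowercasing, lemmatization and stemming are left out here; they would be applied to the token list as separate feature columns.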