[SA] data preparation - Githubissues

HikkaV / Ukrainian-Reviews-Estimation

Cross-domain automatic review score estimation and key phrase retrieval for Ukrainian.

[SA] data preparation #3

Open HikkaV opened 2 years ago

HikkaV commented 2 years ago

1) Translate datasets from Russian and English into Ukrainian.
2) Clean the data.
3) Add features to the data (tokenized text, lemmatized text, stemmed text).
4) Normalize and merge the collected data.

Note: Use a custom tokenizer (an NLTK regex tokenizer that excludes symbols such as - and ') or a Ukrainian-specific tokenizer.
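A minimal sketch of the custom tokenizer from the note above. It assumes that "excludes" means hyphens and apostrophes are not treated as split points, so words such as м'ясо and по-київськи stay whole. It uses Python's standard `re` module; because the pattern has no capturing groups, the same pattern string should also work with NLTK's `RegexpTokenizer`. The names `TOKEN_PATTERN` and `tokenize` are illustrative and not taken from the repository.

```python
import re

# Runs of letters/digits, optionally joined by a hyphen or an apostrophe.
# Assumption: ', ’ and ʼ are all accepted as the Ukrainian apostrophe.
TOKEN_PATTERN = r"[^\W_]+(?:[-'’ʼ][^\W_]+)*"
TOKEN_RE = re.compile(TOKEN_PATTERN)


def tokenize(text: str) -> list:
    """Return word tokens, keeping inner hyphens and apostrophes."""
    return TOKEN_RE.findall(text)


print(tokenize("Це м'ясо по-київськи, супер!"))
```

Punctuation that is not inside a word (commas, exclamation marks, standalone dashes) is dropped rather than returned as a token.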

HikkaV commented 2 years ago

Types of data:
1) News (not very long texts): parse The New York Times.
2) Travel: https://www.tripadvisor.ru/ https://www.booking.com/index.uk.html?aid=376445;label=bdot-KHnYBqoTN21xDbVSl4NA3QS502774686649:pl:ta:p1:p22,563,000:ac:ap:neg:fi:tikwd-334108349:lp1012835:li:dec:dm:ppccp=UmFuZG9tSVYkc2RlIyh9YTQUGSsRwx9_piJbnTYecvA;ws=&gclid=Cj0KCQjw94WZBhDtARIsAKxWG--XFVWuzrHqN8iOn9-jxKu4OnTUmwPL8bGqIs1BMu-3SCJoLCwWxV0aAhh4EALw_wcB Airbnb.
3) Social networks: https://github.com/alexdrk14/RussoUkrainianWar_Dataset and https://github.com/alexdrk14/RussiaUkraineWar/blob/main/analysis/sentiment.py (sample); https://alt.qcri.org/semeval2017/task4/?id=download-the-full-training-data-for-semeval-2017-task-4 (translate); https://github.com/asivokon/awesome-ukrainian-nlp; https://github.com/saganoren/ukr-twi-corpus.
4) Product reviews: https://github.com/Russkiy-Voyennyy-Korabl-Idi-Nakhuy/sensus/tree/main/data Rozetka; Metro; https://data.world/datasets/reviews

HikkaV commented 2 years ago

Steps to do:
1) Parsing. Parse: Rozetka, Metro, Airbnb, TripAdvisor, Booking, https://price.ua/ua.
2) Analysis of the parsed data, data standardization.
3) Enrichment of the data (adding other labeled data in Ukrainian).
4) Resolve the server question.
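One way the standardization step above could look is sketched below: records parsed from different sites are mapped to a shared schema, with the rating rescaled to [0, 1]. Everything specific here is an assumption for illustration: the field names (`source`, `text`, `rating`), the `standardize` helper, and the rating ranges in `RATING_SCALES`, which would need to be checked against the data actually parsed.

```python
# Hypothetical rating ranges per source; verify against the parsed data.
RATING_SCALES = {
    "rozetka": (1, 5),
    "tripadvisor": (1, 5),
    "booking": (1, 10),
}


def standardize(record: dict) -> dict:
    """Map a parsed record to a shared schema with a rating in [0, 1]."""
    low, high = RATING_SCALES[record["source"]]
    rating = (float(record["rating"]) - low) / (high - low)
    return {
        "source": record["source"],
        "text": " ".join(record["text"].split()),  # collapse whitespace
        "rating": round(rating, 3),
    }


parsed = [
    {"source": "rozetka", "text": "Чудовий  товар", "rating": 5},
    {"source": "booking", "text": "Нормальний готель\n", "rating": 5.5},
]
merged = [standardize(r) for r in parsed]
print(merged)
```

With a shared rating scale, reviews from all sources can be merged into one dataset before thresholds for the classes are chosen.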

HikkaV commented 2 years ago

Note: For each dataset and each star rating, randomly sample 300 reviews, translate them into English, and predict probabilities. Use charts of the predicted probabilities to choose thresholds for the classes.
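The sampling step in the note above, read as 300 random reviews per dataset and per star rating, could be sketched as follows. The record fields (`dataset`, `stars`, `text`) and the `sample_per_group` helper are hypothetical. Translation and probability prediction are left out, since they depend on the chosen models.

```python
import random
from collections import defaultdict


def sample_per_group(reviews, n=300, seed=0):
    """Randomly pick up to n reviews for each (dataset, stars) pair."""
    groups = defaultdict(list)
    for review in reviews:
        groups[(review["dataset"], review["stars"])].append(review)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for key in sorted(groups):
        items = groups[key]
        # Groups with fewer than n reviews are taken in full.
        sample.extend(rng.sample(items, min(n, len(items))))
    return sample


# Toy data: 2 datasets x 5 star ratings x 400 reviews each.
reviews = [
    {"dataset": d, "stars": s, "text": f"review {i}"}
    for d in ("rozetka", "booking")
    for s in range(1, 6)
    for i in range(400)
]
subset = sample_per_group(reviews)
print(len(subset))
```

Sampling within each group, rather than over the whole dataset, keeps rare star ratings represented when the probability charts are drawn.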

HikkaV commented 2 years ago

Note:
1) For fast translation: https://huggingface.co/Helsinki-NLP/opus-mt-ru-uk
2) For spell checking: https://github.com/filyp/autocorrect