OmdenaAI / trieste-italy-long-covid

GNU General Public License v3.0

choosing our sample for labeling #3

Open santarabantoosoo opened 2 years ago

santarabantoosoo commented 2 years ago

This issue is for discussion before labeling

Because the Long COVID vs. not Long COVID classes are heavily imbalanced, random sampling may leave us with very few tweets labeled Long COVID. We could therefore use a different method for selecting tweets to label: instead of random sampling, we filter tweets by some keywords first.

Long COVID | Post-acute Sequelae of COVID-19 | PASC | COVID recovery | Post-COVID-19 Syndrome | Post Acute COVID | Long Hauler

We would label 500 tweets that contain these keywords and 500 that do not.
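A minimal sketch of this sampling scheme, assuming the tweets are available as a plain list of strings. The function name `sample_for_labeling`, the `seed` parameter, and the case-insensitive substring matching are illustrative assumptions, not something the project has settled on.

```python
import random
import re

# Keywords from the list above, matched literally and case-insensitively.
KEYWORDS = [
    "Long COVID",
    "Post-acute Sequelae of COVID-19",
    "PASC",
    "COVID recovery",
    "Post-COVID-19 Syndrome",
    "Post Acute COVID",
    "Long Hauler",
]
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def sample_for_labeling(tweets, n_per_group=500, seed=0):
    """Return (with_keywords, without_keywords), each holding at most
    n_per_group tweets drawn at random without replacement.

    Hypothetical helper: the name and signature are assumptions.
    """
    matched = [t for t in tweets if KEYWORD_RE.search(t)]
    unmatched = [t for t in tweets if not KEYWORD_RE.search(t)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    with_kw = rng.sample(matched, min(n_per_group, len(matched)))
    without_kw = rng.sample(unmatched, min(n_per_group, len(unmatched)))
    return with_kw, without_kw
```

Note that plain substring matching is loose: a short keyword such as "PASC" would also match inside longer words, so word boundaries may be worth adding.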

Anything related to Long COVID would be labeled as positive, whether it is news, an account of patients' suffering, personal stories, etc.

keywords taken from here

However, I am not sure whether these keywords will work on Italian tweets, or whether we should replace them with Italian keywords.

@elena-andreini @EliGambicchia I couldn't assign multiple collaborators, so I am mentioning you here instead of assigning the task.

now-youre-gittin-it commented 2 years ago

Suggestion: would it be possible to match against an approximate regular expression? E.g., for tweets like "I have covid 3 times", can we check for "covid # times"?
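One way such a pattern could look is sketched below. The helper name `infection_count` is hypothetical, and accepting the Italian "volte" alongside "times" is an assumption made because the tweets are Italian.

```python
import re

# "covid", whitespace, a number, whitespace, then "times" or Italian "volte".
REPEAT_RE = re.compile(r"\bcovid\s+(\d+)\s+(?:times|volte)\b", re.IGNORECASE)


def infection_count(text):
    """Return the number N from a 'covid N times' phrase, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = REPEAT_RE.search(text)
    return int(match.group(1)) if match else None
```

This only covers digits directly after "covid"; spelled-out numbers ("three times") or intervening words would need a looser pattern.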

santarabantoosoo commented 2 years ago

Suggestion

To increase label precision, we would ideally have one person label and another review.

I believe this is not possible given the number of collaborators. Thus, I suggest using the automatically labeled tweets in the dataset that has XLM-T labels. We can manually label some of these tweets and then compare our labels with the automatic ones.

Benefits: 1- A double check (precision). 2- The automatic labels may turn out to be close to manual labeling. If so, we can rely on them, skip the sentiment labeling, and work only on the classification label.
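The comparison between manual and automatic labels could be quantified roughly as follows. This is a sketch that assumes both label sets are aligned lists of the same length; the function name `agreement_and_kappa` is hypothetical, and Cohen's kappa is added here as one common agreement measure, not something the thread prescribes.

```python
from collections import Counter


def agreement_and_kappa(manual, automatic):
    """Return (raw agreement, Cohen's kappa) for two aligned label lists."""
    if len(manual) != len(automatic) or not manual:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(manual)
    # Fraction of tweets on which the two label sources agree.
    observed = sum(m == a for m, a in zip(manual, automatic)) / n
    # Agreement expected by chance, from each source's label frequencies.
    manual_counts = Counter(manual)
    auto_counts = Counter(automatic)
    expected = sum(manual_counts[c] * auto_counts[c] for c in manual_counts) / (n * n)
    if expected == 1:
        # Both sources used a single identical label; kappa is undefined,
        # so perfect agreement is reported by convention (an assumption).
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)
```

High raw agreement with a low kappa would suggest the match is mostly driven by one dominant class, which matters before deciding to rely on the automatic labels.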

elena-andreini commented 2 years ago

@now-youre-gittin-it, yes, that makes sense and it is feasible.

elena-andreini commented 2 years ago

Here are a few Italian keywords we could use for filtering 👍

import re

keywords = ['Long Covid', 'post CoViD', 'sindrome', 'post-covid', 'cardio', 'vascolari', 'lungo termine', '#LongCovid', 'cronic', 'stanchezza']
# Escape each keyword so it is matched literally; matching ignores case.
regex = re.compile('|'.join(re.escape(k) for k in keywords), re.IGNORECASE)