Open santarabantoosoo opened 2 years ago
Suggestion: would it be possible to match an approximate regular expression? E.g., in tweets like "I have covid 3 times", can we check for "covid # times"?
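A minimal sketch of what such an approximate pattern could look like, assuming "#" stands for either digits or a small number word. The names `COUNT_WORDS`, `COVID_TIMES` and `mentions_covid_count` are hypothetical, and the list of number words is only illustrative:

```python
import re

# Illustrative, not exhaustive: number words accepted in place of digits.
COUNT_WORDS = r"one|two|three|four|five"

# "covid" (optionally "covid-19"), then a count, then "time" or "times".
COVID_TIMES = re.compile(
    rf"\bcovid(?:-19)?\s+(?:\d+|{COUNT_WORDS})\s+times?\b",
    flags=re.IGNORECASE,
)


def mentions_covid_count(text: str) -> bool:
    """Return True if the text contains a 'covid # times' style phrase."""
    return COVID_TIMES.search(text) is not None
```

For example, `mentions_covid_count("I have covid 3 times")` is true, while a tweet that only says "covid is over" does not match. Looser variants (words in between, other languages) would need a broader pattern.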
To increase the label precision, we may need someone to label and another to review.
I believe this is not possible given the number of collaborators. Thus, I suggest using the automatically labeled tweets in the dataset that has XLM-T labels. We can label some of these tweets manually and then compare our labels to the automatic ones.
Benefits: 1- a double check (precision) 2- the automatic labels may turn out to be close to perfect, i.e. close to manual labeling. In that case we could rely on them, skip the manual sentiment labeling, and work only on the classification label.
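To make the comparison concrete, here is a rough sketch of how manual and automatic labels could be compared, using raw agreement and Cohen's kappa. The function names are hypothetical, and the label strings in the usage note ("positive", "neutral", "negative") are an assumption about what the automatic labels look like:

```python
from collections import Counter


def agreement(manual, automatic):
    """Fraction of tweets where the manual and automatic labels coincide."""
    if not manual or len(manual) != len(automatic):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(m == a for m, a in zip(manual, automatic)) / len(manual)


def cohen_kappa(manual, automatic):
    """Agreement corrected for the agreement expected by chance."""
    n = len(manual)
    p_observed = agreement(manual, automatic)
    manual_counts = Counter(manual)
    auto_counts = Counter(automatic)
    labels = set(manual_counts) | set(auto_counts)
    # Chance agreement: product of the two marginal label frequencies.
    p_expected = sum(manual_counts[l] * auto_counts[l] for l in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # both sides used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

With `manual = ["positive", "negative", "neutral", "negative"]` and `automatic = ["positive", "negative", "negative", "negative"]`, `agreement` gives 0.75. If the agreement on a labeled subset is high, that would support relying on the automatic sentiment label.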
@now-youre-gittin-it, yes, that makes sense and it is feasible.
Here are a few Italian keywords we could use for filtering 👍
import re
keywords = ['Long Covid', 'post CoViD', 'sindrome', 'post-covid', 'cardio', 'vascolari', 'lungo termine', '#LongCovid', 'cronic', 'stanchezza']
# Escape each keyword and match case-insensitively
regex = re.compile('|'.join(re.escape(k) for k in keywords), flags=re.IGNORECASE)
This issue is for discussion before labeling
I think that, because the Long Covid vs. not Long Covid classes are so unbalanced, random sampling may leave us with very few tweets labeled as Long Covid. Thus, we may opt for another methodology for selecting tweets to label: instead of random sampling, we can label tweets after filtering with some keywords.
Long COVID | Post-acute Sequelae of COVID-19 | PASC | COVID recovery | Post-COVID-19 Syndrome | Post Acute COVID | Long Hauler
500 with these keywords, 500 without
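A rough sketch of that selection step, assuming the tweets are available as a plain list of strings. The function name `sample_for_labeling`, the `seed` parameter and the whole-word matching are my own assumptions, not something we have agreed on:

```python
import random
import re

KEYWORDS = [
    "Long COVID", "Post-acute Sequelae of COVID-19", "PASC", "COVID recovery",
    "Post-COVID-19 Syndrome", "Post Acute COVID", "Long Hauler",
]

# Whole-word, case-insensitive match, so "PASC" does not fire on "pascal".
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def sample_for_labeling(tweets, n_per_group=500, seed=0):
    """Return (with_keywords, without_keywords) samples for manual labeling."""
    with_kw = [t for t in tweets if KEYWORD_RE.search(t)]
    without_kw = [t for t in tweets if not KEYWORD_RE.search(t)]
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    # If a group has fewer than n_per_group tweets, take all of them.
    picked_with = rng.sample(with_kw, min(n_per_group, len(with_kw)))
    picked_without = rng.sample(without_kw, min(n_per_group, len(without_kw)))
    return picked_with, picked_without
```

Calling `sample_for_labeling(tweets)` would give the two groups of up to 500 tweets each. The keyword group is expected to contain most of the positive examples, and the other group keeps some unfiltered tweets in the labeled set.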
Anything related to Long Covid would be labeled as positive, whether it is news, patients' suffering, stories, etc.
keywords taken from here
However, I am not sure whether these keywords would work for Italian tweets, or whether we should replace them with Italian keywords.
@elena-andreini @EliGambicchia I couldn't assign multiple collaborators. Thus, I am mentioning you here instead of assigning a task