OmdenaAI / trieste-italy-long-covid


Labeling #2

Open santarabantoosoo opened 2 years ago

santarabantoosoo commented 2 years ago

Aim: 1000 labeled tweets

Sentiment analysis labeling:

-1 (negative), 0 (neutral), 1 (positive)

Classification labeling:

1 (Long Covid), 0 (Not Long Covid)

Definition of Long Covid

Post-COVID-19 condition is defined as having (1) any symptoms or (2) at least 1 new or persisting symptom present 28 or more days after the date of the first test/diagnosis of Covid-19.

Example output

| tweet_id | text | sent_label | class_label |
| --- | --- | --- | --- |
| 123 | asd | -1 | 0 |
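
To keep everyone's pilot files consistent, something like the following could check that the labels stay within the allowed values. This is only a sketch: it assumes the labeled output is saved as CSV with the four columns shown above, and the second row (tweet 124) is a made-up example of an invalid label, not real data.

```python
import csv
import io

SENTIMENT_LABELS = {-1, 0, 1}  # negative, neutral, positive
CLASS_LABELS = {0, 1}          # Not Long Covid, Long Covid


def validate_rows(rows):
    """Return (tweet_id, problem) pairs for rows whose labels are invalid."""
    problems = []
    for row in rows:
        tweet_id = row.get("tweet_id")
        try:
            sent = int(row["sent_label"])
            cls = int(row["class_label"])
        except (KeyError, ValueError, TypeError):
            problems.append((tweet_id, "missing or non-integer label"))
            continue
        if sent not in SENTIMENT_LABELS:
            problems.append((tweet_id, f"sent_label {sent} not in -1/0/1"))
        if cls not in CLASS_LABELS:
            problems.append((tweet_id, f"class_label {cls} not in 0/1"))
    return problems


# Hypothetical file contents: the first row is the example from this issue,
# the second row is invented to show what a rejected label looks like.
sample = "tweet_id,text,sent_label,class_label\n123,asd,-1,0\n124,qwe,2,1\n"
rows = list(csv.DictReader(io.StringIO(sample)))
problems = validate_rows(rows)
print(problems)
```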

We will start with a pilot labeling round (50 tweets per person). Afterwards we will discuss the results and check whether we need to modify or expand any definitions.

@elena-andreini @EliGambicchia I couldn't assign multiple collaborators. Thus, I am mentioning you here instead of assigning a task

Claudio1729 commented 2 years ago

Hi everyone, a question about labelling. I took a quick look at the data and, as might be expected, many tweets appear to be news tweets. As partial evidence, I computed the share of tweets containing the substring 'https://t.co' for batches B, D, and F: 65%, 66%, and 62% respectively. Obviously not every tweet with this substring is a news tweet, but news tweets almost always include external links, which contain this substring, so these numbers are informative.
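
For anyone who wants to repeat the check on their own batch, the share can be computed roughly as below. This is a minimal sketch, not the exact script used for the numbers above: it assumes the tweet texts of a batch are already loaded as a list of strings, and the four example texts (including their links) are invented.

```python
LINK_MARKER = "https://t.co"


def link_ratio(texts):
    """Fraction of tweets whose text contains the t.co link marker."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(LINK_MARKER in text for text in texts) / len(texts)


# Invented example texts standing in for one batch.
batch = [
    "New daily report on infections https://t.co/abc",
    "Still exhausted months after my infection",
    "Study on long covid symptoms https://t.co/def",
    "Brain fog again today",
]
ratio = link_ratio(batch)
print(f"{ratio:.0%} of tweets contain {LINK_MARKER}")
```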

Certainly some news tweets convey a sentiment, for example when quoting people's words. However, the Covid pandemic has been characterized by a large volume of news on infection counts and daily reports. Tweets reporting the number of Covid infections and vaccinations often convey neutral sentiment, which could lead to a very large number of neutral tweets in a random sample. This is to some extent an inherent feature of our dataset, but I was wondering whether we should impose some restrictions on ourselves.

For example, during the labelling phase we could randomly select x% of the tweets (I would aim for a large x, like 75), and the remaining (100-x)% could be randomly selected among tweets that do not contain the string 'https://t.co/...'. This way we could get a more informative sample while, hopefully, not distorting the distribution of tweets too much.
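
A rough sketch of this two-part sampling, to make the proposal concrete. Several details are my assumptions rather than anything agreed on: tweets are given as `(tweet_id, text)` pairs, sampling is without replacement, the second part excludes tweets already drawn in the first part, and the function name `mixed_sample` and the synthetic tweets are purely illustrative.

```python
import random

LINK_MARKER = "https://t.co"


def mixed_sample(tweets, n, x=75, seed=0):
    """Draw about x% of n uniformly from all tweets, the rest from link-free tweets.

    tweets: list of (tweet_id, text) pairs.
    """
    rng = random.Random(seed)
    n_random = min(round(n * x / 100), len(tweets))
    first = rng.sample(tweets, n_random)

    # Assumption: a tweet should not be picked twice, so skip ids already drawn.
    chosen_ids = {tweet_id for tweet_id, _ in first}
    link_free = [
        (tweet_id, text)
        for tweet_id, text in tweets
        if LINK_MARKER not in text and tweet_id not in chosen_ids
    ]
    n_rest = min(n - n_random, len(link_free))
    second = rng.sample(link_free, n_rest)
    return first + second


# Synthetic data: odd ids carry an invented link, even ids do not.
tweets = [
    (i, f"text {i} https://t.co/abc" if i % 2 else f"text {i}")
    for i in range(100)
]
sample = mixed_sample(tweets, n=20, x=75, seed=42)
print(len(sample), "tweets sampled")
```

With x = 75 and n = 20, the first 15 tweets come from the whole batch and the last 5 only from link-free tweets, so link-free tweets end up somewhat over-represented relative to the raw data.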

Let me know what you think. Sampling is a delicate matter, so we should think carefully about it.

santarabantoosoo commented 2 years ago

@Claudio1729 batch C @EliGambicchia batch E_1 @elena-andreini batch F
@ahmedbhabbas batch A @lucapug batch E_2

Each user labels 50 tweets for the pilot study.