cardiffnlp / xlm-t

Repository for XLM-T, a framework for evaluating multilingual language models on Twitter data
Apache License 2.0

Better preprocessing #4

Open kinoute opened 2 years ago

kinoute commented 2 years ago

Hello,

I was wondering whether the preprocess function could be improved. Right now, it drops any punctuation attached to the end of a username or URL, because the whole token is replaced by the placeholder. Or was this done on purpose? I couldn't find a justification for it in your paper.

Right now, the preprocess function below would convert:

I love you @louisia!!!!

to

I love you @user

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

It seems to me that punctuation could help the model predict the sentiment of a tweet a little better if it were available. Another example: some users on Twitter start their tweets with a dot, like this:

.@Rudy is really bad. What a shame.

They do that to avoid the reply system while still mentioning a username. With the current preprocessing function, "@Rudy" doesn't get replaced, because the token ".@Rudy" starts with a dot rather than with the @.

Is there any particular reason why the preprocessing function was written this way, or could we try to make it more flexible on our end by keeping the punctuation next to usernames and URLs?
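To make the idea concrete, here is a minimal sketch of what a more flexible version might look like. It is not taken from the XLM-T repository: the name `preprocess_keep_punctuation`, both regular expressions, and the set of trailing characters are assumptions made for illustration. Note that it matches only `http://` and `https://` URLs, whereas the original replaces any token that starts with `http`.

```python
import re

# Hypothetical patterns, not part of the XLM-T code base.
# A mention is '@' followed by word characters, and must not be preceded
# by a word character (so e-mail addresses such as foo@bar.com are left alone).
MENTION_RE = re.compile(r"(?<!\w)@\w+")
# A URL runs from 'http://' or 'https://' up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")
# Assumed set of characters treated as punctuation trailing a URL.
TRAILING_PUNCTUATION = ".,!?;:)]}\"'"


def _replace_url(match):
    """Replace a URL with 'http' but keep punctuation stuck to its end."""
    url = match.group(0)
    core = url.rstrip(TRAILING_PUNCTUATION)
    return "http" + url[len(core):]


def preprocess_keep_punctuation(text):
    """Insert username and link placeholders without dropping punctuation."""
    text = MENTION_RE.sub("@user", text)
    return URL_RE.sub(_replace_url, text)


if __name__ == "__main__":
    print(preprocess_keep_punctuation("I love you @louisia!!!!"))
    print(preprocess_keep_punctuation(".@Rudy is really bad. What a shame."))
```

With this sketch, the two examples above would become "I love you @user!!!!" and ".@user is really bad. What a shame.". Whether the model actually benefits from the extra punctuation would still have to be checked, since this differs from the preprocessing shown above.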

Thank you!