MELALab / nela-gt

Repository for the NELA dataset
https://melalab.github.io/

Data preprocessing details #4

Open artidoro opened 2 years ago

artidoro commented 2 years ago

Could you give more details on how you preprocess the data? I noticed underscore characters are present instead of some special characters, for example.

It would be ideal if you could share the code you used to preprocess the data. I am comparing another dataset to NELA and I need to apply the same preprocessing steps to make sure the discriminators don't pick up preprocessing differences between the datasets.

Thank you for your help!

BenjaminDHorne commented 2 years ago

Hi,

There are no preprocessing steps. The data for each outlet comes directly from their RSS feed. If there are any artifacts, they come from the outlet itself, not from the collection process.

Ben


artidoro commented 2 years ago

Thank you for the information!

Following up, when you add the "@" signs, what tokenizer do you use? Are you splitting on whitespace, or do you use NLTK or spaCy?

artidoro commented 2 years ago

It seems like the text you provide is tokenized: for example, "Here 's" and "it ’ s" contain spaces. There are also spaces between words and punctuation, which is not stylistically common. Do you have any hunch as to how these came to be?

mgruppi commented 2 years ago

Hi @artidoro. After looking at your questions, I believe a few points need clarification. Yes, you are right: some tokenization does happen, since we need it to replace words with "@". We apply NLTK's word_tokenize to the raw input text. Hope this clarifies it!
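For anyone else puzzled by the spacing, here is a minimal stdlib sketch of the kind of splitting NLTK's word_tokenize performs (this is an approximation for illustration, not the actual NLTK code or the repo's pipeline): contraction suffixes and punctuation become separate tokens, so rejoining with spaces yields artifacts like "Here 's".

```python
import re

def simple_word_tokenize(text):
    """Rough approximation of NLTK-style word tokenization:
    splits off contraction suffixes (e.g. "'s") and punctuation
    as separate tokens. Not the actual NLTK implementation."""
    # \w+ matches word runs, '\w+ catches contraction suffixes,
    # [^\w\s] catches individual punctuation marks.
    return re.findall(r"\w+|'\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Here's the thing: it's tokenized.")
print(" ".join(tokens))
# → Here 's the thing : it 's tokenized .
```

Joining such tokens back with single spaces is what produces the "spaces between words and punctuation" observed above.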