GMouYes / MaliciousBotDetection

Code for paper "Malicious Bot Detection in Online Social Networks: Arming Handcrafted Features with Deep Learning"

Preprocessing steps for Botometer datasets #1

Open diana-xie opened 4 years ago

diana-xie commented 4 years ago

Hi,

Thanks so much for posting the code for your paper. I'm interested in training the model on my own dataset and have downloaded the Botometer datasets. However, I'm not sure how to preprocess the data so that the .npy files in ".../Data/" are ready to feed into the pipeline.

Would you have some code or examples showing how objects such as WordEmb and GAFMTF are produced from the Botometer CSV files? Or could you describe what the table used to generate these objects would look like?

Thanks so much!

Best, Diana

GMouYes commented 4 years ago

Hi Diana,

Data preprocessing can be tedious, and the details differ from one application to another. We are glad to share some insights; feel free to choose whatever works best for yours.

For text preprocessing, as described in our paper, we basically:

  1. Remove punctuation and replace certain domain-specific tokens with our own special ones. This can be done with plain string handling.
  2. Tokenize and embed. This depends on which embedding you are using. If you are applying pre-trained embeddings such as word2vec, GloVe, or FastText, you can rely on mature libraries such as gensim, which provides a suite of functions for importing pre-trained models and generating embeddings. If you are applying transformer-based models such as BERT, please check Huggingface's transformers library. Truncate or pad each sequence to a fixed size that works best for your research. :)
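
As a rough illustration of the two text steps above, here is a minimal sketch using only the Python standard library. The placeholder token names (`urltoken`, `usertoken`, `hashtagtoken`, `padtoken`), the regular expressions, and the function names are assumptions made for this example; they are not the special tokens or rules used in the paper.

```python
import re

# Hypothetical placeholder tokens (assumption: the paper's actual tokens differ).
# They are purely alphanumeric so that punctuation removal leaves them intact.
URL_TOKEN = "urltoken"
MENTION_TOKEN = "usertoken"
HASHTAG_TOKEN = "hashtagtoken"
PAD_TOKEN = "padtoken"


def clean_text(text):
    """Replace domain-specific tokens, then strip the remaining punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", f" {URL_TOKEN} ", text)  # links
    text = re.sub(r"@\w+", f" {MENTION_TOKEN} ", text)      # user mentions
    text = re.sub(r"#\w+", f" {HASHTAG_TOKEN} ", text)      # hashtags
    text = re.sub(r"[^\w\s]", " ", text)                    # other punctuation
    return text


def tokenize_fixed_length(text, max_len):
    """Split on whitespace, then truncate or pad to exactly max_len tokens."""
    tokens = clean_text(text).split()[:max_len]
    return tokens + [PAD_TOKEN] * (max_len - len(tokens))


example = tokenize_fixed_length("Check this out! https://t.co/abc @alice #bots", 8)
print(example)
```

Each token in the resulting fixed-length list can then be mapped to a vector with whichever embedding you choose, and the stacked vectors saved as a `.npy` array.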

For time-series transformation:

  1. The II map is clearly described in its original paper, so we implemented it ourselves.
  2. For GAF or MTF, pyts is probably the library you want if you are not going to implement them yourself.
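
To show what a GAF computes, here is a minimal standard-library sketch of the summation variant (GASF): rescale the series to [-1, 1], convert each value to an angle with arccos, and fill the image with cos(phi_i + phi_j). The function name and the handling of constant series are assumptions made for this example, and it is not the code used in the paper. In practice, pyts offers `GramianAngularField` and `MarkovTransitionField` in `pyts.image`, which also handle batching and image resizing.

```python
import math


def gramian_angular_summation_field(series):
    """Return the GASF image of a 1-D series as a list of lists."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Assumption: a constant series is mapped to all zeros after rescaling.
        scaled = [0.0 for _ in series]
    else:
        # Min-max rescale into [-1, 1] so that arccos is defined.
        scaled = [2 * (x - lo) / (hi - lo) - 1 for x in series]
    # Clamp against floating-point drift before taking arccos.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    # Entry (i, j) is cos(phi_i + phi_j), so the image is symmetric.
    return [[math.cos(a + b) for b in phi] for a in phi]


gasf = gramian_angular_summation_field([0, 1, 2, 3, 4])
print(len(gasf), len(gasf[0]))
```

A series of length n yields an n x n image, so long activity series are usually aggregated or downsampled first.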

Best, Guanyi