Preprocessing steps for Botometer datasets

Hi Diana,

Data preprocessing can be tedious, and the right steps differ from one application to another. We are glad to share some insights; feel free to choose whatever works best for yours.

For text preprocessing, as described in our paper, we basically:

Remove punctuation and replace certain domain-specific tokens with our own special ones. This can be done directly with string operations.
Tokenize and embed. The details depend on which embedding you use. For pre-trained embeddings such as word2vec, GloVe, or FastText, you can rely on mature libraries such as gensim, which provide a suite of functions for loading pre-trained models and generating embeddings. For transformer-based models such as BERT, please check Hugging Face's transformers library. Truncate or pad each sequence to a fixed length that works best for your research. :)
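
In case it helps, here is a minimal sketch of the string-level steps above (token replacement, punctuation removal, tokenization, and truncating or padding). It is only an illustration under assumptions: the placeholder tokens (`<url>`, `<user>`, `<hashtag>`, `<pad>`), the function names, and the whitespace tokenizer are hypothetical choices, not the exact ones used in the paper.

```python
import re
import string

# Hypothetical placeholder tokens; the paper's actual special tokens may differ.
URL_TOKEN = "<url>"
MENTION_TOKEN = "<user>"
HASHTAG_TOKEN = "<hashtag>"
PAD_TOKEN = "<pad>"

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Split on whitespace, swap domain-specific tokens, strip punctuation."""
    tokens = []
    for raw in text.split():
        if re.match(r"https?://", raw):
            tokens.append(URL_TOKEN)
        elif raw.startswith("@") and len(raw) > 1:
            tokens.append(MENTION_TOKEN)
        elif raw.startswith("#") and len(raw) > 1:
            tokens.append(HASHTAG_TOKEN)
        else:
            word = raw.translate(_PUNCT_TABLE).lower()
            if word:  # drop tokens that were punctuation only
                tokens.append(word)
    return tokens


def pad_or_truncate(tokens, max_len):
    """Cut the sequence to max_len, or right-pad it with PAD_TOKEN."""
    clipped = tokens[:max_len]
    return clipped + [PAD_TOKEN] * (max_len - len(clipped))


tokens = tokenize("Hello, @bob! Check https://t.co/xyz #news")
fixed = pad_or_truncate(tokens, 7)
```

The fixed-length token lists can then be mapped to vectors with whichever embedding you pick (gensim for static embeddings, or the tokenizer that ships with a transformer model, which handles padding and truncation on its own).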

For time-series transformation:

The II map is clearly described in the original authors' paper, so we implemented it on our own.
For GAF or MTF, pyts is probably the library you want if you are not going to implement them yourself.
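
If you would rather see what GAF computes before reaching for a library, below is a small pure-Python sketch of the summation variant (GASF): rescale the series to [-1, 1], take the arccosine of each value as an angle, and fill the image with the cosine of pairwise angle sums. The function name `gasf` is just an illustrative choice, and this is not our exact implementation; pyts offers the same transformation as `GramianAngularField` in `pyts.image`, with extras such as image resizing.

```python
import math


def gasf(series):
    """Gramian Angular Summation Field of a 1-D series, as a list of rows."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    # Min-max rescale to [-1, 1] so that arccos is defined for every point.
    scaled = [(2 * x - hi - lo) / (hi - lo) for x in series]
    # Clamp to guard against tiny floating-point overshoot.
    scaled = [max(-1.0, min(1.0, v)) for v in scaled]
    # Polar encoding: each value becomes an angle in [0, pi].
    phi = [math.acos(v) for v in scaled]
    # Entry (i, j) is cos(phi_i + phi_j).
    return [[math.cos(a + b) for b in phi] for a in phi]


image = gasf([0.0, 1.0, 2.0])
```

The result is a symmetric n x n matrix for a series of length n, so long series are usually downsampled first (pyts handles this through its `image_size` parameter).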

Best, Guanyi
