BunsenFeng / BotRGCN

Code listing for the paper 'BotRGCN: Twitter Bot Detection with Relational Graph Convolutional Networks'. ASONAM 2021.
MIT License

Data Preprocessing hardware requirements. #2

Closed bsexpa069380 closed 2 years ago

bsexpa069380 commented 2 years ago

I tried to run through the data preprocessing code on Google Colab Pro, yet even with the GPU and 24 GB of memory provided, I still ran out of memory constantly.

May I know what hardware setup you are using, and whether there is a possible solution to my issue?

KareemAlaa2001 commented 2 years ago

Are you running out of memory while generating the description embedding or the tweet embedding?

What I would suggest you do right away is call .detach() on any of the tensors you get as output from the call to feature_extract() in Dataset.py.

Especially in the case of the tweets, you're adding all of these to a tensor which is averaged for every user at the end, so PyTorch keeps hold of the computational graphs of all of these tensors. You don't need any backprop over the steps in the RoBERTa pipeline anyway, so calling `.detach()` drops those graphs and keeps your overall graph from exploding in size.
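A minimal sketch of that change (the `feature_extract` stand-in below is illustrative; in the repo it is the RoBERTa feature-extraction call in `Dataset.py`):

```python
import torch

# Illustrative stand-in for the RoBERTa feature_extract() call in Dataset.py:
# returns token-level embeddings for one piece of text, with a grad graph attached.
def feature_extract(text):
    return torch.randn(len(text.split()) + 2, 768, requires_grad=True)

def embed_tweets(tweets):
    """Average one embedding per tweet, detaching each so PyTorch
    does not retain the autograd graph of every forward pass."""
    per_tweet = []
    for tweet in tweets:
        emb = feature_extract(tweet).detach()  # drop the computational graph
        per_tweet.append(emb.mean(dim=0))      # mean over tokens -> (768,)
    return torch.stack(per_tweet).mean(dim=0)  # mean over tweets -> (768,)

user_emb = embed_tweets(["hello world", "a second tweet"])
```

Since nothing downstream of the pipeline is trained, the detached tensors carry the same values with none of the graph bookkeeping.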

If it still doesn't work, I'd recommend checking whether you can get through a full user before the error. If you can, materialise the results to disk every X users (depending on how many your setup can handle before breaking) and just pick up where you left off on each run.
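One way to sketch that chunked materialisation (the paths, chunk size, and helper names here are illustrative, not from the repo):

```python
import os
import tempfile
import torch

def embed_all_users(users, embed_user, out_dir, chunk=500):
    """Embed users chunk by chunk, saving each chunk to disk so that a
    crashed run can resume from the last completed chunk."""
    os.makedirs(out_dir, exist_ok=True)
    chunks = []
    for start in range(0, len(users), chunk):
        path = os.path.join(out_dir, f"chunk_{start // chunk}.pt")
        if os.path.exists(path):                 # finished in an earlier run
            chunks.append(torch.load(path))
            continue
        emb = torch.stack([embed_user(u) for u in users[start:start + chunk]])
        torch.save(emb, path)                    # materialise before moving on
        chunks.append(emb)
    return torch.cat(chunks, dim=0)

# Tiny demo with a fake per-user embedder (4-dim vectors, chunks of 3).
demo = embed_all_users(list(range(7)),
                       lambda u: torch.full((4,), float(u)),
                       out_dir=tempfile.mkdtemp(), chunk=3)
```

The chunk size is just a memory/restart-granularity trade-off; pick the largest value your RAM survives.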

bsexpa069380 commented 2 years ago

I upgraded to Colab Pro Plus and ran the data processing code again. However, when the code hits "Running feature2 embedding", it becomes painfully slow (>150 hours to finish that step). I am wondering if there is any solution to this problem, or should I adjust the code in the Twibot20 class?

KareemAlaa2001 commented 2 years ago

Yeah, the embedding generation code in this repo is just very slow: it doesn't actually parallelize the calls to the pipeline, instead iterating and appending results to a Python list before averaging across it. Your best bet would be to modify the code to batch these calls somehow: pad each user's list of tweets with empty strings up to the maximum length, concatenate and flatten all of these into one big batch, compute the embeddings on that whole batch, then reshape at the end.
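The pad/flatten/reshape idea can be sketched like this (using a stand-in `embed_batch` for a batched RoBERTa forward pass; all names here are illustrative):

```python
import torch

def batched_user_embeddings(user_tweets, embed_batch, dim=768):
    """Pad every user's tweet list to the same length, embed all tweets
    in one flattened batch, then reshape and mask-average per user.
    `embed_batch` stands in for a batched RoBERTa forward pass."""
    max_len = max(len(t) for t in user_tweets)
    padded, mask = [], []
    for tweets in user_tweets:
        pad = max_len - len(tweets)
        padded.extend(tweets + [""] * pad)             # pad with empty strings
        mask.extend([1.0] * len(tweets) + [0.0] * pad)
    flat = embed_batch(padded)                          # (n_users * max_len, dim)
    emb = flat.view(len(user_tweets), max_len, dim)
    m = torch.tensor(mask).view(len(user_tweets), max_len, 1)
    # mask-aware mean so the empty-string padding doesn't skew the average
    return (emb * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)

# Demo with a fake batch embedder: each "embedding" is the text length.
out = batched_user_embeddings(
    [["ab", "abcd"], ["abc"]],
    lambda texts: torch.stack([torch.full((3,), float(len(t))) for t in texts]),
    dim=3,
)
```

The mask is what makes the trick safe: padded slots contribute zero to the per-user sum and are excluded from the divisor.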

BunsenFeng commented 2 years ago

Alternatively, I can have someone upload the generated embeddings to Google Drive so you can directly download them. Let me know if this works for you.

bsexpa069380 commented 2 years ago

@BunsenFeng That would be a great help, thanks a lot!

@KareemAlaa2001 It seems a bit complicated for me, but I will try to figure that out too.

BunsenFeng commented 2 years ago

I asked @leopoldwhite to upload generated embeddings to Google Drive. Let me know if you have trouble downloading it.

leopoldwhite commented 2 years ago

> I asked @leopoldwhite to upload generated embeddings to Google Drive. Let me know if you have trouble downloading it.

This zip file includes the four generated embeddings, as well as edge_index.pt, edge_type.pt, and label.pt.

bsexpa069380 commented 2 years ago

I believe the data processing issue is solved now, since I can run through the code and it prints "Building graph Finished", which is the last print in the Twibot20 class.

Yet another error occurs: the code stops and shows `RuntimeError: mat1 and mat2 shapes cannot be multiplied (229580x5 and 6x32)`.

I thought it might be a mismatch between the size of the data and the input size the model takes?

leopoldwhite commented 2 years ago

> I believe the data processing issue is solved now, since I can run through the code and it prints "Building graph Finished", which is the last print in the Twibot20 class.
>
> Yet another error occurs: the code stops and shows `RuntimeError: mat1 and mat2 shapes cannot be multiplied (229580x5 and 6x32)`.
>
> I thought it might be a mismatch between the size of the data and the input size the model takes?

The above embeddings were generated from the adjusted Twibot20 dataset for our new work, where some of the numerical and categorical features are not retained. More details here.

One simple method is to adjust the corresponding input sizes of the model.
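In code terms, the fix is to make the first `nn.Linear` over the numerical properties match the feature width you actually have; the sizes below follow the error message (5 input features, 32-dim output), and the layer name is only meant to resemble what `model.py` does, not quote it:

```python
import torch
import torch.nn as nn

# Numerical properties from the downloaded embeddings: width 5, per the
# RuntimeError "(229580x5 and 6x32)".
num_prop = torch.randn(229580, 5)

# The original layer was built for 6 numerical features (Linear(6, 32));
# rebuilding it with in_features=5 matches the adjusted dataset.
linear_relu_num_prop = nn.Sequential(nn.Linear(5, 32), nn.LeakyReLU())

out = linear_relu_num_prop(num_prop)
```

The same check applies to the categorical branch if its feature count changed too.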

Or you can rewrite and run the num_prop_preprocess() and cat_prop_preprocess() parts of Dataset.py to get the embeddings of the numerical and categorical properties of the original Twibot20 dataset. (These two parts are not nearly as time-consuming as the description/tweet embedding parts.)
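A rough sketch of what such preprocessing typically looks like (the column choices and normalisation are assumptions for illustration; the actual fields and logic live in `Dataset.py`):

```python
import torch

def num_prop_preprocess(raw):
    """Z-score each numerical column (e.g. follower count, friend count);
    the exact columns used in Dataset.py may differ from this sketch."""
    x = torch.tensor(raw, dtype=torch.float)
    return (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)

def cat_prop_preprocess(raw):
    """Encode boolean profile fields (e.g. verified) as 0/1 floats."""
    return torch.tensor(raw, dtype=torch.float)

# Demo: three users with two numerical properties each.
nums = num_prop_preprocess([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
cats = cat_prop_preprocess([[1], [0], [1]])
```

Whatever the exact fields, the important part is that the resulting feature width matches the `num_prop_size`/`cat_prop_size` the model is constructed with.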