fani-lab / SEERa

A framework to predict future user communities in a text-streaming social network based on users' topics of interest.

Big Data issues #56

Open soroush-ziaeinejad opened 2 years ago

soroush-ziaeinejad commented 2 years ago

This issue page collects logs and Q&As about running SEERa on huge datasets.

hosseinfani commented 2 years ago

@soroush-ziaeinejad Did you fix the problem with the two months? The data should span Nov. 1, 2010 to Jan. 1, 2011.

soroush-ziaeinejad commented 2 years ago

> @soroush-ziaeinejad Did you fix the problem with the two months? The data should span Nov. 1, 2010 to Jan. 1, 2011.

Not yet. I'm working on fixing issues with the data from Oct. 1 to Dec. 1. Meanwhile, I will prepare the data from Nov. 1, 2010 to Jan. 1, 2011.

hosseinfani commented 2 years ago

Not sure I understood. Is there any specific problem with the data in the Oct. 1 period that won't exist in the Nov. 1 period?

soroush-ziaeinejad commented 2 years ago

No specific problem. It's just that preparing the csv files takes time, so I decided to work on this existing dataset and optimize the code as much as possible in the meantime.

soroush-ziaeinejad commented 2 years ago

For the dataset of two months of tweets, we have around 65K users. Having user graphs for all time intervals, we generate an embedded user matrix of shape (65K, dim). Applying cosine similarity to this matrix gives a matrix of size (65K, 65K). The catch is that the cosine similarity cannot be computed with plain NumPy arrays, and with sparse matrices it takes far too long (not even comparable to PyTorch). The best way we have found to compute the cosine similarity is PyTorch, which returns the result as a tensor.
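For reference, a minimal sketch of the PyTorch route, assuming the embedded user matrix is a NumPy array `E` of shape (65K, dim); the chunking (`chunk=4096`) is my addition to bound peak memory, not SEERa's actual code:

```python
import torch

def cosine_sim_torch(E, chunk=4096):
    """Pairwise cosine similarity of the rows of E, one slab of rows at a time."""
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    X = torch.as_tensor(E, dtype=torch.float32, device=device)
    X = torch.nn.functional.normalize(X, dim=1)  # unit-norm rows: dot product == cosine
    # The full result is dense float32: for 65K users that is ~17 GB on the host.
    out = torch.empty((X.shape[0], X.shape[0]), dtype=torch.float32)
    for i in range(0, X.shape[0], chunk):
        out[i:i + chunk] = (X[i:i + chunk] @ X.T).cpu()  # one (chunk, 65K) slab at a time
    return out
```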

After that, we apply Louvain graph clustering to the cosine-similarity result, but the tensor cannot be fed to it directly. So far, the only way to run Louvain on this graph is through a sparse representation, and converting the dense tensor to a sparse matrix causes a memory error. We are currently testing approaches for this conversion.
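One direction worth testing: threshold each slab of similarities before the dense (65K, 65K) matrix is ever materialized, so the conversion step disappears. A sketch assuming the same embedding matrix `E` as above, the `python-louvain` package, and a placeholder `threshold=0.5` (the threshold and names are assumptions):

```python
import torch
import networkx as nx
import community as community_louvain  # pip install python-louvain
from scipy.sparse import vstack, csr_matrix

def louvain_from_embeddings(E, threshold=0.5, chunk=4096):
    """Build a thresholded sparse similarity graph slab by slab, then cluster it."""
    X = torch.nn.functional.normalize(
        torch.as_tensor(E, dtype=torch.float32), dim=1)
    slabs = []
    for i in range(0, X.shape[0], chunk):
        slab = (X[i:i + chunk] @ X.T).numpy()  # dense for `chunk` rows only
        slab[slab < threshold] = 0.0           # drop weak edges immediately
        slabs.append(csr_matrix(slab))
    sim = vstack(slabs).tocsr()
    sim.setdiag(0)                             # no self-loops
    sim.eliminate_zeros()
    G = nx.from_scipy_sparse_array(sim)        # networkx >= 2.7
    return community_louvain.best_partition(G, weight='weight')
```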

soroush-ziaeinejad commented 2 years ago

@hosseinfani

I decided to work with ComputeCanada since I couldn't find a way to resolve the memory error for clustering graphs in CPL. Now I keep getting this error when I try to dump a graph into a pickle file: `OSError: [Errno 122] Disk quota exceeded`

Do you have any idea? I searched and found a solution, but it didn't work. Another option (not a good one) is to run the code up to the end of the GEL layer on my workstation and then move the generated files to the ComputeCanada servers to run the CPL and APL layers.
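If it helps, a hedged workaround sketch: dump the pickle to the scratch filesystem instead of home/project space, since scratch quotas on Compute Canada clusters are much larger (the `$SCRATCH` variable and the file name here are assumptions):

```python
import os
import pickle

# Assumption: the cluster exposes per-user scratch space via $SCRATCH,
# which has a far larger quota than $HOME or the project space.
out_dir = os.environ.get('SCRATCH', '.')
with open(os.path.join(out_dir, 'user_graph.pkl'), 'wb') as f:
    # `graph` is the user graph object from the GEL layer in the thread above.
    pickle.dump(graph, f, protocol=pickle.HIGHEST_PROTOCOL)  # protocol 4+ handles >4 GB objects
```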

hosseinfani commented 2 years ago

@soroush-ziaeinejad @VaghehDashti I think you have to free up some space; the disk quota is assigned per supervisor, I believe.

soroush-ziaeinejad commented 2 years ago

@hosseinfani @VaghehDashti I think the problem is resolved for now. I'll let you know if I face it again.

soroush-ziaeinejad commented 1 year ago

@hosseinfani,

I successfully ran SEERa on the [Oct, Nov] 2010 dataset to the end of the cpl layer for one combination and got the output files. In the apl layer, I hit a problem: in the Model Evaluation part, it cannot aggregate mentioned news by user and returns an empty dictionary, which leads to no results! Right now, I am tracing and debugging the code to fix this. Once I finish, I can copy the fixed files and run the model with the other configurations.
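For context, a minimal sketch of the aggregation being debugged, assuming a pandas DataFrame `tweets` with `UserId` and `NewsId` columns (the names are assumptions); an empty dictionary here usually means an upstream merge or filter dropped every row:

```python
import pandas as pd

def news_by_user(tweets: pd.DataFrame) -> dict:
    """Collect the news ids each user mentioned; empty input -> empty dict."""
    mentions = tweets.dropna(subset=['NewsId'])
    if mentions.empty:  # the symptom in this thread: nothing survives the upstream steps
        return {}
    return mentions.groupby('UserId')['NewsId'].apply(list).to_dict()
```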

Meanwhile, I switched to the [Nov, Dec] 2010 dataset, which has many more instances than [Oct, Nov] 2010. SEERa is now running on it, generating the processed documents and models.

soroush-ziaeinejad commented 1 year ago

@hosseinfani I don't know why I was applying cosine similarity to dense DataFrames and only then making them sparse! I reversed the order (sparsify first, then apply cosine similarity), and now the whole uml layer runs in under 30 minutes instead of 8-10 hours on the [Nov, Dec] 2010 dataset!
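The reordering amounts to this (a sketch; `user_topic_df` is a stand-in name for the per-user topic matrix). scikit-learn's `cosine_similarity` accepts SciPy sparse input directly and, with `dense_output=False`, keeps the result sparse end to end:

```python
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

U = csr_matrix(user_topic_df.values)             # sparsify first: most entries are zero
sims = cosine_similarity(U, dense_output=False)  # sparse in, sparse out
```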

Also, the padding (zero topic vectors for users without tweets on each day) was super inefficient. I changed the approach, and it now takes 3 seconds instead of 20 minutes per day!
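The padding fix is essentially one vectorized `reindex` call instead of a per-user Python loop; `day_topics` (a per-day DataFrame indexed by user id) and `all_users` are assumed names, not SEERa's actual identifiers:

```python
# Absent users get all-zero topic rows in a single vectorized pass.
padded = day_topics.reindex(all_users, fill_value=0)
```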

Cheers :)

hosseinfani commented 1 year ago

@soroush-ziaeinejad soroush soroush! :D

soroush-ziaeinejad commented 1 year ago

@hosseinfani

Filtering is applied to users whose aggregated tweet count (after pooling over each time interval) falls below a specific threshold. For now, the threshold is set to 10; later we will run experiments to find a more reasonable (or maybe relative) threshold, along with a complete justification.

For now, what I can say is that we had more than 125K users for Nov. and Dec. before filtering, of whom more than 88K had tweeted in only one time interval. In other words, the dataset contains a lot of inactive users, whose noisy behaviour hurts GEL, CPL, and APL in both accuracy and efficiency.
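A sketch of the filter as described above (the frame `pooled` and the column names `UserId` and `TweetCount` are assumptions; the threshold of 10 is the current setting from this comment):

```python
THRESHOLD = 10  # current setting; to be tuned experimentally later

# Assumed layout: `pooled` has one row per (UserId, TimeInterval) with a TweetCount column.
per_user = pooled.groupby('UserId')['TweetCount'].sum()
active_users = per_user[per_user >= THRESHOLD].index
filtered = pooled[pooled['UserId'].isin(active_users)]  # drops the ~88K near-inactive users
```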

The problems with the [Nov, Dec] dataset are mostly resolved after applying this filtering. APL still has an independent piece of code that reads the whole (unfiltered) data. I will push the fix and comment on this issue once the problem is completely resolved.