fabiogiglietto / CooRnet

Given a set of URLs, this package detects coordinated link sharing behavior on social media and outputs the network of entities that performed such behavior.
MIT License

get_coord_shares running in parallel taking too long #21

Closed luisestradasd closed 3 years ago

luisestradasd commented 4 years ago

Hello, once again, such a great tool! Thank you so much for working on and maintaining this!

I'm running get_coord_shares, but after about 33 hours it has only progressed 20%.

I'm running the function with these parameters: parallel = TRUE, keep_ourl_only = FALSE. The dataframe for ct_shares.df contains 692,701 objects (I know it's a bit big, though).

The question is: is this normal? I've run this in the past and it usually takes a couple of hours, or even less. I'm worried that maybe my R environment might be corrupted and is looping the execution of get_coord_shares or something.

fabiogiglietto commented 4 years ago

Hi Luis, 20% in 33 hours is not normal, even for a large ct_shares. Can you take a look at the log file to see what the estimated coordination interval was?

luisestradasd commented 4 years ago

Here are the logs:

q (quantile of quickest URLs to be filtered): 0.1
p (percentage of total shares to be reached): 0.5
coordination interval from estimate_coord_interval: 12 secs

And these are the results from the get_ctshares script:

Original URLs: 29251
CT shares: 692701
Unique URLs in CT shares: 153314
Link in CT shares matching original URLs: 41684

It's not that big a dataset.

I think my context got corrupted, and I have the feeling that the best thing to do is reboot the computer and start over.

Let me know what you think.

fabiogiglietto commented 4 years ago

Most of the time, slow performance (and sometimes R crashes) on large ct_shares is caused by a shortage of RAM. When such a shortage occurs, the process becomes really slow. The coordination interval also matters, because the shorter the interval, the larger the number of time slots analyzed for each URL. Anyway, I share your opinion about the reboot, because it is very likely that R will crash before completion.
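
As a rough illustration (a toy example, not the actual package code), a shorter interval means many more time slots to scan for each URL:

```r
# Toy example: 200 shares of a single URL spread over one hour.
set.seed(1)
shares <- as.POSIXct("2020-08-01 00:00:00", tz = "UTC") +
  sort(runif(200, min = 0, max = 3600))

# Number of interval-sized slots the shares of this URL fall into.
slots_for_interval <- function(times, interval_secs) {
  breaks <- seq(min(times), max(times) + interval_secs, by = interval_secs)
  length(unique(cut(times, breaks = breaks)))
}

slots_for_interval(shares, 120)  # a few wide slots
slots_for_interval(shares, 12)   # many more slots, hence more work per URL
```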

fabiogiglietto commented 4 years ago

Hi again, did you manage to finish the analysis?

luisestradasd commented 4 years ago

Hello.

Actually, no. But we noticed several interesting things:

Right now, I'm running just the loop part, and the plan is to execute the code line by line to verify where it hangs.

If you have any advice or tips, I will totally appreciate it :)

Luis


luisestradasd commented 4 years ago

I dissected the get_coord_shares function and tried to run it step by step to identify where it hangs. It seems that my code hangs on this line: df <- dplyr::bind_rows(datalist)

This is right after the loop that processes the groups of URLs.

The datalist generated by the loop is a large list with 50,620 elements (68.7 GB).

Right now, I'm testing the old version of this code, without dplyr, but it seems to give the same result.

I'll keep you posted

fabiogiglietto commented 4 years ago

Can you check the memory consumption while executing that code?
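
If it helps, a quick way to check from inside the R session (base R only):

```r
# Memory currently used by the R session, in Mb ("used" and "max used" columns).
gc()

# From the OS side, you can also watch the rsession / RStudio process
# in Task Manager (Windows) or with top / htop (Linux, macOS).
```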

luisestradasd commented 4 years ago

RStudio is using on average 12 GB of RAM and 24% of CPU.

As a side note, the code I mentioned 10 hours ago is still running.

fabiogiglietto commented 4 years ago

What % of your total RAM is 12 GB? That line basically attempts to bind 50,620 dataframes (each related to one of your URLs) into one. Sometimes the resulting dataframe does not fit into memory.
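
As a rough check (just a sketch, assuming the list is called datalist as in your message), you can compare the total size of the pieces with the RAM you have left:

```r
# The bound dataframe needs at least as much memory as the sum of the pieces,
# plus temporary copies made while binding.
total_bytes <- sum(sapply(datalist, function(x) as.numeric(object.size(x))))
round(total_bytes / 1024^3, 1)  # approximate total size in GB
```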

luisestradasd commented 4 years ago

I have 32 GB of RAM, but the OS takes around 2 GB for itself.

Funny thing: we ran the same code on an AWS server with 120 GB of RAM and the result was pretty much the same.


fabiogiglietto commented 4 years ago

Hi again, can you try binding the list of dataframes using rbindlist() from the data.table package?
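
Something like this, assuming the list is still called datalist as in your earlier message (a sketch to adapt, not a tested patch):

```r
library(data.table)

# df <- dplyr::bind_rows(datalist)   # the line that currently hangs
df <- rbindlist(datalist, use.names = TRUE, fill = TRUE)
df <- as.data.frame(df)  # convert back if the downstream code expects a plain data.frame
```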

luisestradasd commented 4 years ago

Hello, we managed to solve the issue, but we had to reduce the size of the CT data, cutting it down to 10 unique links and setting keep_ourl_only = TRUE. We used a 15-second threshold.

We ran it on a 40-core, 95 GB RAM VM on AWS. It took a little over 30 minutes this way.
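
For reference, the call with these settings looked roughly like this (parameter names as I understand them from the package documentation; the exact form expected for coordination_interval may differ, so treat this as a sketch):

```r
output <- CooRnet::get_coord_shares(
  ct_shares.df = ct_shares.df,        # the reduced CrowdTangle shares dataframe
  coordination_interval = "15 secs",  # the 15-second threshold
  parallel = TRUE,
  keep_ourl_only = TRUE               # only keep shares of the original URLs
)
```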

I will modify my code to use your suggestion with the original dataframe; it was around 700K total posts.

Thank you so much for everything


fabiogiglietto commented 4 years ago

Glad to hear it worked. The ourl option speeds up the process because it only takes into account URLs from your original list (CT returns posts that contain your link, but sometimes posts have more than one link in the text, and these posts are also returned).

I'll keep this open to hear back about your data.table experiment. Let me know how it goes. Best, Fabio

fabiogiglietto commented 4 years ago

Hi Luis, during the last few weeks we pushed several updates aimed at improving the speed of CooRnet::get_coord_shares. I've also noticed that the function is sometimes surprisingly slow on AWS spot instances. It would be great if you could confirm the speed improvement so that I can eventually close this issue.

luisestradasd commented 3 years ago

Hello, sorry for the really late response. I changed jobs and had to move away from this project, but I will take some time to verify this change.

Thank you so much for maintaining this.

Luis