athalhammer / danker

Compute PageRank on >3 billion Wikipedia links on off-the-shelf hardware.
GNU General Public License v3.0

Re-use links bzipped dump file #17

Open Benja1972 opened 3 years ago

Benja1972 commented 3 years ago

Hello, thank you for the nice tool. I have a question: how can I run danker on a links file that danker has already downloaded and processed?

I ran "./danker.sh ALL --bigmem" and after a few hours it crashed with a memory issue, but the bzipped links file had already been created. How can I reuse this file to compute only the PageRank?

Thank you! Sergei

athalhammer commented 3 years ago

Hi Sergei,

Thanks a lot for your question! So the best option would be to run the following:

 bunzip2 filename.links.bz2
 python3 -m danker filename.links 0.85 40 0.1 -i | sed "s/\(.*\)/Q\1/" > output.rank
 sort -k 2,2nr -T . -S 50% -o output.rank output.rank

You would need to make sure that the machine has enough main memory available this time.
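
To illustrate what the last two steps of this pipeline do, here is a minimal sketch on made-up data. It assumes the rank output has one numeric ID and one score per line, separated by a tab; the three sample rows are hypothetical, and `LC_ALL=C` is added here only to keep the numeric sort independent of the locale.

```shell
# Hypothetical sample of rank output: numeric ID, a tab, then a score.
sample=$(printf '42\t0.5\n7\t2.25\n100\t1.0\n')

# Prefix every line with "Q" (as in the sed step above), then sort by the
# second column, numerically and in descending order (as in the sort step).
ranked=$(printf '%s\n' "$sample" | sed "s/\(.*\)/Q\1/" | LC_ALL=C sort -k 2,2nr)

printf '%s\n' "$ranked"
```

With this sample the highest-scoring entry, Q7, comes first and Q42 comes last.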

An alternative would be the following:

 bunzip2 filename.links.bz2
 sort -k 2,2n -T . -S 50% -o filename.links.right filename.links
 python3 -m danker filename.links -r filename.links.right 0.85 40 0.1 -i | sed "s/\(.*\)/Q\1/" > output.rank
 sort -k 2,2nr -T . -S 50% -o output.rank output.rank

This takes a bit longer but the memory footprint should be less than 8GB.
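
As a small illustration of the extra sort step in this alternative, the sketch below builds a second copy of a links file ordered by its right-hand column. It assumes the links file holds one tab-separated pair of numeric IDs per line; the file name `sample.links` and its contents are hypothetical, and the `-S 50%` buffer option is left out because the sample is tiny.

```shell
# Work in a throwaway directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Hypothetical links file: "source<TAB>target", ordered by the left column.
printf '1\t3\n2\t1\n3\t2\n3\t1\n' > "$tmp/sample.links"

# Same sort key as above: order by the second (right) column, numerically.
LC_ALL=C sort -k 2,2n -T "$tmp" -o "$tmp/sample.links.right" "$tmp/sample.links"

# Keep the right column for inspection, print the result, then clean up.
right_col=$(cut -f 2 "$tmp/sample.links.right" | tr '\n' ' ')
cat "$tmp/sample.links.right"
rm -rf "$tmp"
```

The original file stays sorted by source ID while the `.right` copy is sorted by target ID, which is the pair of inputs the `-r` invocation above expects.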

Let me know which option worked for you!

Benja1972 commented 3 years ago

Hi Andreas, thank you for your answer. I will try it out. It would be nice to have a predefined bash script that does the same for any output of the link collector, just in case.

Best Sergei

athalhammer commented 3 years ago

Hmm, let me think about the best way to separate this out from the workflow script... https://github.com/athalhammer/danker/blob/20cc2b7b1fe5d937ea5204d214a074baf3400c93/script/dank.sh#L106

Benja1972 commented 3 years ago

Thank you @athalhammer! It works for me. I tested the lines of code you provided earlier. Sergei

athalhammer commented 3 years ago

So I thought about it and I came to the conclusion that wrapping these three lines in a designated script would be overkill.