bbolli / tumblr-utils

Utilities for dealing with Tumblr blogs, Tumblr backup
GNU General Public License v3.0

tumblr_backup: Add --save-notes and --cookies options #189

Closed · cebtenzzre closed 5 years ago

cebtenzzre commented 5 years ago

I've tested this with only one post so far, but it seems to work alright. It brings in a lot of new code and dependencies, but it's all optional. This also allows youtube-dl to use the cookies that are provided.
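
(For reference, youtube-dl can read a Netscape-format cookie file through its `cookiefile` option; a minimal standalone sketch, with an example path and URL:)

```python
# Sketch only: pass a Netscape/Mozilla-format cookies.txt to youtube-dl.
# The path and URL below are examples, not values from this PR.
import youtube_dl

ydl_opts = {'cookiefile': 'cookies.txt'}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://example.tumblr.com/post/1'])
```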

bbolli commented 5 years ago

I did a very quick review inline. More importantly: is it possible to move the part starting at line 2075 into class WebCrawler and maybe the whole class into a separate file?

cebtenzzre commented 5 years ago

The `and web_crawler` check is there so that web_crawler can fail to load (line 1385) and the script can still continue. Now that I think about it, it's probably better to just throw an exception.
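
Roughly, the pattern in question looks like this (a sketch with illustrative names, not the exact PR code):

```python
# Optional dependency: if the web_crawler module (or its own dependencies)
# can't be imported, the rest of the script keeps working.
try:
    import web_crawler
except ImportError:
    web_crawler = None

save_notes = True  # stand-in for the parsed --save-notes option

# The "and web_crawler" part silently skips notes when the module is missing:
if save_notes and web_crawler:
    print('saving notes')
else:
    print('web_crawler unavailable; notes will not be saved')
```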

bbolli commented 5 years ago

Yes, better not let the user think the notes were saved when the crawler can't initialize.

cebtenzzre commented 5 years ago

So far I've removed the exception-silencing and moved more code into WebCrawler.

cebtenzzre commented 5 years ago

D'oh, I somehow just realized that cookielib is part of the Python standard library.
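
(A quick illustration, standard library only; the file name is an example:)

```python
# cookielib ships with Python 2 (it's http.cookiejar on Python 3) and can
# load a Netscape/Mozilla-format cookies.txt directly.
import cookielib

jar = cookielib.MozillaCookieJar('cookies.txt')  # example path
jar.load()  # raises cookielib.LoadError if the file isn't in Netscape format
for cookie in jar:
    print(cookie.name)
```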

cebtenzzre commented 5 years ago

The class is in its own file now. I hope I did that right xD

bbolli commented 5 years ago

You could also move the import checks into the web_crawler module, then check whether an import succeeded with e.g. `if web_crawler.bs4`. Do that in an additional commit, so the evolution can be seen later.
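
A sketch of that suggestion (hypothetical module contents):

```python
# web_crawler.py -- try the optional imports once, at module level, and
# leave the result inspectable by the caller.
try:
    import bs4
except ImportError:
    bs4 = None

# tumblr_backup.py would then check the attribute, e.g.:
#
#     import sys, web_crawler
#     if web_crawler.bs4 is None:
#         sys.exit('--save-notes requires BeautifulSoup (bs4)')
```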

cebtenzzre commented 5 years ago

I'm now trying to figure out how to get it to cleanly exit on SIGINT. Currently it throws a lot of exceptions.
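
One common shape for this (not necessarily what this PR ends up doing) is to catch KeyboardInterrupt once in the main thread and let workers poll a shared Event:

```python
# Hypothetical sketch: a single Ctrl+C sets a flag that worker threads
# check, so they can finish their current item and exit quietly.
import threading
import time

stop = threading.Event()

def worker():
    while not stop.is_set():
        time.sleep(0.1)  # stand-in for fetching/parsing one notes page

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
try:
    while any(t.is_alive() for t in threads):
        time.sleep(0.5)
except KeyboardInterrupt:
    stop.set()  # ask workers to wind down instead of raising everywhere
    for t in threads:
        t.join()
```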

cebtenzzre commented 5 years ago

All of the obvious issues have now been fixed.

cebtenzzre commented 5 years ago

Well, it still seems to (sometimes) either ignore SIGINT or hang when it's sent. Luckily SIGQUIT (Ctrl+\) still works.

cebtenzzre commented 5 years ago

I've somehow only just realized that JavaScript (and thus Selenium) isn't necessary for this. In my testing, I must have been changing too many experimental variables at once.
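
For context: the notes pages are served as plain HTML, so something along these lines works without a browser (the URL shape and CSS classes here are assumptions about Tumblr's markup, not the PR's exact code):

```python
# Hypothetical sketch: fetch one notes page over plain HTTP and parse it
# with BeautifulSoup -- no JavaScript engine involved.
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.tumblr.com/notes/1/abcdef', timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'lxml')
notes = soup.find_all('li', class_='note')       # assumed note markup
more = soup.find('a', class_='more_notes_link')  # link to the next page
```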

cebtenzzre commented 5 years ago

Alright, now it actually gets the right notes (I think all of the Selenium requests were simultaneous and on the same virtual tab before c70f09a) and it doesn't eat up an insane amount of CPU anymore. Those are good things.

It's still having issues with SIGINT, and it still doesn't get all of the notes (at least according to the note count -- is that accurate?), but it's good enough for me to use now.

cebtenzzre commented 5 years ago

So, locally I have a version that runs the web crawler as a subprocess. The reasoning is that because of the GIL, Python threads can't actually execute in parallel, and the web crawler does a lot of work that would benefit from real parallelism (HTML parsing, waiting on HTTP requests, and looping back to get the next set of notes). I first tried IronPython, which doesn't have a GIL, but after building my own AUR package just to get it to build and import modules, I found that it won't support lxml without Ironclad, and Ironclad only builds properly on Windows.
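
A minimal sketch of the idea (using multiprocessing for illustration; the actual mechanism in my local version may differ):

```python
# Hypothetical sketch: crawl notes in a child process so its HTML parsing
# and HTTP waits don't contend for the parent interpreter's GIL.
import multiprocessing as mp

def crawl_notes(post_url, queue):
    # the real crawler would fetch and parse the notes pages here
    queue.put((post_url, []))

if __name__ == '__main__':
    queue = mp.Queue()
    proc = mp.Process(target=crawl_notes,
                      args=('https://example.tumblr.com/post/1', queue))
    proc.start()
    post_url, notes = queue.get()  # blocks until the child reports back
    proc.join()
```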

With -k --save-notes -s 200 -n 200 --cookies <cookiefile> on one blog, I compared the execution time of each approach (note the value of real).

As a module:

real    107.98s
user    98.57s
sys 34.58s

As a subprocess:

real    41.27s
user    71.70s
sys 8.17s

So this is more than a 2x speedup of the overall execution (including everything outside of the crawler). The only downside is that it uses more RAM and CPU, due to running multiple interpreter instances.

Should we make this the default, or perhaps an option?

cebtenzzre commented 5 years ago

This is extremely out of date now. I've been doing a lot of work on a few local forks. If there's ever further interest in this, I will organize and publish my current code.