ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Store cookies in database instead of in memory #448

Open JustAnotherArchivist opened 4 years ago

JustAnotherArchivist commented 4 years ago

Cookies are currently kept entirely in memory. There are at least three serious drawbacks of this approach:

  1. Resuming is only possible after a clean stop of wpull. --save-cookies only takes effect at the end of a job, so if a wpull crawl dies for any reason (wpull crash, OS crash, power outage, etc.), there is no record of the cookies. If WARCs are written, the cookie list could in theory be reconstructed from those, but that's obviously a horrible idea.

  2. On large jobs spanning many domains, the cookie list can quickly grow very large, blowing up the memory usage of the wpull process. I've seen AB jobs running hundreds of MB over the normal RSS, and it looked like most of that came from cookies (but I haven't properly analysed it).

  3. http.cookiejar's performance is horrible for large jobs. Having seen this a few times before, I just dove a bit into AB job m78hkg0crv4kbyy2haa0xihc, which is running very slowly (only 1 req/s). Specifically, I ran py-spy on it (py-spy top --pid $(pgrep -f m78hkg0crv4kbyy2haa0xihc)). This revealed that nearly the entire CPU time (95 % after ~10k samples) is consumed by add_cookie_header (wpull/cookiewrapper.py:80), i.e. the call to http.cookiejar.CookieJar.add_cookie_header. The breakdown within that call is complex and split between http.cookiejar and urllib.parse; it appears that the call tree from that function repeatedly parses URLs to determine which cookies are applicable to a request (see the sketch below the list).
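
To illustrate the scaling problem outside wpull, here is a minimal micro-benchmark (a hypothetical sketch, not a measurement from the AB job): http.cookiejar.CookieJar.add_cookie_header walks every stored cookie domain and re-derives the request host for each one, so the per-request cost grows with the number of distinct domains in the jar.

```python
import http.cookiejar
import time
import urllib.request


def make_cookie(domain, name='session', value='x'):
    # The long positional signature is http.cookiejar.Cookie's own.
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=False,
        comment=None, comment_url=None, rest={},
    )


# Simulate a multi-domain crawl: one cookie per visited host.
jar = http.cookiejar.CookieJar()
for i in range(20000):
    jar.set_cookie(make_cookie(f'host{i}.example'))

request = urllib.request.Request('http://host0.example/')
start = time.perf_counter()
for _ in range(10):
    # Each call iterates over all 20k cookie domains and parses the
    # request URL again for each of them.
    jar.add_cookie_header(request)
print(f'{(time.perf_counter() - start) / 10 * 1000:.1f} ms per add_cookie_header call')
```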

Instead, cookies should be kept in the database (if one is specified), behind an interface that handles domain matching more efficiently than http.cookiejar's implementation. wpull already normalises URLs/domains, so most of that parsing work can probably be avoided.
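
As a rough sketch of what such a database-backed store could look like (the class, schema and method names below are made up for illustration and are not wpull's actual interface), cookies could live in an SQLite table keyed by the already-normalised domain, so that collecting the cookies for a request becomes an indexed query over the exact host and its parent domains instead of a scan over every cookie held in memory:

```python
import sqlite3
import time


class SQLiteCookieStore:
    """Sketch of a database-backed cookie store. Domains and paths are
    assumed to be normalised before they get here."""

    def __init__(self, path=':memory:'):
        self._db = sqlite3.connect(path)
        self._db.execute(
            'CREATE TABLE IF NOT EXISTS cookies ('
            ' domain TEXT NOT NULL,'
            ' path TEXT NOT NULL,'
            ' name TEXT NOT NULL,'
            ' value TEXT NOT NULL,'
            ' expires INTEGER,'
            ' PRIMARY KEY (domain, path, name))'
        )

    def set_cookie(self, domain, path, name, value, expires=None):
        # Upsert: at most one cookie per (domain, path, name), mirroring
        # normal cookie replacement semantics.
        self._db.execute(
            'INSERT OR REPLACE INTO cookies VALUES (?, ?, ?, ?, ?)',
            (domain, path, name, value, expires),
        )
        self._db.commit()

    def cookies_for(self, domain, path):
        # Match the exact host plus its dot-prefixed parent domains; the
        # path check is a simplified prefix match, not full RFC 6265.
        labels = domain.split('.')
        candidates = [domain] + [
            '.' + '.'.join(labels[i:]) for i in range(len(labels) - 1)
        ]
        query = (
            'SELECT name, value FROM cookies'
            ' WHERE domain IN ({})'
            ' AND (expires IS NULL OR expires > ?)'
            " AND ? LIKE path || '%'"
        ).format(','.join('?' * len(candidates)))
        rows = self._db.execute(query, (*candidates, int(time.time()), path))
        return dict(rows.fetchall())


# Example usage: the store lives on disk, so it also survives crashes,
# unlike --save-cookies which only writes out at the end of a job.
store = SQLiteCookieStore('crawl-cookies.db')
store.set_cookie('.example.com', '/', 'session', 'abc123')
print(store.cookies_for('www.example.com', '/page'))  # {'session': 'abc123'}
```

Because the candidate domains are computed from the already-normalised host, the lookup touches only a handful of indexed rows per request, independent of how many other domains the crawl has collected cookies for.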