Cookies are currently kept entirely in memory. There are at least three serious drawbacks of this approach:
1. Resuming is only possible after a clean stop of wpull. `--save-cookies` only takes effect at the end of a job, so if a wpull crawl dies for some reason (wpull crash, OS crash, power outage, etc.), there is no record of the cookies. If WARCs are written, the cookie list could in theory be extracted from those, but that's obviously a horrible idea.
2. On large jobs spanning many domains, the cookie list can quickly grow to a huge size, blowing up the memory usage of the wpull process. I've seen AB jobs with hundreds of MB over the normal RSS, and it looked like most of that came from cookies (but I haven't properly analysed it).
3. `http.cookiejar`'s performance is horrible for large jobs. Having seen this a few times before, I just dove a bit into AB job m78hkg0crv4kbyy2haa0xihc, which is running very slowly (only 1 req/s). Specifically, I ran py-spy on it (`py-spy top --pid $(pgrep -f m78hkg0crv4kbyy2haa0xihc)`). This revealed that nearly the entire CPU time (95 % after ~10k samples) is consumed by `add_cookie_header` (`wpull/cookiewrapper.py:80`), i.e. the call to `http.cookiejar.CookieJar.add_cookie_header`. The breakdown within that is complex and split between `http.cookiejar` and `urllib.parse`; it appears that the call tree from that function repeatedly parses URLs to determine which cookies are applicable for a request.
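As a rough illustration of the last point, here is a hypothetical micro-benchmark (not taken from wpull; the host names are made up) showing how `CookieJar.add_cookie_header` gets slower as the jar accumulates cookies from many domains — the jar has to consider every stored domain on each request:

```python
import time
import urllib.request
from http.cookiejar import Cookie, CookieJar

def make_cookie(domain):
    # http.cookiejar.Cookie has no convenience constructor; every field
    # must be passed explicitly.
    return Cookie(
        version=0, name="id", value="x", port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True, secure=False, expires=None,
        discard=True, comment=None, comment_url=None, rest={},
    )

def time_header_calls(num_domains, calls=100):
    """Fill a jar with one cookie per domain, then time add_cookie_header."""
    jar = CookieJar()
    for i in range(num_domains):
        jar.set_cookie(make_cookie(f"host{i}.example"))
    request = urllib.request.Request("http://host0.example/")
    start = time.perf_counter()
    for _ in range(calls):
        jar.add_cookie_header(request)
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (100, 10_000):
        print(f"{n:>6} domains: {time_header_calls(n):.3f}s for 100 calls")
```

Even with only one cookie per domain, the per-request cost grows with the total number of domains in the jar, which matches the behaviour seen in the py-spy profile.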
Instead, cookies should be kept in the database (if one is specified) behind a dedicated interface that handles domain matching more efficiently than the `http.cookiejar` implementation. wpull already normalises URLs/domains, so most of that processing can probably be avoided.
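A minimal sketch of what that could look like (hypothetical schema and names, not wpull's actual design; expiry, the secure flag, path matching, and host-only vs. domain cookie semantics are all omitted). Lookups hit an index on the pre-normalised host instead of re-parsing URLs, and each write is committed immediately so the state survives a crash:

```python
import sqlite3

class CookieStore:
    """Sketch of a SQLite-backed cookie store keyed by normalised domain."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cookies ("
            " domain TEXT NOT NULL,"  # normalised, e.g. 'example.org'
            " name TEXT NOT NULL,"
            " value TEXT NOT NULL,"
            " path TEXT NOT NULL DEFAULT '/',"
            " PRIMARY KEY (domain, name, path))"
        )

    def set(self, domain, name, value, path="/"):
        # Commit per write so a crash loses at most the in-flight cookie.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO cookies VALUES (?, ?, ?, ?)",
                (domain, name, value, path))

    def header_for(self, host):
        # The host is assumed already normalised, so applicable domains are
        # just the host and its parent domains (minus the bare TLD) -- a
        # handful of indexed equality lookups, no URL parsing.
        labels = host.split(".")
        candidates = [".".join(labels[i:]) for i in range(len(labels) - 1)]
        candidates = candidates or [host]
        placeholders = ",".join("?" * len(candidates))
        rows = self.db.execute(
            f"SELECT name, value FROM cookies WHERE domain IN ({placeholders})",
            candidates).fetchall()
        return "; ".join(f"{name}={value}" for name, value in rows)
```

For example, a cookie stored under `example.org` would be returned for a request to `www.example.org` via two indexed lookups. A real implementation would also need public-suffix handling so a cookie can't be set on e.g. `co.uk`, but the point stands: with normalised domains as the key, the hot path is a few dictionary-style lookups rather than repeated URL parsing.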