ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

wpull spends a lot of time in add_cookie_header #117

Open ivan opened 6 years ago

ivan commented 6 years ago

Profiling with https://github.com/uber/pyflame + https://github.com/brendangregg/FlameGraph shows that for some crawls, 12-30% of the time is spent in add_cookie_header. It does a lot of URL parsing and clears the cookie jar too often (one component of which is a slow deepcopy).
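A rough way to see the same effect without pyflame is to time add_cookie_header directly. The sketch below is a minimal illustration using only the standard library's http.cookiejar, not wpull's own cookie wrapper, so it does not include the deepcopy mentioned above. The helper names (make_cookie, build_jar, seconds_per_call) and the site*.example domains are made up for this example. It assumes the cost comes from the stdlib path, where add_cookie_header also clears expired cookies by walking the whole jar.

```python
import time
import urllib.request
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, domain):
    # Hypothetical helper: a non-expiring version-0 session cookie.
    return Cookie(
        version=0, name=name, value="v", port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True, secure=False, expires=None,
        discard=True, comment=None, comment_url=None, rest={},
    )


def build_jar(n_domains):
    # One cookie per made-up domain, to mimic a crawl that touches many hosts.
    jar = CookieJar()
    for i in range(n_domains):
        jar.set_cookie(make_cookie("session", "site%d.example" % i))
    return jar


def seconds_per_call(jar, url, n_requests=200):
    # Average wall-clock time of add_cookie_header for a fresh request each time.
    start = time.perf_counter()
    for _ in range(n_requests):
        request = urllib.request.Request(url)
        jar.add_cookie_header(request)
    return (time.perf_counter() - start) / n_requests


if __name__ == "__main__":
    for n in (10, 1000):
        cost = seconds_per_call(build_jar(n), "http://site0.example/page")
        print("%5d domains: %.1f us per add_cookie_header call" % (n, cost * 1e6))
```

Comparing the two printed figures gives a sense of how the per-request cost changes with jar size. The numbers depend on the machine and Python version, so they are only a relative indication, and a flamegraph of a real crawl remains the better evidence.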

Note: with https://github.com/uber/pyflame/pull/153 applied, the configure step for pyflame is

PKG_CONFIG_PATH=~/.pyenv/versions/3.7.0/lib/pkgconfig ./configure

flamegraphs.zip