UEWBot / dipvis

Django-based visualiser for tournaments for the boardgame Diplomacy
GNU General Public License v3.0

Scrape other sites more efficiently #246

Closed UEWBot closed 1 year ago

UEWBot commented 1 year ago

When reading from WDD, Wikipedia, Backstabbr, and webDiplomacy, we should ensure that we do so as efficiently as possible.

All four sites send headers saying not to cache pages, but all of them can gzip their responses:

- WDD: `'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'pragma': 'no-cache'`
- Backstabbr: `'cache-control': 'private'`
- webDiplomacy: `'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0'`
- Wikipedia: `'cache-control': 'private, s-maxage=0, max-age=0, must-revalidate'`

So client-side caching won't help here, but requesting gzipped pages will cut transfer sizes.
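As a quick illustration of why those headers rule out caching, here is a minimal sketch (the `is_cacheable` helper is hypothetical, not part of dipvis) that checks a `Cache-Control` value for the directives that forbid a shared or local cache:

```python
def is_cacheable(cache_control: str) -> bool:
    """Hypothetical helper: True if none of the non-cacheable
    directives appear in the Cache-Control header value."""
    directives = {d.strip().split("=")[0].lower()
                  for d in cache_control.split(",") if d.strip()}
    return not directives & {"no-store", "no-cache", "private"}

# The values the sites above actually send all come back False:
print(is_cacheable("no-store, no-cache, must-revalidate"))        # False
print(is_cacheable("private, s-maxage=0, max-age=0"))             # False
print(is_cacheable("public, max-age=3600"))                       # True
```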

urllib doesn't ask for gzipped pages by default, whereas requests and httplib2 both do, so migrating from urllib to requests would help.
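Until that migration happens, a stdlib-only workaround is possible: send `Accept-Encoding: gzip` explicitly and decompress the body ourselves. A rough sketch (the `fetch_gzipped` name is illustrative; requests handles all of this automatically):

```python
import gzip
import urllib.request


def fetch_gzipped(url: str) -> bytes:
    """Fetch a page, asking the server to gzip it; urllib does not
    negotiate compression itself, so we decompress manually."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return body


# Local sanity check of the decompression step, without hitting the network:
payload = b"<html>diplomacy tournament page</html>"
assert gzip.decompress(gzip.compress(payload)) == payload
```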