corona-zahlen-landkreis / corona_landkreis_fallzahlen_scraping

Scraping the websites of Germany's local districts for newer corona case numbers!
GNU General Public License v3.0

python3 CacheControl's filecache not working #45

Open debugger-zz opened 4 years ago

debugger-zz commented 4 years ago

I've added support in scrape.py for caching URLs fetched via the scrape.request_url function.

Sadly, HTTP responses are saved to landkreise/data/.webcache/ for only a limited number of URLs.

In case of problems this cache might have to be cleared!

Ultimately, the cache should be used and filled by every call to request_url, to reduce the load on the scraped websites.

If you start a scraper with debug output, you cannot see if the cache was hit:

SCRAPER_DEBUG=yes get-somekreis.py
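For illustration only, here is a minimal sketch of what a file cache with hit/miss debug output could look like. It is not the project's scrape.py and not CacheControl's implementation: the names cached_fetch and fetch, and the one-file-per-hashed-URL layout, are assumptions made up for this example. Only the SCRAPER_DEBUG=yes switch is taken from the command above.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cached_fetch(url, fetch, cache_dir, debug=None):
    """Return the body for url, reading it from cache_dir when present.

    fetch is any callable that takes a URL and returns bytes; it is a
    hypothetical stand-in for the real HTTP call.
    """
    if debug is None:
        # Mirror the SCRAPER_DEBUG=yes switch used to start a scraper.
        debug = os.environ.get("SCRAPER_DEBUG") == "yes"

    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One file per URL, named by a hash so that any URL maps to a valid filename.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    if path.exists():
        if debug:
            print(f"cache HIT  {url}")
        return path.read_bytes()

    if debug:
        print(f"cache MISS {url}")
    body = fetch(url)
    path.write_bytes(body)
    return body


if __name__ == "__main__":
    def fake_fetch(url):
        # Pretend to download a page (assumption: no real network access).
        return b"<html>example page</html>"

    with tempfile.TemporaryDirectory() as tmp:
        # The first call misses and fills the cache, the second one hits it.
        cached_fetch("https://example.org/kreis", fake_fetch, tmp, debug=True)
        cached_fetch("https://example.org/kreis", fake_fetch, tmp, debug=True)
```

This sketch never expires entries, so it only shows where a hit/miss message would be printed. A real cache also needs a freshness rule.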

I've started working on CacheControl to add debug output and fix the problem. If someone knows a better replacement, please speak up.

My patched file_cache.py with debug output: https://github.com/corona-zahlen-landkreis/corona_landkreis_fallzahlen_scraping/blob/master-anaylse-cachecontrol-bug/landkreise/file_cache.py

debugger-zz commented 4 years ago

I think I've improved or fixed this, by adding a caching heuristic.

dadosch commented 4 years ago

> I think I've improved or fixed this, by adding a caching heuristic.

Does it use the Last-Modified header from the server, or does it cache the response so that no request is even made?
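To make the two options in this question concrete, here is a minimal sketch of both kinds of heuristic. Which one the fix actually uses is not stated in this thread. The function names are made up for illustration, and the 10% rule with a one-day cap is the commonly used heuristic for Last-Modified, assumed here rather than taken from the patch.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def fixed_lifetime(headers, hours=1):
    """Option A: treat every response as fresh for a fixed time.

    While the entry is fresh, the cached copy is returned and no request
    is sent at all. The headers argument is ignored on purpose.
    """
    return timedelta(hours=hours)


def last_modified_lifetime(headers, now=None):
    """Option B: derive the freshness lifetime from Last-Modified.

    A page is considered fresh for a tenth of the time that has passed
    since it was last modified, capped at one day. Without the header
    the lifetime is zero, so the page is requested every time.
    """
    value = headers.get("Last-Modified")
    if value is None:
        return timedelta(0)
    now = now or datetime.now(timezone.utc)
    age = now - parsedate_to_datetime(value)
    return max(timedelta(0), min(age / 10, timedelta(days=1)))


if __name__ == "__main__":
    headers = {"Last-Modified": "Wed, 01 Apr 2020 00:00:00 GMT"}
    now = datetime(2020, 4, 1, 10, 0, tzinfo=timezone.utc)
    print(fixed_lifetime(headers))
    print(last_modified_lifetime(headers, now=now))
```

With option A the server is not contacted until the fixed time has passed. With option B a page that changed recently is re-requested sooner. In both cases, once the lifetime is over, the client can still send a conditional request with If-Modified-Since, so that an unchanged page is answered with 304 and not transferred again.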