Graham42 closed this issue 9 years ago
Unless it's reliably reproducible, it looks like a temporary glitch with your connection or something. Can you reproduce it?
With that being said, the scraper should definitely not throw an exception and die when the internet glitches for a second. I'm not sure exactly what it should do (retry a few times then skip it?), but it shouldn't crash.
Examining this log further, it seems I missed that the actual cause is further down the stack trace, and it looks related to some work I was doing. I think I fixed this in 62871d206ce17f864250a80284c836f3798b2845
```
  File "/home/graham/dev/qcumber/qcumber-scraper/parser.py", line 52, in dump_html
    f.write(self.soup.prettify().encode("utf-8"))
TypeError: must be str, not bytes
```
However, looking through the logs of another full scrape I did, several threads died with:
```
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='saself.ps.queensu.ca', port=443): Max retries exceeded with url: /psc/saself/EMPLOYEE/HRMS/c/SA_LEARNER_SERVICES.SSS_BROWSE_CATLG_P.GBL (Caused by <class 'ConnectionResetError'>: [Errno 104] Connection reset by peer)
```
I think this might be a case we want to handle: if the connection is lost or reset, maybe sleep and then retry?
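A minimal sketch of that sleep-and-retry idea (names like `fetch_with_retry`, `MAX_RETRIES`, and `RETRY_DELAY` are made up for illustration, not taken from the scraper):

```python
import time

MAX_RETRIES = 3      # hypothetical constant; could come from the scraper config
RETRY_DELAY = 2.0    # seconds to sleep between attempts

def fetch_with_retry(fetch, retries=MAX_RETRIES, delay=RETRY_DELAY):
    """Call fetch(); on a connection error, sleep and retry.

    Returns None after exhausting retries so the caller can skip the
    item instead of letting the whole scraper thread die.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except ConnectionError:
            # requests raises requests.exceptions.ConnectionError, which
            # wraps errors like "Connection reset by peer".
            if attempt == retries:
                return None  # give up on this item rather than crash
            time.sleep(delay)

# Demo: a fake fetch that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("reset by peer")
    return "page html"

print(fetch_with_retry(flaky, delay=0.01))  # prints "page html"
```

Whether to skip the item or re-raise after the last attempt is a policy choice; skipping keeps the scrape going at the cost of a few missing entries.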
Based on the fact that it's complaining about `str` vs `bytes`, I think it has to do with the version of Python you are using. I run with Python 3.3 for my scrapes, but I should also try other versions of Python (or we should choose a canonical one).
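For reference, this `TypeError` is what Python 3 raises when encoded bytes are written to a file opened in text mode. A sketch of both possible fixes, using a plain string in place of `self.soup.prettify()` (the real `dump_html` may open the file differently):

```python
from pathlib import Path

html = "<p>example</p>"  # stands in for self.soup.prettify()
out = Path("dump.html")

# In Python 3, a file opened in text mode ("w") only accepts str, so
# f.write(html.encode("utf-8")) raises: TypeError: must be str, not bytes.

# Fix 1: keep the .encode() call but open the file in binary mode.
with out.open("wb") as f:
    f.write(html.encode("utf-8"))

# Fix 2: open in text mode with an explicit encoding and write str directly.
with out.open("w", encoding="utf-8") as f:
    f.write(html)
```

Fix 2 is usually the more idiomatic Python 3 style, and passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding.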
So I added retries with sleeping here: https://github.com/Queens-Hacks/qcumber-scraper/compare/retry-requests
If you like it, I can merge it into master.
Maybe make MAX_RETRIES a config option?
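One way `MAX_RETRIES` could be made configurable, sketched with stdlib `configparser` (the section and key names here are hypothetical; the scraper's real config format may differ):

```python
import configparser
from io import StringIO

# Hypothetical config fragment; in practice this would live in a file.
sample = """
[scraper]
max_retries = 5
retry_delay = 2.0
"""

config = configparser.ConfigParser()
config.read_file(StringIO(sample))

# Fall back to sensible defaults when the option is absent.
MAX_RETRIES = config.getint("scraper", "max_retries", fallback=3)
RETRY_DELAY = config.getfloat("scraper", "retry_delay", fallback=2.0)
print(MAX_RETRIES, RETRY_DELAY)  # prints: 5 2.0
```

The `fallback` arguments mean existing config files without the new keys keep working unchanged.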
This occurred when doing a scrape of just the letter C with 1 thread. Full log is here: http://pastebin.com/s3M9C5nN The scrape died at