Queens-Hacks / qcumber-scraper

Scrapes SOLUS and generates structured data

Socket error #17

Closed: Graham42 closed this issue 9 years ago

Graham42 commented 9 years ago

This occurred when doing a scrape of just the letter C with 1 thread. The full log is here: http://pastebin.com/s3M9C5nN
The scrape died at:

INFO:root:----Course: 866 - Supramolecular Chemistry
INFO:root:------Term: 2013 - Fall
INFO:root:--------Section: 8360-LEC (001) -- Open

Traceback (most recent call last):
  File "/home/graham/.virtualenvs/qscraper/lib/python3.3/site-packages/requests/packages/urllib3/connectionpool.py", line 471, in urlopen
    body=body, headers=headers)
  File "/home/graham/.virtualenvs/qscraper/lib/python3.3/site-packages/requests/packages/urllib3/connectionpool.py", line 285, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib64/python3.3/http/client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python3.3/http/client.py", line 1099, in _send_request
    self.endheaders(body)
  File "/usr/lib64/python3.3/http/client.py", line 1057, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python3.3/http/client.py", line 902, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.3/http/client.py", line 840, in send
    self.connect()
  File "/home/graham/.virtualenvs/qscraper/lib/python3.3/site-packages/requests/packages/urllib3/connection.py", line 73, in connect
    timeout=self.timeout,
  File "/usr/lib64/python3.3/socket.py", line 417, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
socket.gaierror: [Errno -2] Name or service not known

pR0Ps commented 9 years ago

Unless it's reliably reproducible, it looks like a temporary glitch with your connection or something. Can you reproduce it?

That said, the scraper definitely shouldn't throw an exception and die when the internet glitches for a second. I'm not sure exactly what it should do (retry a few times, then skip it?), but it shouldn't crash.

Graham42 commented 9 years ago

Examining this log further, it seems I missed that the actual cause is further down the stack trace, and it looks related to some work I was doing. I think I fixed this with 62871d206ce17f864250a80284c836f3798b2845:

  File "/home/graham/dev/qcumber/qcumber-scraper/parser.py", line 52, in dump_html
    f.write(self.soup.prettify().encode("utf-8"))
TypeError: must be str, not bytes
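
For reference, the fix boils down to matching the file mode to the data. A minimal sketch (hypothetical, not the exact diff from that commit):

def dump_html(self, path):
    # prettify() returns str; .encode("utf-8") produces bytes,
    # so the file has to be opened in binary mode ("wb").
    with open(path, "wb") as f:
        f.write(self.soup.prettify().encode("utf-8"))

    # Equivalent text-mode alternative: drop the .encode() and let
    # open() handle the encoding instead:
    # with open(path, "w", encoding="utf-8") as f:
    #     f.write(self.soup.prettify())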

However, looking through the logs of another full scrape I did, several threads died with:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='saself.ps.queensu.ca', port=443): Max retries exceeded with url: /psc/saself/EMPLOYEE/HRMS/c/SA_LEARNER_SERVICES.SSS_BROWSE_CATLG_P.GBL (Caused by <class 'ConnectionResetError'>: [Errno 104] Connection reset by peer)

I think this might be a case we want to handle: if the connection is lost or reset, maybe sleep and then retry?
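
Something along these lines, as a rough sketch (the helper name and constants here are hypothetical, not code from the scraper):

import time
import requests

MAX_RETRIES = 3   # attempts before giving up on a request
RETRY_DELAY = 5   # seconds to sleep between attempts

def get_with_retry(session, url, **kwargs):
    """Retry a GET on connection errors, sleeping between attempts."""
    for attempt in range(MAX_RETRIES):
        try:
            return session.get(url, **kwargs)
        except requests.exceptions.ConnectionError:
            # Covers "Connection reset by peer" and DNS blips like the
            # gaierror above; give the network a moment and try again.
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(RETRY_DELAY)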

mystor commented 9 years ago

Based on the fact that it's complaining about str vs. bytes, I think it has to do with the version of Python you're using. I run my scrapes with Python 3.3, but I should also try other versions of Python (or we should choose a canonical one).

Graham42 commented 9 years ago

So I added retries with sleeping here: https://github.com/Queens-Hacks/qcumber-scraper/compare/retry-requests
If you like it, I can merge it into master.

mystor commented 9 years ago

Maybe make MAX_RETRIES a config option?
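
For example (hypothetical config module and names, to fit however the scraper's existing config works):

# config.py: network retry settings in one place
MAX_RETRIES = 3   # attempts before a request is given up on
RETRY_DELAY = 5   # seconds to sleep between attempts

# The scraper code would then import these instead of hardcoding:
# from config import MAX_RETRIES, RETRY_DELAY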