datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

Handle connection break down #459

Closed Christovis closed 3 years ago

Christovis commented 3 years ago

Even a single mailing list in the 3GPP or IEEE archives can be very large (e.g. 3GPP_TSG_GERAN_WG1 contains > 4k messages and can take ~1h to scrape), so the server connection can break down before the crawl has finished, resulting in an error such as:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='list.etsi.org', port=443): Max retries exceeded with url: /scripts/wa.exe?A2=ind0103&L=3GPP_TSG_GERAN_WG2&O=D&P=3211 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f24016bea90>: Failed to establish a new connection: [Errno -2] Name or service not known'))

Such errors should ideally be caught, and the messages already retrieved should be saved.
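A minimal sketch of the behaviour requested here, not the fix that was eventually merged. The function name `crawl_with_checkpoint`, the injected `fetch` callable, and the JSON output format are all hypothetical and not part of bigbang. It relies on the fact that `requests.exceptions.ConnectionError` derives from `OSError` (via `RequestException`), so catching `OSError` also covers the error shown above without importing `requests` in the sketch itself.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def crawl_with_checkpoint(
    urls: Iterable[str],
    fetch: Callable[[str], str],  # hypothetical: e.g. lambda u: requests.get(u).text
    out_path: Path,
) -> Tuple[List[dict], bool]:
    """Fetch each URL in turn and keep partial results if the connection drops.

    Returns (messages, completed); completed is False if the crawl was cut short.
    """
    messages: List[dict] = []
    completed = True
    try:
        for url in urls:
            messages.append({"url": url, "body": fetch(url)})
    except OSError as err:
        # requests.exceptions.ConnectionError is a subclass of OSError,
        # so a dropped connection ends up here instead of aborting the run.
        completed = False
        print(f"Connection lost after {len(messages)} messages: {err}")
    finally:
        # Always write what was retrieved, whether or not the crawl finished.
        out_path.write_text(json.dumps(messages, indent=2), encoding="utf-8")
    return messages, completed
```

The caller can inspect the `completed` flag and the number of saved messages to decide whether to resume the crawl later from the first URL that was not retrieved.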

Christovis commented 3 years ago

Issue solved with #462