datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

Improve resilience and stability of BigBang: programmatically tackle errors from the experience of scraping IGF and 3GPP lists #472

Closed: maxigas closed this 3 years ago

maxigas commented 3 years ago

R wrote: "I made a list of IGF Mailman archives (had to leave out many closed ones and some DCs that use Google Groups). However, when I scrape them I get an error message that says it cannot find a mailing list name..."

Meanwhile, C is churning through the 3GPP lists. Some of them are really large, and some run on custom or outsourced infrastructure (people's own servers or Google Groups). Even the lists that are merely large cause BigBang to bail out with errors, which should be documented and tracked down.

All these experiences allow us to improve the resilience and stability of BigBang.

maxigas commented 3 years ago

The current approach is to store the text/plain version of the messages, like this one: https://list.etsi.org/scripts/wa.exe?A2=ind2106A&L=3GPP_TSG_CT_WG6&O=D&P=3926. Some of these go back to 1998.

Christovis commented 3 years ago

The main error that causes the scraping of some mailing lists to fail is:

File "bigbang/bigbang/listserv.py", line 160, in from_url
    header = self._get_header_from_html(soup)
File "bigbang/bigbang/listserv.py", line 279, in _get_header_from_html
    text=re.compile(r"^\bSubject\b"),
AttributeError: 'NoneType' object has no attribute 'parent'

This appeared several times for the 3GPP_TSG_RAN_WG5_EMEET mailing list, each time for different messages, and on one run the error did not appear at all, letting me scrape the whole list without a problem. So the failure is intermittent rather than tied to specific messages. I propose a simple try: [...] except: [...] workaround until we find a better solution.
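A minimal sketch of what such a workaround could look like, not the change that was actually merged. `scrape_messages`, `parse_message` and `max_attempts` are hypothetical names: `parse_message` stands in for whatever ends up calling `from_url`, and the retry loop is an assumption motivated by the error being intermittent.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def scrape_messages(
    urls: Iterable[str],
    parse_message: Callable[[str], dict],  # hypothetical stand-in for the real parser
    max_attempts: int = 3,                 # assumption: retrying helps, since the error comes and goes
) -> Tuple[List[dict], List[str]]:
    """Parse every URL; record the ones that keep failing instead of aborting."""
    messages: List[dict] = []
    failed: List[str] = []
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                messages.append(parse_message(url))
                break  # parsed successfully, move on to the next URL
            except AttributeError as err:
                # e.g. "'NoneType' object has no attribute 'parent'"
                logger.warning(
                    "attempt %d/%d failed for %s: %s",
                    attempt, max_attempts, url, err,
                )
        else:
            # the inner loop never hit `break`: every attempt failed
            failed.append(url)
    return messages, failed
```

Catching only `AttributeError` keeps unrelated problems visible, and returning the failed URLs gives a list that can be documented and tracked down later instead of being silently dropped.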

Christovis commented 3 years ago

Fixed with #476