Closed by maxigas 3 years ago
The current approach is to store the text/plain version of the messages, such as this one: https://list.etsi.org/scripts/wa.exe?A2=ind2106A&L=3GPP_TSG_CT_WG6&O=D&P=3926 (some of these go back to 1998).
The main error that causes scraping of some mailing lists to fail is:
  File "bigbang/bigbang/listserv.py", line 160, in from_url
    header = self._get_header_from_html(soup)
  File "bigbang/bigbang/listserv.py", line 279, in _get_header_from_html
    text=re.compile(r"^\bSubject\b"),
AttributeError: 'NoneType' object has no attribute 'parent'
This appeared several times for the 3GPP_TSG_RAN_WG5_EMEET mailing list, each time for a different message. On one run the error did not appear at all, and I could scrape the whole list without a problem.
I propose a simple try: [...] except: [...] workaround until we find a better solution.
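A minimal sketch of what such a workaround could look like, not the actual BigBang code: the names scrape_messages and parse_message are hypothetical, and parse_message stands in for whatever call ends up raising the AttributeError above. The idea is to skip a message whose header cannot be parsed, log it, and keep the failed URLs so they can be retried or inspected later.

```python
import logging
from typing import Callable, Iterable, List, Tuple, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar("T")


def scrape_messages(
    urls: Iterable[str],
    parse_message: Callable[[str], T],
) -> Tuple[List[T], List[str]]:
    """Parse every URL, skipping the ones whose parsing fails.

    `parse_message` is a hypothetical stand-in for the real per-message
    parser; it is an assumption, not part of BigBang's API.
    Returns the parsed messages and the list of URLs that failed.
    """
    parsed: List[T] = []
    failed: List[str] = []
    for url in urls:
        try:
            parsed.append(parse_message(url))
        except AttributeError as err:
            # Typically: 'NoneType' object has no attribute 'parent',
            # i.e. the expected header field was not found in the page.
            logger.warning("Skipping %s: %s", url, err)
            failed.append(url)
    return parsed, failed
```

Catching only AttributeError (rather than a bare except) keeps unrelated failures such as network errors visible, and the returned list of failed URLs gives us something concrete to track down.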
Fixed with #476
R wrote: "I made a list of IGF Mailman archives (had to leave out many closed ones and some DCs that use Google Groups). However, when I scrape them I get an error message that says it cannot find a mailing list name..."
Meanwhile, C is churning through the 3GPP lists. Some of them are really large, and some run on custom or outsourced infrastructure (people's own servers or Google Groups). Even the ones that are merely large cause BigBang to bail out with errors, which should be documented and tracked down.
All these experiences allow us to improve the resilience and stability of BigBang.