nllz closed this issue 2 years ago
This is probably because Pipermail was discontinued at the IETF in 2009 and the archives currently in use are MHonArc, which might not be supported (yet) by BigBang?
That sounds possible. I've never heard of MHonArc.
But the error you are getting is a unicode decoding error, which unfortunately is a major headache and could be due to some strange encoding in a particular email in a particular archive.
To debug this, the best thing would be to isolate the problematic data as much as possible. It looks like the script is working for most of the data download. It's a problem with parsing.
Is it a problem that happens with every archive, or just some of them?
If you can send me a sample (e.g. 2012-11.mail ) I can try to test it locally.
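One way to narrow this down before sending a sample is to scan the downloaded archive file as raw bytes and list the lines that fail to decode. This is only a rough sketch: `find_undecodable_lines` is a hypothetical helper, not part of BigBang, and it assumes the archive is expected to be UTF-8.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes) for every line that fails to decode.

    Hypothetical helper, not a BigBang function. The file is opened in
    binary mode so that reading it never triggers a decoding error itself.
    """
    bad = []
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                # Keep the raw bytes so the odd encoding can be inspected later.
                bad.append((number, raw))
    return bad
```

Running it on a file such as 2012-11.mail and printing the result should point at the specific lines (and so the specific email) that the parser chokes on.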
You're totally right, it only happens with several archives. This is the problem sample: https://www.ietf.org/mail-archive/text/16ng/2012-11.mail
Might it be an idea to make a slight change to the mass import script so that it does not stop when it encounters a problem, but continues importing and provides error messages at the end saying which mailinglists it wasn't able to import?
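A minimal sketch of that idea, assuming the per-list import can be wrapped in a single callable: `collect_one` below is a hypothetical stand-in for whatever the script does for one list URL, and `import_all` and `report` are illustrative names rather than existing BigBang functions.

```python
def import_all(urls, collect_one):
    """Try every mailing list URL and gather failures instead of stopping.

    `collect_one` is a hypothetical callable that imports a single list
    and raises an exception on failure.
    """
    failures = []
    for url in urls:
        try:
            collect_one(url)
        except Exception as exc:
            # Broad on purpose: one bad list should not abort the whole run.
            failures.append((url, exc))
    return failures


def report(failures):
    """Format a summary of the lists that could not be imported."""
    lines = ["%d mailing list(s) could not be imported:" % len(failures)]
    for url, exc in failures:
        lines.append("  %s: %s: %s" % (url, type(exc).__name__, exc))
    return "\n".join(lines)
```

Printing `report(import_all(urls, collect_one))` at the end of the run would give the list of problem archives in one place, which can then be debugged one by one.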
Same problem while importing ICANN mailinglists https://raw.githubusercontent.com/nllz/bigbang/master/examples/mm.icann.org.txt . I think we should try to find a structural solution, because doing this by hand would be too much work imho if we want to analyze a lot of mailinglists.
Running things with Python 3 should fix this. Are there any Python 2-only libraries that BigBang depends on?
Good point @hargup . See #226 , the Python 3 upgrade ticket.
I think the Python 3 upgrade is out of scope for the 0.2 release. So I'm going to punt this from the current milestone.
Tried to mass import all IETF public mailinglists with:
$ python bin/collect_mail.py -f mm.ietf.org.txt
but got:
even though it's a mailman archive.