datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

Unable to import mailman archives because of unicode issues #250

Closed: nllz closed this issue 2 years ago

nllz commented 8 years ago

I tried to mass-import all public IETF mailing lists with:

$ python bin/collect_mail.py -f mm.ietf.org.txt

but got:

$ python bin/collect_mail.py -f ~/pwnCloud/Phd/mm.ietf.org.txt
'Getting archive page for 16ng'
['2006-07.mail',
 '2006-08.mail',
 '2006-09.mail',
 '2006-10.mail',
 '2006-11.mail',
 '2006-12.mail',
 '2006-2005.mail',
 '2007-01.mail',
 '2007-02.mail',
 '2007-03.mail',
 '2007-04.mail',
 '2007-05.mail',
 '2007-06.mail',
 '2007-07.mail',
 '2007-08.mail',
 '2007-09.mail',
 '2007-10.mail',
 '2007-11.mail',
 '2007-12.mail',
 '2008-01.mail',
 '2008-02.mail',
 '2008-03.mail',
 '2008-04.mail',
 '2008-05.mail',
 '2008-06.mail',
 '2008-07.mail',
 '2008-08.mail',
 '2008-09.mail',
 '2008-10.mail',
 '2008-11.mail',
 '2008-12.mail',
 '2009-01.mail',
 '2009-02.mail',
 '2009-03.mail',
 '2009-04.mail',
 '2009-05.mail',
 '2009-06.mail',
 '2009-07.mail',
 '2009-08.mail',
 '2009-09.mail',
 '2009-10.mail',
 '2009-11.mail',
 '2010-06.mail',
 '2010-07.mail',
 '2010-08.mail',
 '2012-11.mail']
unzipping 0 archive files
Opening 46 archive files
Traceback (most recent call last):
  File "bin/collect_mail.py", line 41, in <module>
    main(sys.argv[1:])
  File "bin/collect_mail.py", line 38, in main
    mailman.collect_from_file(arg)
  File "/home/gogol/Data/bigbang/bigbang/mailman.py", line 94, in collect_from_file
    collect_from_url(url)
  File "/home/gogol/Data/bigbang/bigbang/mailman.py", line 82, in collect_from_url
    data = open_list_archives(url)
  File "/home/gogol/Data/bigbang/bigbang/mailman.py", line 239, in open_list_archives
    return messages_to_dataframe(messages)
  File "/home/gogol/Data/bigbang/bigbang/mailman.py", line 284, in messages_to_dataframe
    for m in messages if m.get('Message-ID')]
  File "/home/gogol/Data/bigbang/bigbang/mailman.py", line 252, in get_text
    text = unicode(part.get_payload(decode=True), str(charset), "ignore")
LookupError: unknown encoding: x-unknown

even though it's a Mailman archive.

nllz commented 8 years ago

This is probably because the IETF discontinued Pipermail in 2009 and its current archives use MHonArc, which might not be supported (yet) by BigBang?

sbenthall commented 8 years ago

That sounds possible. I've never heard of MHonArc.

But the error you are getting is a Unicode decoding error, which unfortunately is a major headache and could be due to some strange encoding in a particular email in a particular archive.
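The traceback above ends in `LookupError: unknown encoding: x-unknown`, which suggests the charset name declared in the message is one Python has no codec for, so the `"ignore"` error handler is never reached. A minimal sketch of a more forgiving decode step follows, assuming Python 3. The helper `safe_decode` and its UTF-8 fallback are hypothetical illustrations, not part of BigBang.

```python
def safe_decode(payload, charset, fallback="utf-8"):
    """Decode raw message bytes, tolerating unknown or missing charset names.

    Hypothetical helper for illustration; not part of BigBang.
    """
    if payload is None:
        # get_payload(decode=True) returns None for multipart containers.
        return ""
    try:
        # An unrecognised name such as "x-unknown" raises LookupError here.
        return payload.decode(str(charset), errors="replace")
    except LookupError:
        # Fall back to an assumed default instead of aborting the import.
        return payload.decode(fallback, errors="replace")
```

With `errors="replace"`, bytes that cannot be decoded show up as U+FFFD rather than disappearing silently, which makes damaged messages easier to spot later.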

To debug this, the best thing would be to isolate the problematic data as much as possible. It looks like the script is working for most of the data download. It's a problem with parsing.

Is it a problem that happens with every archive, or just some of them?

If you can send me a sample (e.g. 2012-11.mail ) I can try to test it locally.
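One way to narrow down the problematic data is to list every message whose declared charset is not a codec Python recognises. The sketch below assumes Python 3 and already-parsed `email.message.Message` objects. The function name `find_unknown_charsets` is hypothetical and not part of BigBang.

```python
import codecs


def find_unknown_charsets(messages):
    """Return (Message-ID, charset) pairs whose declared charset has no codec.

    Hypothetical helper for illustration; not part of BigBang.
    """
    problems = []
    for msg in messages:
        # walk() visits the message itself and every nested MIME part.
        for part in msg.walk():
            charset = part.get_content_charset()
            if charset is None:
                continue  # no charset declared on this part
            try:
                codecs.lookup(charset)
            except LookupError:
                problems.append((msg.get("Message-ID"), charset))
    return problems
```

If the downloaded `.mail` file is in mbox format (an assumption, not verified here), the messages could be supplied with `mailbox.mbox("2012-11.mail", create=False)` from the standard library.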

nllz commented 8 years ago

You're totally right, it only happens with some of the archives. This is the problematic sample: https://www.ietf.org/mail-archive/text/16ng/2012-11.mail

Might it be an idea to make a slight change to the mass-import script so that it does not stop when it encounters a problem, but instead continues importing and reports at the end which mailing lists it wasn't able to import?
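The suggestion above could look roughly like the following sketch, assuming Python 3. The names `collect_all` and `format_report` are hypothetical. `collect_one` stands in for whatever function imports a single list. The traceback shows `mailman.collect_from_url(url)` being called with one URL, so that may be a suitable candidate, but this is an assumption.

```python
def collect_all(urls, collect_one):
    """Call collect_one(url) for every URL and keep going after failures.

    Returns a dict mapping each failed URL to a short error description.
    Hypothetical helper for illustration; not part of BigBang.
    """
    failures = {}
    for url in urls:
        try:
            collect_one(url)
        except Exception as exc:
            # Deliberately broad: one bad list should not stop the whole run.
            failures[url] = "{}: {}".format(type(exc).__name__, exc)
    return failures


def format_report(failures):
    """Build the end-of-run summary of lists that could not be imported."""
    if not failures:
        return "All lists imported."
    lines = ["{} list(s) could not be imported:".format(len(failures))]
    for url, error in sorted(failures.items()):
        lines.append("  {} ({})".format(url, error))
    return "\n".join(lines)
```

Catching `Exception` this broadly trades precision for robustness. The stored error text is what lets someone go back and investigate each failed list afterwards.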

nllz commented 8 years ago

I get the same problem while importing the ICANN mailing lists: https://raw.githubusercontent.com/nllz/bigbang/master/examples/mm.icann.org.txt . I think we should try to find a structural solution, because fixing this by hand would be too much work, imho, if we want to analyze a lot of mailing lists.

hargup commented 8 years ago

Running this with Python 3 should fix things. Are there any Python 2-only libraries that BigBang depends on?

sbenthall commented 8 years ago

Good point, @hargup. See #226, the Python 3 upgrade ticket.

I think the Python 3 upgrade is out of scope for the 0.2 release. So I'm going to punt this from the current milestone.