datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

collect_mail.py trips over non-ASCII character #428

Closed: nllz closed this issue 3 years ago

nllz commented 3 years ago

Example of a non-ASCII character:

python3 bin/collect_mail.py -u https://ietf.org/mail-archive/text/eap/
[...]
INFO:root:200 - writing file to /home/ubuntu/bigbang/archives/eap/2012-06.mail
INFO:root:retrieving https://ietf.org/mail-archive/text/eap/2012-07.mail
INFO:root:200 - writing file to /home/ubuntu/bigbang/archives/eap/2012-07.mail
/home/ubuntu/bigbang/bigbang/mailman.py:262: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  provenance = yaml.load(file_handle)
INFO:root:Updated provenance file in /home/ubuntu/bigbang/archives/eap
INFO:root:Unzipping 0 archive files
INFO:root:Opening 126 archive files
Traceback (most recent call last):
  File "bin/collect_mail.py", line 54, in <module>
    main(args)
  File "bin/collect_mail.py", line 45, in main
    mailman.collect_from_url(args.u, notes=notes)
  File "/home/ubuntu/bigbang/bigbang/mailman.py", line 112, in collect_from_url
    data = open_list_archives(url)
  File "/home/ubuntu/bigbang/bigbang/mailman.py", line 435, in open_list_archives
    arch = [list(mailbox.mbox(txt, create=False).values()) for txt in txts]
  File "/home/ubuntu/bigbang/bigbang/mailman.py", line 435, in <listcomp>
    arch = [list(mailbox.mbox(txt, create=False).values()) for txt in txts]
  File "/usr/lib/python3.8/mailbox.py", line 119, in values
    return list(self.itervalues())
  File "/usr/lib/python3.8/mailbox.py", line 109, in itervalues
    value = self[key]
  File "/usr/lib/python3.8/mailbox.py", line 73, in __getitem__
    return self.get_message(key)
  File "/usr/lib/python3.8/mailbox.py", line 781, in get_message
    msg.set_from(from_line[5:].decode('ascii'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 25: ordinal not in range(128)
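The traceback shows the failure inside `mailbox.mbox.get_message`, which decodes the mbox "From " separator line as ASCII. One possible workaround, sketched below, is to fetch messages one key at a time and skip the ones whose separator line cannot be decoded, instead of calling `.values()` on the whole mailbox. The helper name `read_mbox_skipping_bad` is hypothetical and not part of bigbang; whether skipping (rather than repairing) such messages is acceptable is an assumption.

```python
import mailbox


def read_mbox_skipping_bad(path):
    """Read an mbox file, skipping messages that fail to decode.

    Hypothetical helper: returns (messages, skipped_keys).
    """
    box = mailbox.mbox(path, create=False)
    messages, skipped = [], []
    try:
        for key in box.iterkeys():
            try:
                # box[key] calls get_message(), where the ASCII decode happens
                messages.append(box[key])
            except UnicodeDecodeError:
                # Remember the key so the caller can log or inspect it later
                skipped.append(key)
    finally:
        box.close()
    return messages, skipped
```

With a helper like this, the list comprehension in `open_list_archives` could collect the readable messages from each file and report the skipped keys, rather than aborting on the first bad separator line.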

nllz commented 3 years ago

example of tzdata: python3 bin/collect_mail.py -u https://ietf.org/mail-archive/text/ietf/ [...]

INFO:root:200 - writing file to /home/ubuntu/bigbang/archives/ietf/2021-02.mail
INFO:root:retrieving https://ietf.org/mail-archive/text/ietf/2021-03.mail
INFO:root:200 - writing file to /home/ubuntu/bigbang/archives/ietf/2021-03.mail
/home/ubuntu/bigbang/bigbang/mailman.py:262: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  provenance = yaml.load(file_handle)
INFO:root:Updated provenance file in /home/ubuntu/bigbang/archives/ietf
INFO:root:Unzipping 0 archive files
INFO:root:Opening 272 archive files
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname WET identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname EDT identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname PST identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname CDT identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname EST identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname JST identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname MDT identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname PDT identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname IST identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname UT identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname CET identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname MET identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname MST identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname CST identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
/home/ubuntu/anaconda3/envs/bigbang/lib/python3.7/site-packages/dateutil/parser/_parser.py:1218: UnknownTimezoneWarning: tzname DST identified but not understood. Pass tzinfos argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
offset must be a timedelta strictly between -timedelta(hours=24) and timedelta(hours=24).
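The final line of that log matches the `ValueError` message that Python's `datetime.timezone` raises when it is given an offset of 24 hours or more. The log does not show where in bigbang the offset comes from, so the following is only a sketch under the assumption that a malformed date header yields an impossible UTC offset. The helpers `safe_timezone` and `attach_offset` are hypothetical names, and falling back to a naive datetime is an illustrative choice, not the project's behavior.

```python
from datetime import datetime, timedelta, timezone

# datetime.timezone only accepts offsets strictly inside (-24h, +24h)
MAX_OFFSET = timedelta(hours=24)


def safe_timezone(offset):
    """Return a timezone for `offset`, or None if it is out of range."""
    if -MAX_OFFSET < offset < MAX_OFFSET:
        return timezone(offset)
    return None


def attach_offset(naive, offset):
    """Attach `offset` to a naive datetime, leaving it naive if invalid."""
    tz = safe_timezone(offset)
    return naive.replace(tzinfo=tz) if tz is not None else naive


stamp = datetime(2021, 3, 1, 12, 0, 0)
ok = attach_offset(stamp, timedelta(hours=-5))   # timezone-aware result
bad = attach_offset(stamp, timedelta(hours=99))  # stays naive, no exception
```

The `UnknownTimezoneWarning` lines are a separate matter: as the warning text itself says, dateutil's parser accepts a `tzinfos` argument that maps abbreviations such as EST or CET to offsets.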

sbenthall commented 3 years ago

https://stackoverflow.com/questions/10406135/unicodedecodeerror-ascii-codec-cant-decode-byte-0xd1-in-position-2-ordinal?rq=1

sbenthall commented 3 years ago

https://github.com/jay0lee/got-your-back/issues/222

sbenthall commented 3 years ago

https://stackoverflow.com/questions/37890123/how-to-trap-an-exception-that-occurs-in-code-underlying-python-for-loop

nllz commented 3 years ago

https://stackoverflow.com/questions/37890123/how-to-trap-an-exception-that-occurs-in-code-underlying-python-for-loop

Key seems to be:

text = data.decode(encoding="utf-8", errors="replace")

Isn't it?
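To make the suggestion concrete, here is a small sketch of how the three decoding choices behave on a byte string containing 0x97, the byte from the traceback. The sample bytes are invented for illustration, and where such a decode call would live in bigbang is not settled in this thread. Note that `errors="replace"` avoids the exception but turns the byte into U+FFFD, so the original character is lost; guessing Windows-1252 as the source encoding is an assumption.

```python
# Invented sample: a header-like line with a stray 0x97 byte
raw = b"From: J\x97K <jk@example.org>"

# Strict ASCII decoding fails, as in the traceback
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# The suggestion from the thread: never raises, marks bad bytes with U+FFFD
lenient = raw.decode(encoding="utf-8", errors="replace")

# Assumption: if the archive was written as Windows-1252, 0x97 is U+2014
guessed = raw.decode("cp1252")

print(ascii_failed)    # True
print(repr(lenient))   # 'From: J\ufffdK <jk@example.org>'
print(repr(guessed))   # 'From: J\u2014K <jk@example.org>'
```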