LAKostis / mailman2sympa

mailman2sympa migration scripts

UnicodeDecodeError on lists with international users #8

Open jpl166 opened 3 months ago

jpl166 commented 3 months ago

When trying to migrate lists with UTF-8 characters anywhere in their configs, I get a decode error from mm2s_unpickle:

Traceback (most recent call last):
  File "./mm2s_unpickle.py", line 29, in <module>
    print(json.dumps(config_dict))
  File "/usr/lib64/python2.7/json/__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdc in position 0: invalid continuation byte

I modified mm2s_unpickle to print the raw unpickled config before the json.dumps() call, and digging around in that output I found a user as follows:

'REDACTED@gmail.com': '\xdcz\xfcc\xfchemzem'
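For reference, a small check of what those bytes are. This is only a sketch: it takes the value exactly as printed above and tries two decodings. The bytes are not valid UTF-8 (0xdc starts a two-byte sequence, but the next byte is a plain `z`), while Latin-1 reads them as "Üzüchemzem". That the value was really stored as Latin-1 (or Windows-1252) is an assumption based on this one example.

```python
# The value from the dump above, as a byte string.
raw = b'\xdcz\xfcc\xfchemzem'

# Decoding as UTF-8 fails on the very first byte, matching the traceback.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)

# Latin-1 maps every byte to a code point, so this decode cannot fail.
as_latin1 = raw.decode('latin-1')
print(repr(as_latin1))
```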

Digging further, I found hundreds of non-ASCII characters in this one list's config and user data, both in passwords and in some users' names. Other lists have such characters in their descriptions and info. All of these fail to migrate in explosive ways (tens of thousands of lines of console output VERY VERY QUICKLY). For some lists we could work around the problem by changing one subscriber's name to ASCII, migrating, and then changing it back, but that one list has 14,000 subscribers and literally hundreds of values that break the migration.

I tried adding ensure_ascii=False to the json.dumps() call in mm2s_unpickle.py and it made no difference. It appears that json.dumps() in Python 3 would just do the right thing, but under Python 3 the script can't load the mailman.bouncer module.
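One possible workaround, sketched here under assumptions rather than taken from the repository: decode every byte string in the unpickled config to text before it reaches json.dumps(), trying UTF-8 first and falling back to Latin-1. The helper name `to_text`, its `fallback` parameter, and the sample dictionary are hypothetical; only `config_dict` and the json.dumps() call come from the traceback. The sketch handles dicts, lists, tuples, and strings only, and leaves every other object unchanged.

```python
import json


def to_text(obj, fallback='latin-1'):
    """Recursively decode byte strings so json.dumps() never sees raw bytes."""
    if isinstance(obj, bytes):  # on Python 2, bytes is an alias of str
        try:
            return obj.decode('utf-8')
        except UnicodeDecodeError:
            # Assumption: anything that is not valid UTF-8 is Latin-1.
            return obj.decode(fallback)
    if isinstance(obj, dict):
        return dict((to_text(k, fallback), to_text(v, fallback))
                    for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return [to_text(item, fallback) for item in obj]
    return obj  # numbers, None, already-decoded text, anything else


# Hypothetical stand-in for the unpickled list config.
config_dict = {b'passwords': {b'REDACTED@gmail.com': b'\xdcz\xfcc\xfchemzem'}}

encoded = json.dumps(to_text(config_dict))
print(encoded)
```

Because a Latin-1 decode never fails, a value stored in some other legacy charset would come through as wrong characters instead of raising an error, so the fallback encoding is worth checking against a few known names first.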

And of course, mailman2 itself has no problems with these characters; it's just the migration tool.