anirvan / yahoo-group-archive-tools

Converts a Yahoo group archive created by yahoo-group-archiver into standalone email, mbox folders, and PDF files
MIT License

mbox opened with UTF-8 encoding but contains invalid characters #6

Open anirvan opened 4 years ago

anirvan commented 4 years ago

I can't make a final judgement because the import of the .mbox file into kmail (v5.10.3) choked on the 4793rd email (out of 7615). When I try to open the .mbox file in a text editor (kate), I get the error "The file .mbox was opened with UTF-8 encoding but contained invalid characters." Examining the .mbox file, it seems as if all the messages are contained in it; but I'm guessing there are some invalid characters preventing the import from completing?

Originally posted by @jnew-gh in https://github.com/anirvan/yahoo-group-archive-tools/issues/2#issuecomment-566259511

anirvan commented 4 years ago

I don't fully understand this, given that the code's already deleting every single non-7-bit character from the raw email message. Maybe this is an issue with an invalid Unicode character in the original email message as sent by the list participant? If that were the case, I don't think it's appropriate for this script to try to correct it.
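One way to test that assumption: pure 7-bit ASCII is always valid UTF-8, so if the editor complains, at least one byte above 0x7F has presumably survived somewhere in the file. Below is a minimal sketch that lists such bytes. It is not part of this tool; the function names `find_non_ascii` and `scan_file` are made up for illustration, and you supply the path to the .mbox file yourself.

```python
from typing import List, Tuple


def find_non_ascii(data: bytes) -> List[Tuple[int, int, int]]:
    """Return (line_number, column, byte_value) for every byte above 0x7F.

    Line numbers and columns are 1-based; columns count bytes, not characters.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((line_no, col, byte))
    return hits


def scan_file(path: str) -> List[Tuple[int, int, int]]:
    """Read a file as raw bytes (no decoding) and report its non-ASCII bytes."""
    with open(path, "rb") as fh:
        return find_non_ascii(fh.read())
```

Calling `scan_file` on the .mbox file and getting an empty list back would suggest the file really is 7-bit clean and the problem lies elsewhere; any hits would point at the lines worth inspecting.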

But we should find this out. @jnew-gh, can you take a look at the 4793rd email in your list, and see what's going on with that?

If you run yahoo-group-archive-tools with the --noisy option, it'll say something like:

message 5000: wrote email at /somewhere/email/5000.eml (4793 of 7615)

Once you know the message ID (in this example, 5000), you can look at the associated .eml file. Is there something funky going on there?
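If eyeballing the file doesn't turn anything up, a short script can pinpoint where UTF-8 decoding first fails. This is only a sketch under my own assumptions: `first_invalid_utf8` is a hypothetical helper, not something this tool provides, and it expects the raw bytes of the .eml file (for example, read with `open(path, "rb").read()`).

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes, context: int = 20) -> Optional[Tuple[int, bytes]]:
    """Return (offset, surrounding_bytes) for the first invalid UTF-8 sequence.

    Returns None when the data decodes cleanly as UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset where decoding failed.
        start = max(err.start - context, 0)
        return err.start, data[start:err.start + context]
    return None
```

The returned bytes give a little context on either side of the failure, which should be enough to tell whether the offending byte sits in a header or in the message body.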

Feel free to paste the headers as a comment, redacting any private bits. Thank you!

P.S. In addition to kmail, could you try loading the mailbox using another mbox-friendly mail client, e.g. mutt, Thunderbird, Apple Mail, etc.? I'm curious whether this is a kmail-specific issue.