'cp950' codec can't decode byte 0xb7 in position 0: incomplete multibyte sequence

MaskInLife commented 1 month ago

When I access htmlBody, I get this error in version 0.48.5.

import extract_msg

msg = extract_msg.Message(file_path)
html_body = msg.htmlBody

I saw an earlier question which mentioned that the cp950 problem would be solved in version 0.42.x, but I am currently using 0.48.5 and still have this problem.

TheElementalOfDestruction commented 1 month ago

If the traceback for the error includes RTFDE, it's because the processing in RTFDE uses Python's implementation of the codec rather than my own. There is a relevant issue on their repository: seamustuohy/RTFDE#19.

If the traceback does not include RTFDE, then I must have missed something.

MaskInLife commented 1 month ago

@TheElementalOfDestruction First of all, thank you for your reply. There is an RTFDE entry in my stack trace. I had also already found that issue while searching for others with the same problem. However, I only saw you say there that you would add some extract-msg support and use your own CP950 implementation to fix the bug. I didn't see any other solutions. `(>﹏<)′

TheElementalOfDestruction commented 1 month ago

So the support that was added was for the internal stream system that MSG files use, where streams can be in a number of encodings. The RTF body is just bytes, so it doesn't have an inherent encoding and, as such, sees no benefit from my fix. RTFDE then treats it as encoded data separately, and some sections may end up being CP950.
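To illustrate the point that raw bytes carry no inherent encoding, here is a minimal sketch using only Python's standard codecs (not extract-msg or RTFDE). The sample character is an arbitrary choice made for illustration.

```python
# The same bytes produce different text depending on which codec is applied.
raw = "中".encode("cp950")  # one character, two bytes in Python's built-in CP950

as_cp950 = raw.decode("cp950")    # read back as one Traditional Chinese character
as_cp1252 = raw.decode("cp1252")  # read as two unrelated Western European characters

print(len(raw), repr(as_cp950), repr(as_cp1252))
```

Nothing in the bytes themselves says which reading is right, which is why the decoder has to be told (or has to guess) the code page.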

Without my fix, MSG files entirely encoded in CP950 might not even open. I actually applied a patch to RTFDE that would cause it to autodetect the presence of extract-msg and use its version of cp950. Unfortunately, it was reverted because it was a bit of a shortcut and not a proper solution for RTFDE itself.

I'd recommend going to the RTFDE repository and adding a comment acknowledging the need for a fix in RTFDE, to try and get this fixed. Alternatively, you can use a similar principle to the fix I made and simply (or not so simply, given how janky it is) use some code to import RTFDE and replace the internal function that determines what codepage to use with one that replaces returns of "cp950" with "windows-950".

MaskInLife commented 1 month ago

The encoding problem is really a headache, but thank you for your answer. I will try again. Thank you! 🧡

TheElementalOfDestruction commented 1 month ago

If you want to try my really janky solution, this should basically do the trick:

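import RTFDE.text_extraction

# Keep a reference to RTFDE's original codec lookup.
_internal = RTFDE.text_extraction.get_python_codec

def get_python_codec(codepage: int):
    # Defer to RTFDE, then swap Python's cp950 for windows-950.
    temp = _internal(codepage)
    return 'windows-950' if temp.lower() == 'cp950' else temp

# Replace RTFDE's lookup with the wrapper.
RTFDE.text_extraction.get_python_codec = get_python_codec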

What it does is create a function to replace the one RTFDE uses to tell the code what codepage to use for decoding, and it specifically looks for CP950 to replace it. It's very dirty, but it should work just fine.
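For anyone who wants to check the wrapper logic in isolation, here is a small sketch of the same wrap-and-replace pattern run against a stand-in object. The `text_extraction` namespace and `_stand_in_get_python_codec` below are hypothetical and are not RTFDE's real implementation; they only assume that the lookup takes an integer codepage and returns a codec name.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the lookup being wrapped (NOT RTFDE's real code).
def _stand_in_get_python_codec(codepage: int) -> str:
    return f"cp{codepage}"

text_extraction = SimpleNamespace(get_python_codec=_stand_in_get_python_codec)

# Same pattern as the patch above: keep the original, wrap it, swap it in.
_internal = text_extraction.get_python_codec

def get_python_codec(codepage: int) -> str:
    temp = _internal(codepage)
    return "windows-950" if temp.lower() == "cp950" else temp

text_extraction.get_python_codec = get_python_codec

print(text_extraction.get_python_codec(950))   # remapped name
print(text_extraction.get_python_codec(1252))  # passed through unchanged
```

One caveat with this pattern in general: it only affects callers that look the function up through the module attribute at call time, so it should be applied before any message is processed.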

MaskInLife commented 1 month ago

Oh! Thank you for sharing, I will try it, Thanks♪(･ω･)ﾉ ψ(｀∇´)ψ

TheElementalOfDestruction commented 1 month ago

If the code has issues, it's probably because I wrote it super quickly, in like 5 minutes. I did something extremely similar before that did work.

MaskInLife commented 1 month ago

Hello, I'm back. I've tried this method and the patch takes effect, but it seems that the windows-950 codec can't fully decode the data either. The current error message is: UnicodeDecodeError: 'windows-950' codec can't decode byte 0xb7 in position 0: unexpected end of data. I've tried Big5, UTF-8, and other encodings, but none of them can decode it correctly. Maybe this email contains characters in multiple encodings? This is really a headache. I want to ask whether there is a way to skip data when an undecodable sequence is encountered. It is a bit crude, but it feels like it would solve the urgent problem. (～￣▽￣)～

TheElementalOfDestruction commented 1 month ago

Oh, I didn't actually read the error message completely. The message is complaining that the second byte of a multibyte sequence is just not present for some reason. It's possible the bytes were split up before being parsed, perhaps because they were in a format RTFDE wasn't expecting, like having the bytes just sitting there unescaped. To know whether that's what happened, I'd have to analyze the actual RTF body of the email, which unfortunately may contain sensitive data. If you can share just the RTF body and not the rest of the email, that would be awesome, as I could see the full picture with much of the sensitive data gone. Alternatively, you can take the RTF body, redact any readable sensitive information from it, and send that; it should be good enough.
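As a rough illustration of that failure mode, the sketch below reproduces the same class of error with Python's built-in cp950 codec by decoding a two-byte character one byte at a time. It is not a diagnosis of what RTFDE does with this particular email; the sample character and the split point are assumptions chosen for the demonstration.

```python
import codecs

data = "中".encode("cp950")          # one character, two bytes
first, second = data[:1], data[1:]  # simulate the pair being split apart

# Decoding the lead byte on its own fails because the trail byte is missing.
try:
    first.decode("cp950")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

# An incremental decoder buffers the lead byte until the trail byte arrives.
decoder = codecs.getincrementaldecoder("cp950")()
text = decoder.decode(first) + decoder.decode(second, final=True)

print(reason, repr(text))
```

If the bytes really are being split before decoding, swapping the codec name would not help, since any Big5-family codec needs both bytes of the pair at once.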

MaskInLife commented 1 month ago

Hi, I'm sorry, this email is a bit sensitive, so it may not be convenient to share it. I will see whether I can remove the sensitive information from the RTF body and share it later. For now, I am using ISO-8859-1 encoding to get around this decoding error, which seems to work. (ToT)/~~~
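For context on why the ISO-8859-1 workaround stops the error, here is a small sketch using only Python's built-in codecs. The sample bytes are made up for illustration, and it is not assumed here that RTFDE or extract-msg expose a way to pass an error handler.

```python
# Valid CP950 text followed by a dangling lead byte (made-up sample data).
raw = "中文".encode("cp950") + b"\xb7"

# ISO-8859-1 maps every byte to a code point, so decoding never raises,
# but multibyte characters come out as unrelated Latin-1 characters.
as_latin1 = raw.decode("iso-8859-1")

# errors="replace" keeps the decodable text and marks the bad byte with U+FFFD.
as_cp950 = raw.decode("cp950", errors="replace")

print(repr(as_latin1), repr(as_cp950))
```

The trade-off is that ISO-8859-1 silences the exception by giving up on the Chinese text entirely, whereas a lenient error handler loses only the bytes that could not be decoded.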

TeamMsgExtractor / msg-extractor

'cp950' codec can't decode byte 0xb7 in position 0: incomplete multibyte sequence #430