MartinPacker / md2pptx

Markdown To PowerPoint converter
MIT License
206 stars, 31 forks

There's been an error: UnicodeEncodeError #143

Closed zhuwentao2150 closed 1 year ago

zhuwentao2150 commented 1 year ago

Hello, this is a great library! While using it I found that it doesn't support Markdown files containing Chinese characters. How can I modify it so that it can convert md files containing Chinese characters to pptx? Thanks a lot!

image

MartinPacker commented 1 year ago

That's a good question - to which I don't know the answer. It looks to me like it's actually a problem in lxml - which python-pptx relies on (and hence md2pptx relies on).

I'd love to fix this - if it's possible (even if I would rely on you to test it).

Does anyone know how to get python-pptx or lxml to accept Chinese characters?

One possibility worth testing is whether entity references can be used to insert Chinese characters. That might provide a workaround - if it worked.
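A minimal sketch of what the entity-reference idea could look like, assuming the text is converted to XML numeric character references before being passed on. The helper name `to_char_refs` is hypothetical and not part of md2pptx, and whether python-pptx would render such references as characters rather than as literal text is untested here.

```python
import html


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    # The 'xmlcharrefreplace' error handler turns each character that ASCII
    # cannot represent into the form &#NNNN; (decimal code point).
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "已下架 test"
encoded = to_char_refs(sample)
print(encoded)

# Round trip through html.unescape to confirm nothing was lost.
print(html.unescape(encoded) == sample)
```

Plain ASCII input passes through unchanged, so a helper like this could in principle be applied to every string.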

zhuwentao2150 commented 1 year ago

I'd be more than happy to help test it. I've tried generating slides with Chinese text using python-pptx directly, with good results, and now wonder if the problem is with lxml.

MartinPacker commented 1 year ago

Thanks for the offer to help - by testing. If python-pptx works it's either not an lxml problem or else it's the way md2pptx is directly using lxml.

What happens if you try encoding a string (a character or two) using entity references?

I'll note that md2pptx only uses lxml directly for fancy things - manipulating XML beyond what python-pptx does.

MartinPacker commented 1 year ago

Looks like it's the string addFormattedText passes to python-pptx that's the problem.

MartinPacker commented 1 year ago

image

The above - from the python-pptx docs for adding text to a run - might have some bearing on the problem. Particularly the "assumed to be UTF-8" bit.

MartinPacker commented 1 year ago

I wonder if making md2pptx UTF-16, instead of UTF-8, would help.
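As a quick check on that idea, the sketch below encodes one of the sample strings in both UTF-8 and UTF-16. Both encodings cover the whole Unicode range, so both should round-trip Chinese text without loss. The sample string is only an illustration.

```python
sample = "已下架"

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")  # little-endian, no byte-order mark

# The byte counts differ, but the text is the same either way.
print(len(utf8_bytes), len(utf16_bytes))

# Each encoding recovers the original string exactly.
assert utf8_bytes.decode("utf-8") == sample
assert utf16_bytes.decode("utf-16-le") == sample
```

Since both round-trip cleanly, switching encodings would not by itself be expected to add support for characters that UTF-8 cannot already represent.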

MartinPacker commented 1 year ago

Now I'm confused. I have a UTF-8 file with the following in it, and md2pptx has no trouble with it:

### String Test

* ディアボリックラヴァーズ or バッテリ
* 已下架
* لحضور المؤتمر الدولي العاشر
* 类, 有 优 先 选 择 的 权 利

(Note the right-to-left Arabic text on the third line.)

zhuwentao2150 commented 1 year ago

It's very strange indeed. I can't use the UTF-8 format, but when I use the GB2312 format instead it generates text with Chinese characters.
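One possible reading of this result, which the thread does not confirm, is that the file is being decoded with something other than UTF-8 somewhere along the way. The sketch below only shows that the same text produces different bytes under the two encodings, and what happens when the wrong codec is used to decode them. It does not reproduce md2pptx's own file handling.

```python
sample = "中文"

gb_bytes = sample.encode("gb2312")
utf8_bytes = sample.encode("utf-8")

# Same text, different bytes on disk.
print(gb_bytes != utf8_bytes)

# Decoding with the matching codec recovers the text.
assert gb_bytes.decode("gb2312") == sample
assert utf8_bytes.decode("utf-8") == sample

# Decoding GB2312 bytes as UTF-8 raises an error for this sample.
try:
    gb_bytes.decode("utf-8")
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True
print(mismatch_failed)
```

Passing an explicit `encoding=` argument to `open()` avoids depending on the platform default, which varies between systems.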

MartinPacker commented 1 year ago

Thanks for the feedback. I don't know where we go from here. BTW I've not heard of GB2312 before.

MartinPacker commented 1 year ago

I'm inclined to close this Issue. I don't normally do this without a fix (or similarly valid resolution).

The reason is I don't see a "fix" short of copying the input data to a temporary file with a different encoding. That seems plain wrong to me.

Anyhow I won't close this for a few days. And in any case it can always be reopened.
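For reference, the temporary-file approach described above might look roughly like the sketch below. The function name `transcode_to_utf8` and the default source encoding are assumptions for illustration; nothing in this thread shows md2pptx doing this.

```python
import os
import tempfile


def transcode_to_utf8(path: str, source_encoding: str = "gb2312") -> str:
    """Copy a text file to a temporary UTF-8 file and return the new path."""
    with open(path, "r", encoding=source_encoding) as src:
        text = src.read()
    # delete=False keeps the file on disk after closing, so a converter can
    # read it; the caller is responsible for removing it afterwards.
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", suffix=".md", delete=False
    ) as dst:
        dst.write(text)
        return dst.name


# Usage: create a GB2312 file, transcode it, and read the UTF-8 copy back.
with tempfile.NamedTemporaryFile(
    "w", encoding="gb2312", suffix=".md", delete=False
) as f:
    f.write("### 中文\n")
    original_path = f.name

utf8_path = transcode_to_utf8(original_path)
with open(utf8_path, encoding="utf-8") as f:
    result = f.read()
print(result)

# Clean up both temporary files.
os.remove(original_path)
os.remove(utf8_path)
```

The caller has to know the source encoding in advance, which is part of why this reads as a workaround rather than a fix.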

MartinPacker commented 1 year ago

I'm closing now as I don't see a meaningful way to progress it. It can always be re-opened.