UnicodeEncodeError The input .md file in simplified Chinese

Lydiagugugaga commented 3 weeks ago

UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 23: surrogates not allowed

I'm trying to convert an .md file whose content is in Simplified Chinese, but I'm running into encoding problems. I've read that the latest version mentions a fix for #161, but I still can't get it to work on my end, so I'd like to ask what the best way to fix it is.
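
For context, the error message itself can be reproduced in a few lines of plain Python, independent of md2pptx. This is only a sketch: it assumes (the thread does not confirm this) that the stray character came from Python's `surrogateescape` error handler, which maps an undecodable byte 0x80 to the lone surrogate U+DC80.

```python
# A byte that is not valid UTF-8 on its own (0x80 is a continuation byte).
raw = b"\x80"

# With errors="surrogateescape", Python carries the bad byte through as a
# lone low surrogate instead of raising at decode time.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding that string strictly as UTF-8 then fails with the reported error.
try:
    text.encode("utf-8")
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

print(repr(text))
print(message)
```

If this is what happened, the input bytes were probably not valid UTF-8 when Python read them, which points at the file's encoding rather than at the Chinese text as such.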

MartinPacker commented 3 weeks ago

Thank you for reporting this.

I'm away from my computer for the next couple of days, so I can't look at this right away.

Can you somehow get me a minimal reproducible example? I will also say that a workaround might be using a hexadecimal numeric character reference. But that's probably not a scalable approach.

Lydiagugugaga commented 3 weeks ago

I just ran: python md2pptx output.pptx < 222.md

MartinPacker commented 3 weeks ago

Thanks. I need the 222.md file - or a minimal version of it. (No confidential etc data.)

Lydiagugugaga commented 3 weeks ago

Thanks for your reply. Here is just an example markdown file:

222.md

MartinPacker commented 3 weeks ago

Thanks for this. I note your attempt to use <p> paragraph tags. Those aren't supported by md2pptx - if I remember correctly. I would use asterisks * instead.

If you think paragraph tags should be supported - and have a clear idea as to how they should be rendered - please open another issue.

Lydiagugugaga commented 3 weeks ago

Thanks for your reply. About the <p> paragraph tags: I suspected they were the problem too, but I tried removing them and using plain Markdown, and it still doesn't work.

MartinPacker commented 3 weeks ago

Right. BBEdit (one of my editors of choice) thinks the file is UTF-8 but I suspect it isn't. Sniffing the actual encoding is an approach I might take.
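
One rough way to sniff the encoding is to check for a byte order mark and then try a few candidate encodings in turn. The sketch below is an illustration only: the function name `sniff_encoding` and the candidate list are assumptions, not part of md2pptx, and a first-fit test like this can be fooled because many byte sequences decode cleanly under more than one encoding.

```python
import codecs

# Candidate encodings are an assumption; adjust them for the file at hand.
CANDIDATES = ("utf-8", "gb18030", "big5")


def sniff_encoding(data: bytes, candidates=CANDIDATES):
    """Return a best-guess encoding name for data, or None if nothing fits."""
    # A byte order mark is the strongest hint, so check for one first.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise return the first candidate that decodes without errors.
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

It would be called on the raw bytes of the file, for example `sniff_encoding(open("222.md", "rb").read())`.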

MartinPacker commented 3 weeks ago

This is strange: My run with your file yields this:

md2pptx Markdown To Powerpoint Converter 5.0.2+ 15 August, 2024
===============================================================

Open source project: https://github.com/MartinPacker/md2pptx

External Dependencies:

  Python: 3.9.6
  python-pptx: 0.6.23
  Pillow: 10.3.0
  CairoSVG: Not Installed
  graphviz: Not Installed

Internal Dependencies:

  funnel: 0.1
  runPython: 0.4

No slide to document metadata on. Continuing without it.

Slides:
=======

   1   初学者骑车之路：掌握自行车技巧的必备指南
   2   自行车基础知识
   3       自行车的组成部分
   4       自行车的类型和用途
   5   准备骑行前的注意事项
   6       自行车装备和保养
   7       骑行安全知识和规则
   8   学习骑行技巧
   9       自行车平衡和姿势
  10       踩踏和换挡技巧
  11       转弯和刹车技巧

MartinPacker commented 3 weeks ago

I'm suspecting your problem is with python-pptx or lxml, rather than md2pptx. But I keep an open mind about this.

Lydiagugugaga commented 3 weeks ago

Thank you so much for helping me with this question. I've looked at some of the previously mentioned issues and also tried changing the python-pptx version, which is currently v0.6.23. But it didn't work.

If it's a problem with python-pptx or lxml, what do you suggest to fix it?

MartinPacker commented 3 weeks ago

I've just fixed a problem with numeric character references. So with the very latest push a workaround for you might well be to use hexadecimal character references such as &#xdc80;. Fiddly, I know.
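
To make the character-reference workaround less fiddly, the conversion could be scripted. The helper below is a hypothetical sketch (the name `to_char_refs` is not part of md2pptx), and it assumes the hexadecimal form `&#x...;` is what md2pptx accepts. Note that it only helps once the text has been decoded correctly; a lone surrogate such as U+DC80 is not a real character and would not become valid by being written as a reference.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):x};"
        for ch in text
    )


# ASCII passes through unchanged; each Chinese character becomes one reference.
print(to_char_refs("# 自行车"))
```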

Lydiagugugaga commented 3 weeks ago

Thank you very much. I'll try it.

MartinPacker commented 3 weeks ago

Please let me know how you get on. And do you think the text is really UTF-16 rather than UTF-8? The U+DC80 character isn't valid in UTF-8, apparently.

(And I just pushed some doc changes after the one that fixes numeric character references - so don't get confused by what the latest commit says.)
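
On the UTF-16 question above: if the file does turn out to be in some other encoding, one option is to transcode it to UTF-8 before feeding it to md2pptx. A minimal sketch, assuming the real encoding is already known; `to_utf8` is a hypothetical helper and not something md2pptx provides.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode data strictly from source_encoding and re-encode it as UTF-8."""
    # Strict decoding raises UnicodeDecodeError if source_encoding is wrong,
    # which is preferable to silently producing surrogates or mojibake.
    text = data.decode(source_encoding)
    return text.encode("utf-8")
```

The raw bytes of 222.md could be passed through this and written to a new file, which is then used as the input in the same command as before.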

MartinPacker / md2pptx

UnicodeEncodeError The input .md file in simplified Chinese #163