ArtifexSoftware / pdf2docx

Open source Python library for converting PDF to DOCX.
https://pdf2docx.readthedocs.io
GNU Affero General Public License v3.0
2.46k stars 356 forks

Error during conversion when a font name is in Chinese (e.g. "宋体") #286

Closed hlhtddx closed 4 days ago

hlhtddx commented 4 months ago

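As the title says, conversion fails with the error "bytes must be in range[0 to 255]" when a font name is in Chinese (e.g. "宋体"). The failure occurs at https://github.com/ArtifexSoftware/pdf2docx/blame/master/pdf2docx/common/share.py#L128. When the font name is Chinese, ord(c) is greater than 255, so converting it to bytes raises an error.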

def decode(s:str):
    '''Try to decode a unicode string.'''
    b = bytes(ord(c) for c in s) ### the error occurs here
    for encoding in ['utf-8', 'gbk', 'gb2312', 'iso-8859-1']:
        try:
            res = b.decode(encoding)
            break
        except:
            continue
    return res
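
For illustration, here is a minimal sketch of one way the failure described above could be avoided. It is not the fix adopted by pdf2docx, and the name decode_safe is hypothetical. It assumes that a string containing any character above code point 255 is already correctly decoded text and can be returned unchanged, while a string made only of characters 0 to 255 may be mis-decoded bytes and is worth re-decoding.

```python
def decode_safe(s: str) -> str:
    '''Hypothetical variant of decode(): tolerate names that are already Unicode.'''
    try:
        # latin-1 maps code points 0..255 one-to-one onto byte values, so this is
        # equivalent to bytes(ord(c) for c in s) for strings in that range.
        b = s.encode('latin-1')
    except UnicodeEncodeError:
        # At least one character is above 255 (e.g. "宋体"): assume the string
        # is already proper text and leave it untouched.
        return s
    for encoding in ('utf-8', 'gbk', 'gb2312', 'iso-8859-1'):
        try:
            return b.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s  # not reached in practice: iso-8859-1 accepts every byte value


if __name__ == '__main__':
    print(decode_safe('宋体'))   # already Unicode, returned as-is
    print(decode_safe('宋体'.encode('utf-8').decode('latin-1')))  # mojibake repaired
```

Catching only the specific exceptions, instead of a bare except, keeps unrelated errors visible, and returning from inside the loop removes the risk of res being unbound.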
hlhtddx commented 4 months ago

One detail I left out: the problem only occurs when multiprocessing=True is selected; single-process mode works without issue.

greendreamer commented 1 week ago

Hi @hlhtddx, please provide the Songti (宋体) font file so that we can reproduce this.

greendreamer commented 4 days ago

Closing this for lack of reaction for an extended amount of time. Feel free to open a new issue.