Byaidu / PDFMathTranslate

PDF scientific paper translation and bilingual comparison - full-text bilingual translation of PDF documents with the original layout fully preserved, supporting Google/DeepL/Ollama/OpenAI as translation backends
GNU Affero General Public License v3.0

KeyError: 'china-ss' #11

Closed xxsunyxx closed 1 week ago

xxsunyxx commented 1 week ago

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\anaconda\envs\ppp\Scripts\pdf2zh.exe\__main__.py", line 7, in <module>
  File "C:\anaconda\envs\ppp\Lib\site-packages\pdf2zh\pdf2zh.py", line 214, in main
    extract_text(vars(parsed_args))
  File "C:\anaconda\envs\ppp\Lib\site-packages\pdf2zh\pdf2zh.py", line 98, in extract_text
    obj_patch: dict = pdf2zh.high_level.extract_text_to_fp(fp, **locals())
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda\envs\ppp\Lib\site-packages\pdf2zh\high_level.py", line 168, in extract_text_to_fp
    interpreter.process_page(page)
  File "C:\anaconda\envs\ppp\Lib\site-packages\pdf2zh\pdfinterp.py", line 1005, in process_page
    ops_new = self.device.end_page(page)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda\envs\ppp\Lib\site-packages\pdf2zh\converter.py", line 118, in end_page
    return self.receive_layout(self.cur_item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda\envs\ppp\Lib\site-packages\pdf2zh\converter.py", line 664, in receive_layout
    ops = render(ltpage)
          ^^^^^^^^^^^^^^
  File "C:\anaconda\envs\ppp\Lib\site-packages\pdf2zh\converter.py", line 620, in render
    adv = self.fontmap[fcur].char_width(ord(ch)) * size

KeyError: 'china-ss'

Python 3.11.1 on Windows Server
Byaidu commented 1 week ago

Could you upload the file you used for testing?

xxsunyxx commented 1 week ago

https://wwqp.lanzouw.com/ilUsE2duahpi

xxsunyxx commented 1 week ago

It would be great if the translation step could be handled by ollama.

Byaidu commented 1 week ago

I tried ollama before, but the translation results were weird. Do you have any ideas for fixing it?

import ollama

def translate(text, lang_to):
    # Ask a local ollama model (llama3.2 here) to act as a translation
    # engine and return the raw chat response.
    return ollama.chat(model='llama3.2', messages=[
        {
            'role': 'system',
            'content': 'You are a professional, authentic machine translation engine.',
        },
        {
            'role': 'user',
            'content': f'Translate the following source text to {lang_to}. '
                       f'Output translation directly without any additional text.\n'
                       f'Source Text: {text}\n\nTranslated Text:',
        },
    ])

translate('If we wish to include a SAT-solver in this kind of system, there are three inherent challenges. First, while a SAT-solver clearly has an input and an output, the intervening process is NP-complete, and is thus likely to present an unacceptable bottleneck. This suggests a need to design approximate, rather than exact solvers. Second, a SAT-solver is not differentiable with respect to its parameters, so computing gradients of the overall network’s loss function with respect to them is not possible. Third—and this is perhaps the most interesting source of new possibilities—it is not immediately clear what the input to a solver should be.', 'chinese')['message']['content']

如果我们希望在系统中包含一个SAT求解器,那么就存在三个内在的挑战。首先,一个SAT求解器明显有输入和输出,但间接过程是NP完成的,很可能导致不可接受的瓶颈。这个提示意味着需要设计出近似,而不是精确的求解器。第二,一个SAT求解器不与其参数相连 differentiated,因此计算整个网络损失函数的总体损失函数的导数并不能实现。第三—and这一点是可能成为新可能性源头—it 并不是立即清楚一个求解器的输入是什么。

Byaidu commented 1 week ago

The KeyError: 'china-ss' issue is fixed.

xxsunyxx commented 1 week ago

A simple, working LLM-based translation script: https://github.com/xxsunyxx/LLM-TranslateScript. In my own tests it is very slow, but it works and the code is easy to follow. It uses LM Studio as the backend.

Byaidu commented 1 week ago

I've given up on that; I doubt all this fiddling with an LLM will end up better than Google.

xxsunyxx commented 1 week ago

I'm a new author too; if it won't run for you I'll submit a commit, but it really is usable.

Byaidu commented 1 week ago

import ollama
import mtranslate

def translate(text, model, lang_to='chinese'):
    # Same prompt as before, but with the ollama model name as a parameter,
    # returning just the translated string.
    return ollama.chat(model=model, messages=[
        {
            'role': 'system',
            'content': 'You are a professional, authentic machine translation engine.',
        },
        {
            'role': 'user',
            'content': f'Translate the following source text to {lang_to}. '
                       f'Output translation directly without any additional text.\n'
                       f'Source Text: {text}\nTranslated Text:',
        },
    ])['message']['content'].strip()

# Test string with $vN$ placeholders standing in for formulas.
raw = 'In Equation (7.1), $v7$ is the cost of taking action $v8$ in state $v9$, and $v10$ corresponds to selection of a heuristic $v11$. If $v12$ selects a literal $v13$, then $v14$ is the state reached from $v15$ using $v16$, and $v17$ is the state reached from $v18$ using $v19$. In order to learn $v20$ it is represented as'

print('llama3.2:', translate(raw, 'llama3.2'))
print('gemma2:', translate(raw, 'gemma2'))
print('qwen2.5:', translate(raw, 'qwen2.5'))
print('google:', mtranslate.translate(raw, 'zh-CN'))
llama3.2: In Equation (7.1),V7是行为V8在状态V9中所付出的代价,表示选择一个预测函数V11的机制。 如果V12选取一个字面V13,则V14是使用V16从V15开始得出状态,V17是使用V19从V18开始得出状态,从而以V20来学习。
gemma2: 在式 (7.1) 中,$v7$ 是在状态 $v9$ 下执行动作 $v8$ 的成本,而 $v10$ 对应选择启发式 $v11$。如果 $v12$ 选择一个文字 $v13$,那么 $v14$ 是使用 $v16$ 从 $v15$ 抵达的状态,而 $v17$ 是使用 $v19$ 从 $v18$ 抵达的状态。为了学习 $v20$,它被表示为
qwen2.5: 在方程(7.1)中,$v7$是采取动作$v8$在状态$v9$的成本,并且$v10$对应于选择启发式$v11$。如果$v12$选择了一个文字$v13$,那么$v14$是从使用$v16$的状态$v15$到达的状态,而$v17$是从使用$v19$的状态$v18$到达的状态。为了学习$v20$,它被表示为
google: 在公式 (7.1) 中,$v7$ 是在状态 $v9$ 下采取行动 $v8$ 的成本,而 $v10$ 对应于启发式 $v11$ 的选择。如果 $v12$ 选择文字 $v13$,则 $v14$ 是从 $v15$ 使用 $v16$ 到达的状态,而 $v17$ 是从 $v18$ 使用 $v19$ 到达的状态。为了学习 $v20$,它表示为

From this quick test, gemma seems to do best; I'll try adding it in the next version.

xxsunyxx commented 1 week ago

Thinking it over afterwards: your setup may still have a problem. Calling ollama is slightly different from LM Studio, and specifying gemma2 effectively pins you to one particular LLM. Then there is the paragraph issue: given how much paragraph length matters, capping each segment at (a longer) 1k characters might get better translations out of the LLM, as sketched below.
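A minimal sketch of that chunking idea, reusing the translate() helper from the snippet above. The 1000-character cap and the naive sentence splitting are assumptions to experiment with, not anything pdf2zh currently does:

import re

def chunk_text(text, max_len=1000):
    # Greedily pack sentences into chunks of at most max_len characters,
    # so no sentence is cut in half mid-request.
    sentences = re.split(r'(?<=[.!?])\s+', text.replace('\n', ' '))
    chunks, current = [], ''
    for s in sentences:
        if len(current) + len(s) + 1 <= max_len:
            current = (current + ' ' + s).strip()
        else:
            if current:
                chunks.append(current)
            current = s
    if current:
        chunks.append(current)
    return chunks

def translate_long(text, model, lang_to='chinese'):
    # Translate chunk by chunk and stitch the pieces back together.
    return ' '.join(translate(chunk, model, lang_to) for chunk in chunk_text(text))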

xxsunyxx commented 1 week ago

And differently quantized builds of the same model will also give different results.

Byaidu commented 1 week ago

You can now call ollama for translation with -s gemma2.

xxsunyxx commented 1 week ago

Is that an ollama instance on localhost?

Byaidu commented 1 week ago

Yes.
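In case anyone needs to point it at a non-local instance: the ollama Python client also accepts an explicit host. A minimal sketch; the address below is a placeholder, not a real deployment:

import ollama

# Sketch: talk to an ollama server that is not on localhost.
# The host URL here is a placeholder.
client = ollama.Client(host='http://192.168.1.10:11434')
reply = client.chat(model='gemma2', messages=[
    {'role': 'user', 'content': 'Translate the following source text to chinese: Hello, world.'},
])
print(reply['message']['content'])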

xxsunyxx commented 1 week ago

> I tried ollama before, but the translation results were weird. Do you have any ideas for fixing it?

Coming back to this: that run of yours was almost certainly done with a llama3 model, and llama3 is not friendly to Chinese. As soon as the reply gets a bit long it starts mixing in lots of English, which makes it unusable. Among current LLMs, the more Chinese-friendly ones are Qwen, deepseek, and gemma2. Practically speaking, 7B models can handle simple translation, and 13-16B models reason well enough to be quite solid, though they ask more of the GPU. For academic use I'd personally go with a larger model; from my testing, 32B is probably not worth it. Just for reference.

xxsunyxx commented 1 week ago

I've finished running the file I provided yesterday.

(pdf) D:\>pdf2zh a1.pdf -s gemma2
Downloading...
D:\github\DocLayout-YOLO\doclayout_yolo\nn\tasks.py:733: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ckpt = torch.load(file, map_location="cpu")
 91%|███████████████████████████████████████████████████████████████████████▎ | 868/949 [1:32:40<14:24, 10.67s/it]
ERROR:pdf2zh.converter:
100%|██████████████████████████████████████████████████████████████████████████████| 949/949 [2:07:46<00:00, 8.08s/it]

(pdf) D:\>

There was one error, ERROR:pdf2zh.converter. Is there a log somewhere that I could send you? The result isn't perfect; I'll upload it shortly.

xxsunyxx commented 1 week ago

https://wwqp.lanzouw.com/izgv52dxm3mb

Maybe it's just that the yolo layout segmentation still has problems, but I'd guess simple arXiv-style pages can already be handled more or less perfectly.

xxsunyxx commented 1 week ago

pdf2zh a1.pdf -s qwen2.5:14b also ran through, so your ollama integration is fine now. One run takes about 3 hours; I'll report back in 3 hours.

Byaidu commented 1 week ago

That error probably doesn't matter; it's usually just caused by network flakiness.

Also, the layout of this test document is very complex; my guess is the yolo training set has no similar samples.

xxsunyxx commented 1 week ago

I'm going to test another approach next: convert the document to doc format to sidestep the yolo layout problem, along the lines of the sketch below.
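A minimal sketch of that conversion using the pdf2docx package (assuming pdf2docx is installed; the file names are placeholders):

from pdf2docx import Converter

# Sketch: convert the PDF to docx so the translation step can work on
# flowed paragraphs instead of detected layout boxes.
cv = Converter('a1.pdf')   # input PDF (placeholder name)
cv.convert('a1.docx')      # write the docx alongside it
cv.close()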

xxsunyxx commented 1 week ago

The converted file can't be opened with native Acrobat or with a more standards-compliant product like ABBYY FineReader. I'm guessing you previewed it in the Google Chrome browser. This should count as a bug.

Byaidu commented 1 week ago

I'm letting that one slide; if a reference manager can open it, that's a win. It's not a commercial project anyway, right?

xxsunyxx commented 1 week ago

Haha, this is already pretty close to a commercial product. If I went through a website and called an API the normal way it would cost me around 50 RMB, and I'm only using it to make textbook material for my kid. Thank you very much already.

The stars may soon stop letting you slack off. It's like when I played with whisper and used it to produce English subtitles for all the TED talks: once you release something for free, people start contacting you with all sorts of small requests.

Byaidu commented 1 week ago

Ah, I hope so, though it's been out for two months with basically no traction...

xxsunyxx commented 1 week ago

Soon. I've run every recently updated project of this kind, and yours is the only one that works. Your requirements are missing huggingface, and doclayout-yolo still has to be built by hand. As for remaining obstacles, moving the downloads to something like hf-mirror to get around the firewall would still be worthwhile.
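For the download part, assuming the models are fetched through huggingface_hub, pointing it at a mirror is a single environment variable (https://hf-mirror.com is that mirror's public endpoint); a sketch:

import os

# Sketch: route huggingface_hub downloads through a mirror. This must be
# set before anything imports huggingface_hub.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'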

Byaidu commented 1 week ago

> The converted file can't be opened with native Acrobat or with a more standards-compliant product like ABBYY FineReader. I'm guessing you previewed it in the Google Chrome browser. This should count as a bug.

The new version should resolve most of the compatibility problems.

xxsunyxx commented 1 week ago

Today I put together a docx translator of my own and stumbled on something: after layout analysis, you can use nltk to reassemble the paragraphs before translating, and the results improve. Roughly along the lines of the sketch below.
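A rough sketch of that step, assuming layout analysis yields a list of hard-wrapped text lines; the regrouping heuristic is my own assumption:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt', quiet=True)  # sentence tokenizer data

def lines_to_paragraphs(lines, sentences_per_para=5):
    # Join hard-wrapped lines back into one block, let nltk split it into
    # sentences, then regroup them into paragraph-sized translation units.
    text = ' '.join(line.strip() for line in lines if line.strip())
    sentences = sent_tokenize(text)
    return [' '.join(sentences[i:i + sentences_per_para])
            for i in range(0, len(sentences), sentences_per_para)]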

Byaidu commented 1 week ago

Let's see the results.

xxsunyxx commented 1 week ago

Send me a paper to try; mine really isn't a good candidate for another run, given its complex layout.

Byaidu commented 1 week ago

SUSY.pdf

xxsunyxx commented 1 week ago

Ran it twice and the result is a mess, I'm afraid; I have no way to strip out the LaTeX parts.
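One possible workaround, offered purely as a sketch and not anything the project does: swap the inline math for placeholder tokens before translating and restore it afterwards, much like the $vN$ placeholders in the test string earlier in this thread.

import re

INLINE_MATH = re.compile(r'\$[^$]+\$')  # inline $...$ spans only

def protect_math(text):
    # Replace each $...$ span with a $vN$ token the LLM is less likely
    # to rewrite, remembering what each token stood for.
    spans = INLINE_MATH.findall(text)
    for i, span in enumerate(spans):
        text = text.replace(span, f'$v{i}$', 1)
    return text, spans

def restore_math(text, spans):
    # Put the original math back after translation.
    for i, span in enumerate(spans):
        text = text.replace(f'$v{i}$', span, 1)
    return text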

xxsunyxx commented 1 week ago

SUSY-zh.pdf - posting a translation result from ollama qwen2.5 14b.