Train nougat with mbart scratch init

strand2013 commented 10 months ago

Great work! I want to train nougat-base on a Chinese OCR task, so I need to change tokenizer.json. One question: if I train the model with the mBART decoder initialized from scratch, can it work?

Whalesong-zrs commented 9 months ago

I added the entries from the bert-base-chinese tokenizer.json to this tokenizer.json and overfit the model; it can output Chinese characters.

strand2013 commented 9 months ago

Thank you for your reply. How do I merge the two tokenizer.json files?

Whalesong-zrs commented 9 months ago

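I'll just reply in Chinese. You can follow the format of his tokenizer.json: it is a dictionary of character-id pairs, so you just append the commonly used characters after the English ones. That is what I did. The overfitting test still has a small problem, though: I trained on 20 PDF pages and then validated on the same 20, and the model often only outputs the content of the first or second batch. So I am now building a large Chinese dataset and hope that helps.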
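A minimal sketch of the "append characters after the English ones" idea, not the commenter's actual script. It assumes the vocabulary sits under `model.vocab` in tokenizer.json as a token-to-id mapping; `extend_vocab` and the tiny `cfg` dictionary are hypothetical names made up for illustration. New ids continue from the current maximum, so the existing English ids are untouched.

```python
import json


def extend_vocab(tokenizer_cfg: dict, new_chars) -> int:
    """Append unseen characters to the vocab with fresh ids; return how many were added."""
    # Assumption: the token-to-id mapping is stored at tokenizer_cfg["model"]["vocab"].
    vocab = tokenizer_cfg["model"]["vocab"]
    next_id = max(vocab.values()) + 1
    added = 0
    for ch in new_chars:
        if ch not in vocab:  # skip characters that already have an id
            vocab[ch] = next_id
            next_id += 1
            added += 1
    return added


# Toy stand-in for a loaded tokenizer.json (a real file would come from json.load).
cfg = {"model": {"vocab": {"<s>": 0, "a": 1, "b": 2}}}
added = extend_vocab(cfg, ["中", "文", "a", "中"])
print(added, json.dumps(cfg["model"]["vocab"], ensure_ascii=False))
```

After growing the vocabulary this way, the decoder's embedding and output layers presumably need to be resized to the new vocabulary size before fine-tuning. It is also worth checking that the new ids do not collide with any special tokens listed elsewhere in the file.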

strand2013 commented 9 months ago

OK. Is your model trained from random initialization?

Whalesong-zrs commented 9 months ago

The data isn't ready yet. The current plan is to fine-tune from his model; the earlier overfitting run was also a fine-tune.

Whalesong-zrs commented 9 months ago

Are you still working on this? I'd like to exchange ideas with you.

limaopeng1 commented 8 months ago

Is there an open-source Chinese dataset? I'd like to train on it and see how it performs.

Whalesong-zrs commented 8 months ago

No, I'm building my own.

limaopeng1 commented 8 months ago

Could you share how you build the dataset? Also, how well does the model you're training perform so far?

Whalesong-zrs commented 8 months ago

I take Chinese text and render it to PDF. I'm still doing basic validation: after four epochs, it reached 60% on my own small validation set.

limaopeng1 commented 8 months ago

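How do you split the pages after rendering Chinese text to PDF? I tried the author's split_htmls_to_pages.py. The code uses unidecode to encode characters, so the Chinese gets converted into the following (it looks like pinyin):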

Tai Wan Zong He Yan Jiu Yuan Yuan Chang 1994Nian 2Yue - ?Nian Dong Nan Ya Tou Zi Gong Si Dong Shi Chang 1998Nian 9Yue - ?Nian Zong Tong Fu Guo Ce Gu Wen 2001Nian 5Yue 20Ri

Whalesong-zrs commented 8 months ago

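I generate only one PDF page at a time, which guarantees that the characters and the image correspond, so there is no pagination problem.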
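One way to picture the one-page-at-a-time approach, as a rough sketch rather than the commenter's pipeline: group the source paragraphs into chunks small enough for a single page, then render each chunk to its own PDF so the chunk text is the label for that page. `chunk_into_pages` and the `max_chars` budget are hypothetical. The budget would have to be tuned to the font and page size, and the renderer should still confirm that each output has exactly one page.

```python
def chunk_into_pages(paragraphs, max_chars=800):
    """Group paragraphs into chunks of at most max_chars characters each.

    A paragraph longer than max_chars is kept whole in its own chunk.
    """
    pages, current, size = [], [], 0
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and size + len(para) > max_chars:
            pages.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        pages.append("\n\n".join(current))
    return pages


paragraphs = ["a" * 500, "b" * 400, "c" * 100]
pages = chunk_into_pages(paragraphs, max_chars=800)
# Each element of `pages` would be rendered to its own single-page PDF,
# and the same string saved as that page's ground-truth text.
print(len(pages))
```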

Whalesong-zrs commented 8 months ago

There's a friend-verification step, and I can't get through it.

openforward commented 5 months ago

May I ask how you set up train_yaml when fine-tuning a Chinese version of nougat? Why is my loss always NaN? I've already switched to a Chinese tokenizer.

Whalesong-zrs commented 5 months ago

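We barely changed it. We are fine-tuning, so model_path is set to the path of the author's original model. Also, our tokenizer is Chinese and English concatenated together, and we train on Chinese and English data together to prevent forgetting.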
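A small sketch of mixing English samples into the Chinese training set, in the spirit of training on both languages together. The thread does not give a mixing ratio, so `en_fraction=0.3` is purely an assumed placeholder, and `mix_datasets` is a hypothetical helper rather than anything from the nougat codebase.

```python
import random


def mix_datasets(zh_samples, en_samples, en_fraction=0.3, seed=0):
    """Return a shuffled list in which roughly en_fraction of the items are English."""
    if not 0 <= en_fraction < 1:
        raise ValueError("en_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of English samples needed so that they make up en_fraction of the mix,
    # capped by how many English samples are available.
    wanted = round(len(zh_samples) * en_fraction / (1 - en_fraction))
    n_en = min(len(en_samples), wanted)
    mixed = list(zh_samples) + rng.sample(list(en_samples), n_en)
    rng.shuffle(mixed)
    return mixed


zh = [f"zh{i}" for i in range(70)]
en = [f"en{i}" for i in range(100)]
mixed = mix_datasets(zh, en, en_fraction=0.3)
print(len(mixed), sum(s.startswith("en") for s in mixed))
```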

openforward commented 5 months ago

QQ460689290 Would it be OK to add you and ask a few quick questions?

Whalesong-zrs commented 5 months ago

Added.

SidneyRey commented 5 months ago

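Hello, have you fine-tuned on English, and is there a ready-made fine-tuning dataset? I'd like to use the English data format as a reference for building Chinese data. Could we add each other as friends? I'd like to learn from you. QQ: 510341751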

facebookresearch / nougat

Train nougat with mbart scratch init #181