Open XFastDataLab opened 1 year ago
As described in the research paper by Lukas Blecher et al., Nougat was trained mainly on English (which makes sense, since most papers on arXiv are in English; English is still the common world language), so other languages are unlikely to work exceptionally well. However, the paper mentions that Nougat works acceptably well for other languages written in the Latin alphabet (Italian, German, French, etc.). Chinese is not written in the Latin alphabet, and so, as the paper mentions, this often results in repetitions (the "missing page" error).
If you want Nougat to recognize languages that are not written in the Latin alphabet, such as Chinese, Japanese, or languages that use the Cyrillic alphabet, the model would have to be fine-tuned. I'm working on a project to make preparing the training data for fine-tuning Nougat easier.
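To make the Latin vs. non-Latin distinction concrete, here is a rough sketch of a check you could run on text you have already extracted from a document by some other means. It is not part of Nougat; the function names `latin_ratio` and `likely_needs_finetuning` and the 0.5 threshold are made up for illustration, and the heuristic only uses Python's standard `unicodedata` module.

```python
import unicodedata


def latin_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Latin letters.

    A character counts as Latin if its Unicode name starts with "LATIN".
    Digits, punctuation, and whitespace are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN")
    )
    return latin / len(letters)


def likely_needs_finetuning(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical helper: flag text that is mostly non-Latin script.

    The threshold is an arbitrary assumption, not a value from the paper.
    """
    return latin_ratio(text) < threshold


if __name__ == "__main__":
    for sample in ["Hello world", "Bonjour, caf\u00e9", "中文文档", "Привет мир"]:
        print(repr(sample), latin_ratio(sample), likely_needs_finetuning(sample))
```

Documents flagged by a check like this (Chinese, Japanese, Cyrillic) are the ones where the stock model is expected to struggle and fine-tuning would be needed.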
@marwinsteiner Hi, may I inquire where I can find the project related to preparing training data for fine-tuning Nougat? I'm highly interested. :)
@xixuhu It is currently a private repo under my name. It is a WIP and does not yet support generating fine-tuning datasets for nougat-ocr. I'm still trying to figure out how I can do that. However, this is the plan, with a light Streamlit frontend so you can choose some parameters like which language(s) you want, how many pages of training data, etc.
If you want to collaborate, lmk. Sorry for the late reply.
@marwinsteiner How are things going? I'm ready to help if you need :)
It works for English PDF files, but it does not seem to handle PDFs that contain Chinese or Japanese characters well. Can I train it myself? I think it would be quite difficult for me to prepare the training data.