Closed by ninedesu 9 hours ago
@ninedesu This is an excellent question, and yes, we plan to build a community where people can contribute data for fine-tuning. At the moment, we are gathering all our internal and external datasets (e.g. https://huggingface.co/datasets/ds4sd/DocLayNet) and preparing them so that we can share them all on Hugging Face!
With regard to OCR, we still have some work to do, and for now we rely on third-party OCR.
I want to know whether we can use our own dataset to fine-tune the OCR.