DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

Is it possible to fine tune with our own datasets? #411

Closed: ninedesu closed this issue 9 hours ago

ninedesu commented 12 hours ago

I want to know if we can use our own dataset to fine-tune the OCR.

PeterStaar-IBM commented 9 hours ago

@ninedesu This is an excellent question, and yes, we plan to build a community where people can contribute data for fine-tuning. At the moment, we are gathering all our internal and external datasets (e.g., https://huggingface.co/datasets/ds4sd/DocLayNet) and preparing them so we can share them all on Hugging Face!

With regard to OCR, we still have a bit of work to do and currently rely on third-party OCR.