getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License
6.68k stars 363 forks

Is it possible to opensource a markdown converting dataset? #3

Open MonolithFoundation opened 4 months ago

MonolithFoundation commented 4 months ago

Is it possible to open-source a Markdown conversion dataset?

tylermaran commented 4 months ago

Probably! Are you thinking a training set of pdf => markdown?

It's probably not something I'll be working on right away, but I will be putting together some benchmarks for testing different models. That will probably be in the 50-100 document range, though, and probably not meaningful as a training set.

MonolithFoundation commented 4 months ago

If it can be used for training, that would be helpful for training an MLLM for OCR and Markdown conversion, like GPT-4o.

wizenheimer commented 2 months ago

Hey @MonolithFoundation, you could give this a shot. It's pretty early, but it would help you bootstrap a dataset. I have published a few sample datasets as well. Cheers!

MonolithFoundation commented 2 months ago

@wizenheimer hello, this looks like a very nice tool! Thanks for open-sourcing it. However, I didn't find an open link to a PDF (image) -> Markdown (text) dataset out of the box.

Would you consider opening up such a generated dataset?

wizenheimer commented 2 months ago

Thanks :D You could take a look here and here. It's a sampled set from datasets like DocLayNet, LayoutLM among others. The size is far from being usable for training/evaluation but could be a good start.