Open lucasjinreal opened 2 months ago
Hi, @lucasjinreal , you can filter table-to-markdown samples from DocStruct4M by `'task_name'=='mp_sft'` or `'dataset_name' in ['TURL', 'PubTabNet']`.
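For anyone who wants to try this, here is a minimal sketch of that filter. It assumes each sample is a dict carrying the `task_name` and `dataset_name` keys mentioned above, and that the annotations are stored one JSON object per line; the storage format, the `iter_jsonl` helper, and the function names are assumptions for illustration, not part of the released code. It also reads the "or" above as a logical OR between the two conditions.

```python
import json
from typing import Iterable, Iterator, List

# Dataset names taken from the reply above.
TABLE_DATASETS = {"TURL", "PubTabNet"}


def is_table_to_markdown(sample: dict) -> bool:
    """Return True if the sample matches either condition from the reply."""
    return (
        sample.get("task_name") == "mp_sft"
        or sample.get("dataset_name") in TABLE_DATASETS
    )


def filter_table_samples(samples: Iterable[dict]) -> List[dict]:
    """Keep only the samples that look like table-to-markdown data."""
    return [s for s in samples if is_table_to_markdown(s)]


def iter_jsonl(path: str) -> Iterator[dict]:
    """Hypothetical loader: assumes one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Usage would look like `filter_table_samples(iter_jsonl(path))`, where `path` points at whichever annotation file you downloaded. If the two conditions are meant to be combined differently, only `is_table_to_markdown` needs to change.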
Besides, our model doesn't support Chinese OCR yet~
Hi, still wanna ask 2 questions.
- The dataset samples all seem to convert tables to markdown. How about formulas and normal markdown articles?
- The conversion uses span to wrap the markdown. Will it affect model learning if some of the data consists of markdown mixed with plain text?
Hi, @lucasjinreal :
Among these open-sourced datasets, I can't really find which one has img -> markdown text information.
And where does the Chinese OCR ability come from? The whole dataset has no Chinese,