X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Apache License 2.0

Which data teaches the model table-to-markdown conversion? #47

Open lucasjinreal opened 2 months ago

lucasjinreal commented 2 months ago

Among the released datasets, I can't find which one actually contains image -> markdown text pairs.

Also, where does the Chinese OCR ability come from? The whole dataset has no Chinese.

HAWLYQ commented 2 months ago

Hi @lucasjinreal, you can filter table-to-markdown samples from DocStruct4M with 'task_name'=='mp_sft' or 'dataset_name' in ['TURL', 'PubTabNet']. Also, our model doesn't support Chinese OCR yet.
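For anyone who wants to apply that filter programmatically, here is a minimal sketch. It assumes the DocStruct4M annotations are available as JSON Lines, one record per line, with `task_name` and `dataset_name` keys as in the condition above. The helper names and the file layout are assumptions for illustration, not something confirmed in this thread.

```python
import json
from typing import Iterable, Iterator

# Dataset names taken from the filter condition quoted above.
TABLE_DATASETS = {"TURL", "PubTabNet"}


def is_table_to_markdown(record: dict) -> bool:
    """Return True if the record matches the filter given in the reply above."""
    return (
        record.get("task_name") == "mp_sft"
        or record.get("dataset_name") in TABLE_DATASETS
    )


def filter_table_samples(records: Iterable[dict]) -> Iterator[dict]:
    """Lazily yield only the table-to-markdown records."""
    return (r for r in records if is_table_to_markdown(r))


def load_jsonl(path: str) -> Iterator[dict]:
    """Read one JSON object per non-empty line (assumed file layout)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Usage would look like `list(filter_table_samples(load_jsonl(path)))`, where `path` points at whichever annotation file you downloaded.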

lucasjinreal commented 2 months ago

Hi, I still want to ask two questions.

  1. The dataset only shows tables converted to markdown. What about formulas and ordinary markdown articles?
  2. The conversion uses span tokens to wrap the markdown. Will it affect model learning if some of the data consists of markdown mixed with plain text?

HAWLYQ commented 2 months ago

Hi @lucasjinreal:

  1. We haven't considered formulas, which don't appear in our datasets. We also haven't tried converting articles into markdown style. That may be a good way to represent the structure of ordinary articles.
  2. Special span tokens better distinguish structure information from plain text content in the tables. This may make it easier for a model to learn both table structure and text reading.
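
To make the second point concrete, here is a small sketch of a training target that mixes plain text with a delimited markdown table. The delimiter strings `<md>` and `</md>` and the helper names are placeholders chosen for illustration. This thread does not state the exact tokens the model uses.

```python
from typing import List

# Hypothetical delimiters marking where the markdown table starts and ends.
MD_START = "<md>"
MD_END = "</md>"


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows (first row is the header) as a markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def wrap_table(rows: List[List[str]]) -> str:
    """Wrap the markdown table so its boundaries are explicit."""
    return f"{MD_START}\n{table_to_markdown(rows)}\n{MD_END}"


def build_target(plain_text: str, rows: List[List[str]]) -> str:
    """Combine plain text and a wrapped table into one target string."""
    return f"{plain_text}\n{wrap_table(rows)}"
```

With delimiters like these, everything outside the wrapped region can be read as plain text, while pipes and dashes inside it carry table structure.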