X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Apache License 2.0

About dataset when train model #77

Closed: SWHL closed this issue 2 weeks ago

SWHL commented 1 month ago

Hello, thank you for your excellent work.

I am curious about the identifiers <doc></doc>, <ocr></ocr>, <md></md> etc. introduced in the DocStruct4M dataset to unify various tasks. I did not see any special treatment for them in the training code, and they did not appear as special tokens.

I would like to ask: during training, are these identifiers treated as normal text, or is there some other processing applied to them?

Thank you.

HAWLYQ commented 1 month ago

Hi @SWHL, there is no special treatment for such identifiers. They are treated as normal text and are just used to distinguish text parsed from images from normal answers given by the LLM.
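A quick way to check this yourself: tokenize one of the identifiers with the base tokenizer and you will see it split into ordinary subword pieces instead of mapping to a single special-token id. A minimal sketch with huggingface transformers (the checkpoint path is a placeholder for wherever you keep the model, and the pieces shown in the comments are only an example):

# Sketch: verify that <doc> is handled as plain text, not as one special token.
from transformers import AutoTokenizer

# placeholder path; point it at your local DocOwl checkpoint
tok = AutoTokenizer.from_pretrained("path/to/docowl-checkpoint", use_fast=False)

print(tok.tokenize("<doc>"))
# A SentencePiece/LLaMA-style tokenizer splits this into several pieces,
# e.g. something like ['▁<', 'doc', '>'], not a single '<doc>' token.

print(tok.additional_special_tokens)
# Identifiers such as <doc>, <ocr>, <md> do not show up here.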

SWHL commented 1 month ago

Thanks for your response. It's very clear.

I have another question about DocDownStream1.0, which does not appear in the paper. Where is this dataset used?

HAWLYQ commented 1 month ago

Hi @SWHL, DocDownStream1.0 is used in 2nd-stage training~

SWHL commented 1 month ago

Thanks for the quick response.

SWHL commented 2 weeks ago

Hello, while reading the DocOwl 1.5 paper, I ran into another question about special tokens. The Table Parsing part of the paper says that you add special text tokens: [screenshot from the paper's Table Parsing section]

But I couldn't find these special tokens in the DocOwl 1.5 tokenizer's vocab:

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<unk>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

I also couldn't find any related code in the repo source code.

So how do the colspan and rowspan markers serve as special tokens? Thank you very much.

HAWLYQ commented 2 weeks ago

Hi, 'special token' here doesn't mean tokens added to the tokenizer files; they are just distinguishing text markers that indicate row/column spans, and they are treated as normal text~
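For contrast, if they were real special tokens they would have to be registered with the tokenizer and the embedding table resized; since that is not done, a span marker just gets split into normal subword pieces. A rough sketch (the checkpoint path is a placeholder, and the exact marker spelling is illustrative, so check the paper/dataset for the real format):

# Sketch: the row/column span markers are ordinary text to the tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/docowl-checkpoint", use_fast=False)

marker = "<COLSPAN=2>"            # illustrative spelling of a span marker
print(tok.tokenize(marker))       # splits into several subword pieces
print(marker in tok.get_vocab())  # False: not a single vocab entry

# What registering real special tokens WOULD look like (not done here):
# tok.add_special_tokens({"additional_special_tokens": [marker]})
# model.resize_token_embeddings(len(tok))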

SWHL commented 2 weeks ago

Thanks.