SWHL closed this issue 2 weeks ago.
Hi @SWHL, there is no special treatment for these identifiers. They are treated as normal text and are only used to distinguish text parsed from images from normal answers given by the LLM.
Thanks for your response. It's very clear.
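As a rough illustration of what "treated as normal text" means, here is a minimal sketch. The helper `wrap_with_identifier` and the sample strings are hypothetical and are not taken from the DocOwl training code or the DocStruct4M data. The only point is that the identifiers are ordinary characters in the target string, so they go through the regular tokenizer with no extra handling.

```python
def wrap_with_identifier(tag: str, text: str) -> str:
    """Wrap parsed text in a plain-text identifier pair such as <ocr>...</ocr>.

    Hypothetical helper for illustration only: the tags are ordinary string
    content, not entries registered in the tokenizer's special-token list.
    """
    return f"<{tag}>{text}</{tag}>"


# Text parsed from an image gets wrapped; a normal LLM answer is left as-is.
parsed = wrap_with_identifier("ocr", "Total: 42")
answer = "The total is 42."

print(parsed)  # <ocr>Total: 42</ocr>
print(answer)  # The total is 42.
```

Both strings would be tokenized the same way; the tags simply let the model and the reader tell the two kinds of output apart.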
I have another question about DocDownStream1.0, which does not appear in the paper. Where is the dataset used?
Hi @SWHL, DocDownStream1.0 is used in 2nd-stage training~
Thanks for the fast response.
Hello, while reading the DocOwl 1.5 paper, I ran into another question about the special tokens.
In the Table Parsing part of the paper, it says you add special text tokens:
But I couldn't find these special tokens in the DocOwl 1.5 tokenizer's vocab:
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<unk>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
I also couldn't find any related code in the repo.
So how do the colspan and rowspan tokens serve as special tokens?
Thank you very much.
Hi, 'special token' here doesn't mean the tokens are added to the tokenizer's token file. They are just distinguishing tokens that indicate row/column span, and they are treated as normal text~
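One way to confirm this is to compare candidate identifiers against the special tokens map quoted in the question. The sketch below is only an illustration, not code from the repo: the function name `registered_special_tokens` is made up here, and the candidate list uses the DocStruct4M identifiers mentioned in this thread. Anything missing from the map is split by the tokenizer like any other text.

```python
import json

# Special tokens map as quoted in the question (fields copied verbatim).
SPECIAL_TOKENS_MAP = """
{
  "bos_token": {"content": "<s>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false},
  "eos_token": {"content": "</s>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false},
  "pad_token": "<unk>",
  "unk_token": {"content": "<unk>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false}
}
"""


def registered_special_tokens(raw_json: str) -> set:
    """Collect the token strings from a special tokens map.

    Each value is either a plain string or a dict with a "content" field.
    """
    tokens = set()
    for value in json.loads(raw_json).values():
        tokens.add(value["content"] if isinstance(value, dict) else value)
    return tokens


registered = registered_special_tokens(SPECIAL_TOKENS_MAP)

# Identifiers discussed in this thread; none of them is expected in the map.
candidates = ["<doc>", "</doc>", "<ocr>", "</ocr>", "<md>", "</md>"]
not_registered = [tok for tok in candidates if tok not in registered]

print(sorted(registered))  # ['</s>', '<s>', '<unk>']
print(not_registered == candidates)  # True: all are handled as normal text
```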
Thanks.
Hello, thank you for your excellent work.
I am curious about the identifiers <doc></doc>, <ocr></ocr>, <md></md>, etc. introduced in the DocStruct4M dataset to unify various tasks. I did not see any special treatment for them in the training code, and they did not appear as special tokens. I would like to ask: during training, are these identifiers treated as normal text, or are there other processing techniques?
Thank you.