X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Apache License 2.0

DocDownstream-1.0 has duplicated entries #91

Closed cooleel closed 4 days ago

cooleel commented 1 week ago
ubuntu:/data/.cache/huggingface/hub/datasets--mPLUG--DocDownstream-1.0/snapshots/18fb3b5cb8b840dcde61be40ccf6247187af760b$ cat train.jsonl |head
{"image": ["./imgs/DUE_Benchmark/DocVQA/pngs/xnbl0037_1.png"], "messages": [{"role": "user", "content": "<|image|>what is the date mentioned in this letter?"}, {"role": "assistant", "content": "1/8/93"}], "task_name": "qa_sft", "dataset_name": "DocVQA"}
{"image": ["./imgs/DUE_Benchmark/DocVQA/pngs/xnbl0037_1.png"], "messages": [{"role": "user", "content": "<|image|>what is the date mentioned in this letter?"}, {"role": "assistant", "content": "1/8/93"}], "task_name": "qa_sft", "dataset_name": "DocVQA"}
{"image": ["./imgs/DUE_Benchmark/DocVQA/pngs/xnbl0037_1.png"], "messages": [{"role": "user", "content": "<|image|>what is the date mentioned in this letter?"}, {"role": "assistant", "content": "1/8/93"}], "task_name": "qa_sft", "dataset_name": "DocVQA"}
{"image": ["./imgs/DUE_Benchmark/DocVQA/pngs/xnbl0037_1.png"], "messages": [{"role": "user", "content": "<|image|>what is the contact person name mentioned in letter?"}, {"role": "assistant", "content": "p. carter"}], "task_name": "qa_sft", "dataset_name": "DocVQA"}
{"image": ["./imgs/DUE_Benchmark/DocVQA/pngs/xnbl0037_1.png"], "messages": [{"role": "user", "content": "<|image|>what is the contact person name mentioned in letter?"}, {"role": "assistant", "content": "p. carter"}], "task_name": "qa_sft", "dataset_name": "DocVQA"}
{"image": ["./imgs/DUE_Benchmark/DocVQA/pngs/xnbl0037_1.png"], "messages": [{"role": "user", "content": "<|image|>what is the contact person name mentioned in letter?"}, {"role": "assistant", "content": "p. carter"}], "task_name": "qa_sft", "dataset_name": "DocVQA"}

I found duplicated entries like these in train.jsonl. Is this normal?

HAWLYQ commented 4 days ago

Hi @cooleel, some tasks are up-sampled in DocDownstream-1.0, so the repeated entries are intentional and this is normal.
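
For anyone who wants to see how much of train.jsonl is repeated, here is a minimal sketch that counts identical records. It is not part of the mPLUG-DocOwl codebase: the helper names `count_duplicates`, `summarize`, and `count_file` are made up for illustration, and it assumes every non-empty line is one JSON object, as in the excerpt above.

```python
import json
from collections import Counter


def count_duplicates(lines):
    """Count how often each distinct JSON record occurs.

    `lines` is any iterable of JSON strings (for example an open file).
    Records are compared after re-serializing with sorted keys, so two
    lines that differ only in key order count as the same record.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[json.dumps(record, sort_keys=True)] += 1
    return counts


def summarize(counts):
    """Return (total records, distinct records)."""
    return sum(counts.values()), len(counts)


def count_file(path):
    """Hypothetical convenience wrapper: count duplicates in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return count_duplicates(f)
```

Calling `summarize(count_file("train.jsonl"))` from the snapshot directory would give the total and distinct record counts. Since the up-sampling is intentional, deduplicating would change the task mix, so this is probably best used for inspection only.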

cooleel commented 4 days ago

Oh, okay. Thanks for clarifying.