Failed to find the annotation of QA tasks and Logical Relationships

UniModal4Reasoning / DocGenome

DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models

https://unimodal4reasoning.github.io/DocGenome_page/

Creative Commons Attribution 4.0 International

Failed to find the annotation of QA tasks and Logical Relationships #3

Closed BrownXing closed 3 months ago

BrownXing commented 4 months ago

Thank you for sharing this awesome work! I downloaded the training dataset from the Google Drive link provided in the repo, but I ran into problems finding the labels for the QA tasks and Logical Relationships. Could you please indicate where to find the test set containing the QA annotations? Also, I could not find the Logical Relationships in the training set. The data for a single document is organized as:

file_name
|-layout_annotations.json
|-order_annotations.json
|-page_xxx.jpg
|-quality_report.json
|-reading_annotations.json

The values of 'previous_block', 'parent_block' and 'next_block' in order_annotations.json are null. Did I overlook any key points when parsing the dataset?

BOBrown commented 3 months ago

@BrownXing Thanks for your interest in our work. As described in Sec. 4.2 of our paper, we designed QA pairs only for DocGenome-test, which will be released on Hugging Face within the next two days.

BrownXing commented 3 months ago

Thank you very much! Additionally, I have not been able to locate the annotations for the Logical Relationships between the component units. Could you please help me find the file?

MaoSong2022 commented 3 months ago

@BrownXing Please refer to order_annotations.json for the Logical Relationships annotations. Usually, there are two fields in order_annotations.json, annotations and orders:

orders is a list of triples; each triple gives a relationship type between two bounding boxes, identified by the ids from and to.

[Image: example of orders]

annotations is also a list, containing the necessary information about each bounding box.

(The example comes from "astro-ph.CO/0911.2655".)

Due to time constraints, the training dataset contains fewer Logical Relationships annotations than the test dataset. Specifically, the "implicit-cite" relation does not appear in the training dataset, and "explicit-cite" does not cover cross-references between text and float environments such as tables and figures.

Also, since this is a large project, some annotations may have been lost at an early stage. For example, some order_annotations.json files may not contain "annotations"...

Feel free to ask any further questions!
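
For readers who want to inspect these files, the description above can be sketched in Python as follows. This is a minimal sketch, not the authors' tooling: it assumes the file is a JSON object with top-level "annotations" and "orders" lists, and that each entry in orders is an object whose relationship type is stored under a key named "type". That key name and the helper names load_order_annotations and count_relation_types are assumptions made for illustration; check them against a real file before relying on them.

```python
import json
from collections import Counter


def load_order_annotations(path):
    """Return (annotations, orders) from an order_annotations.json file.

    Either list may come back empty: per the thread, some files do not
    contain "annotations" at all.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # `or []` also covers the case where the field is present but null.
    annotations = data.get("annotations") or []
    orders = data.get("orders") or []
    return annotations, orders


def count_relation_types(orders, type_key="type"):
    """Count how often each relationship type occurs in `orders`.

    `type_key` is an ASSUMED key name, not confirmed by the dataset docs.
    """
    return Counter(
        entry.get(type_key, "<missing>")
        for entry in orders
        if isinstance(entry, dict)
    )


# Hypothetical usage (the path is a placeholder):
#   annotations, orders = load_order_annotations("file_name/order_annotations.json")
#   print(len(annotations), count_relation_types(orders))
```

Counting the relationship types per file is a quick way to confirm which relations are actually present in a given split, for instance that "implicit-cite" is absent from the training data.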

BrownXing commented 3 months ago

Thank you very much, I have found it. Thanks again for your excellent work!

Ao-Last commented 3 months ago

Have you released the test split? I'd like to reference it as one of the benchmarks :)

sky-fly97 commented 3 months ago

Thank you for your interest. We expect to release it within the week.

sky-fly97 commented 3 months ago

Hello, we have released the test set; you can download it here