chongzhangFDU / ROOR-Datasets

This is the official release of the datasets introduced in the EMNLP 2024 paper: Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding.
Creative Commons Attribution 4.0 International

Clarification on Data Construction and ro_linkings Coherence Issue #2

Closed cryingjin closed 10 hours ago

cryingjin commented 10 hours ago

I have been closely following your related research and appreciate your work. However, I have some questions regarding the data construction. Was the dataset built automatically?

I expected that by using the ro_linkings, I would be able to logically connect the text within the document in a coherent flow. However, when I preprocess and concatenate the text according to the ro_linkings order, the result is quite disjointed and lacks coherence. Do you have any insights into what might be causing this issue?

e.g. 92657311_7313

Ordered Text: LTS. 100'S S. J. Farnham NO. STORES SUBMISSION DATE: EFFECTIVENESS OF PRE- SELL (REPORT ON OCT 3 ONLY). JAN 23, 1995 DISTRIBUTION: R. B. SPELL PROMOTIONAL IMPACT: 9 0 % CLASSIFIED CALLS 2 % ANNUAL CALLS 100'S % OF DISTRIBUTION ACHIEVED IN RETAIL OUTLETS: Excellent. Continues to drive all carton business SALES FORCE 20'S $ 50 OFF PACK: DIRECT ACCOUNT AND CHAIN VOIDS (USE X TO INDICATE A VOID). SUBJECT: DEC 26 NONE OCT 31 ACCOUNT OCT 3 FROM: HARLEY DAVIDSON 100'S CIGARETTES PROGRESS REPORT TO: $ 5.00 OFF CARTON: Excellent but quickly depleted. Excellent movement when couponed. Without coupons, movement slows dramatically!

cryingjin commented 10 hours ago

There are cases where the entity_id in ro_linkings does not actually exist.
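A minimal sketch of how such cases could be detected, assuming each item in ro_linkings is a pair of ids and that the ids present in a sample are available as a list. The function name `find_dangling_links` and the toy values are hypothetical and not taken from the dataset.

```python
def find_dangling_links(ids, ro_linkings):
    """Return every link with at least one endpoint missing from `ids`."""
    known = set(ids)
    return [
        tuple(link)
        for link in ro_linkings
        if link[0] not in known or link[1] not in known
    ]


# Toy values for illustration only (not real dataset content).
entity_ids = [0, 1, 2]
ro_linkings = [[0, 1], [1, 5], [7, 2]]

dangling = find_dangling_links(entity_ids, ro_linkings)
print(dangling)
```

A non-empty result means the links refer to ids outside the set they were checked against, which suggests they index a different kind of object.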

chongzhangFDU commented 10 hours ago
  1. The dataset is built manually.
  2. Sorry for the misleading description. ro_linkings are defined between segments, not entities, so the indices in ro_linkings are segment ids, not entity_ids. We have corrected this in the README file.
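
As a rough illustration of reading ro_linkings as segment ids, the sketch below linearizes segments with a topological sort. It assumes each link is a `(src, dst)` pair meaning "segment `src` is read before segment `dst`" and that segments are available as an id-to-text mapping; both are assumptions, and the README remains the authoritative description of the format. The function name `order_segments` and the toy data are hypothetical. Because ordering relations can form a graph rather than a single chain, the result is only one linearization consistent with the links.

```python
import heapq
from collections import defaultdict


def order_segments(segments, ro_linkings):
    """Return segment ids in an order consistent with ro_linkings.

    segments: dict mapping segment id -> text (assumed layout).
    ro_linkings: iterable of (src, dst) pairs of segment ids (assumed layout).
    """
    successors = defaultdict(list)
    indegree = {sid: 0 for sid in segments}
    for src, dst in ro_linkings:
        if src not in segments or dst not in segments:
            raise KeyError(f"link ({src}, {dst}) refers to an unknown segment id")
        successors[src].append(dst)
        indegree[dst] += 1

    # Kahn's algorithm; a heap breaks ties by smallest id for a stable result.
    ready = [sid for sid, deg in indegree.items() if deg == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        sid = heapq.heappop(ready)
        ordered.append(sid)
        for nxt in successors[sid]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(ordered) != len(segments):
        raise ValueError("ro_linkings contains a cycle")
    return ordered


# Toy values for illustration only (not real dataset content).
segments = {0: "world", 1: "Hello", 2: "again"}
links = [(1, 0), (0, 2)]

order = order_segments(segments, links)
text = " ".join(segments[sid] for sid in order)
print(order, text)
```

Segments with no links between them are ordered here by id, which is an arbitrary tie-break; a layout-aware rule (for example, top-to-bottom position) may give more readable text.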