
DUDE: Document UnderstanDing of Everything Benchmark

License: GPL v3

Shared repository to work with the DUDE benchmark, used in the ICDAR 2023 Competition on Document UnderstanDing of Everything. The competition deadline is April 20, 2023. Be sure to check the RRC platform for the latest updates, replicated here under Announcements.

The repository collects a number of tools, outlined below.

Table of Contents:

- Download the dataset
- Load the dataset
- Predictions format and running evaluation
- Pre-computed OCR
- Dataset and benchmark paper
- Announcements

Download the dataset

The dataset is publicly available via the links at https://rrc.cvc.uab.es/?ch=23&com=downloads. This requires registering on the RRC platform, only so that we can keep track of how many participants are interested in the competition. You can also download the binaries (PDF & OCR) and extract them into a custom data_dir.
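For example, to extract the downloaded archive from Python (the archive filename below is illustrative; use whatever file you obtained from the downloads page):

import tarfile

# Illustrative filename; substitute the archive downloaded from the RRC portal.
with tarfile.open("DUDE_train-val-test_binaries.tar.gz") as archive:
    archive.extractall("/DUDE_train-val-test_binaries")  # your custom data_dir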

Load the dataset

The suggested way to load the dataset is via the Hugging Face datasets loader: https://huggingface.co/datasets/jordyvl/DUDE_loader

from datasets import load_dataset
ds = load_dataset("jordyvl/DUDE_loader", 'Amazon_original')  # automatically downloads the binaries tar and extracts it to HF_CACHE
ds = load_dataset("jordyvl/DUDE_loader", 'Amazon_original', data_dir="/DUDE_train-val-test_binaries")  # with a custom extracted data directory

The second argument loads a specific OCR configuration; have a look at DUDEConfig to understand how to call different versions.
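To see what a loaded split contains, inspect the features rather than assuming a fixed schema:

print(ds['train'].features)  # schema of the loaded configuration
print(ds['train'][0])        # first training sample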

Additionally, the data loader repository includes a script to convert the dataset to the ImDB format, popularly used in visual question answering benchmarks.

Predictions format and running evaluation

Check out our standalone repository which explains it all: https://github.com/Jordy-VL/DUDEeval
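As a rough sketch only (the key names here are hypothetical; the DUDEeval repository documents the authoritative schema), a predictions file is a JSON collection of per-question answers:

import json

# Hypothetical keys for illustration; consult https://github.com/Jordy-VL/DUDEeval for the real format.
predictions = [
    {"questionId": "example-question-id", "answers": ["example answer"]},
]
with open("predictions.json", "w") as out:
    json.dump(predictions, out, ensure_ascii=False)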

Pre-computed OCR

We provide OCR outputs to help participants of DUDE. Note that using the attached OCR is not required; you may use your own preferred OCR service (as long as you mention it with your submission).

Specifically, the provided OCR outputs include:

- Azure
- Amazon
- Tesseract

The Azure and Amazon outputs were obtained directly from the PDF files. Since Tesseract does not support PDF inputs, we converted the PDFs to TIFFs (200 dpi) before running it. This conversion failed for three files due to format limitations.
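The exact conversion tooling is not prescribed here; a minimal sketch of an equivalent PDF-to-TIFF step, assuming the pdf2image package (which requires poppler), could look like:

from pdf2image import convert_from_path

# Render each PDF page at 200 dpi and save it as a TIFF (assumes poppler is installed).
pages = convert_from_path("document.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save(f"document_page{i}.tiff", "TIFF")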

In addition to the engine-specific outputs (_original), we provide OCR in the unified format (_due) introduced by the authors of the DUE Benchmark (https://github.com/due-benchmark). A toy reader for these files is given below:

import json
from typing import Dict, Literal

def read_document(
        file_id: str,
        subset: Literal['train', 'val', 'test'] = 'train',
        ocr_engine: Literal['Azure', 'Amazon', 'Tesseract'] = 'Azure'
    ) -> Dict:
    # Read OCR results in DUE format; subset is kept for the caller's bookkeeping,
    # as the DUE JSON path depends only on the OCR engine and file id.
    with open(f'OCR/{ocr_engine}/{file_id}_due.json') as ins:
        data: Dict = json.load(ins)
    return data
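Usage is then straightforward (the file identifier below is a placeholder; real IDs ship with the dataset annotations):

due_ocr = read_document('example_file_id', subset='val', ocr_engine='Tesseract')
print(due_ocr.keys())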

Dataset and benchmark paper (in progress)

The dataset, the benchmark tasks, and the evaluation criteria are described in detail in the [dataset paper](). To cite the dataset, please use the following BibTeX entry:

@inproceedings{dude2023icdar,
    title={ICDAR 2023 Competition on Document UnderstanDing of Everything (DUDE)},
    author={Van Landeghem, Jordy and Borchmann, Łukasz and Tito, Rubèn and Pietruszka, Michał and Jurkiewicz, Dawid and Powalski, Rafał and Józiak, Paweł and Biswas, Sanket and Coustaty, Mickaël and Stanisławek, Tomasz},
    booktitle={Proceedings of the ICDAR 2023},
    year={2023}
}

Announcements

FYI, see the Discussions tab :)