felix-schmitt / FormulaNet

FormulaNet is a new large-scale Mathematical Formula Detection dataset.
Creative Commons Attribution 4.0 International

FormulaNet

FormulaNet is a new large-scale Mathematical Formula Detection dataset. It consists of 46,672 pages of STEM documents from arXiv, annotated with 13 types of labels. The dataset is split into a train set of 44,338 pages and a validation set of 2,334 pages. For copyright reasons, we can only provide the list of papers; the papers themselves must be downloaded and processed locally.

Labels

Get FormulaNet (Docker option recommended)

Docker option

Prerequisites

The file structure should look like this:

.
├── ...
├── Dataset
│   ├── train
│   │     ├── img   # empty folder
│   │     └── train_coco.json
│   └── test
│         ├── img   # empty folder
│         └── test_coco.json
└── ...

Build the Docker image (amd64 and arm64 are supported):

    docker build -t formulanet --build-arg Platform='amd64' .

Run the container, mounting the Dataset folder of FormulaNet:

    docker run -v ~/<path to FormulaNet folder>/Dataset:/FormulaNet/Dataset formulanet

Classic option

Prerequisites

The file structure should look like this:

.
├── ...
├── Dataset
│   ├── train
│   │     ├── img   # empty folder
│   │     └── train_coco.json
│   └── test
│         ├── img   # empty folder
│         └── test_coco.json
└── ...
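
Before starting the download, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the paths are copied from the tree above on the assumption that the script is run from the repository root.

```python
from pathlib import Path

# Paths copied from the tree above, relative to the repository root.
EXPECTED_PATHS = [
    "Dataset/train/img",
    "Dataset/train/train_coco.json",
    "Dataset/test/img",
    "Dataset/test/test_coco.json",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

Calling `missing_paths()` from the repository root returns an empty list when everything is in place; any returned entries name the folders or files that still need to be created.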

Install the Python dependencies (Python 3.8 recommended):

    pip install -r requirements.txt 

Run the download script:

    python download.py 
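
Once the script has finished, the result can be checked against the annotation files. The sketch below is an illustration under stated assumptions: it assumes the `*_coco.json` files follow the standard COCO detection layout (an `images` list whose entries carry a `file_name`, and a `categories` list) and that each page image is stored in the `img` folder under that file name. The README does not document either detail, and `download_report` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def download_report(coco_json, img_dir):
    """Compare the images listed in a COCO file with the files in img_dir."""
    with open(coco_json, encoding="utf-8") as f:
        coco = json.load(f)

    # Assumption: standard COCO keys "images" (with "file_name") and "categories".
    listed = {Path(img["file_name"]).name for img in coco.get("images", [])}
    present = {p.name for p in Path(img_dir).iterdir() if p.is_file()}

    return {
        "listed": len(listed),                   # images named in the annotations
        "present": len(listed & present),        # of those, how many are on disk
        "missing": sorted(listed - present),     # names still to be downloaded
        "categories": len(coco.get("categories", [])),
    }
```

For example, `download_report("Dataset/train/train_coco.json", "Dataset/train/img")` gives a quick overview of the train split. If the assumptions hold, `listed` and `categories` can be compared with the page and label counts given in the description.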

Baseline Model

Model     mAP         mAP@50      mAP@75      mAP@inline  mAP@display
FCOS-50   0.754±0.03  0.921±0.02  0.84±0.02   0.752±0.02  0.755±0.02
FCOS-101  0.755±0.03  0.920±0.02  0.841±0.02  0.756±0.02  0.749±0.03

The results can be reproduced using the config files (FCOS-50, FCOS-101) and the GitHub repository Yuxiang1995/ICDAR2021_MFD.
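
The README does not define the metrics in the table. Assuming they follow the usual COCO-style convention, mAP@50 and mAP@75 count a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5 or 0.75, respectively. The sketch below illustrates only that matching criterion, with boxes in the COCO `[x, y, width, height]` format; the function names are hypothetical and this is not the evaluation code used for the baseline.

```python
def iou_xywh(a, b):
    """IoU of two boxes given as [x, y, width, height]."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """True if the predicted box overlaps the ground truth enough to count."""
    return iou_xywh(pred, gt) >= threshold
```

For instance, a prediction `[0, 0, 10, 10]` against a ground truth `[0, 0, 10, 8]` has an IoU of 0.8 and matches at both thresholds, while the same prediction against `[5, 0, 10, 10]` has an IoU of one third and matches at neither.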

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Citation

FormulaNet: A Benchmark Dataset for Mathematical Formula Detection

Felix M. Schmitt-Koopmann, Elaine M. Huang, Hans-Peter Hutter, Thilo Stadelmann, Alireza Darvishy

https://ieeexplore.ieee.org/document/9869643

@ARTICLE{9869643,
    author={Schmitt-Koopmann, Felix M. and Huang, Elaine M. and Hutter, Hans-Peter and Stadelmann, Thilo and Darvishy, Alireza},
    journal={IEEE Access},
    title={FormulaNet: A Benchmark Dataset for Mathematical Formula Detection},
    year={2022},
    volume={10},
    number={},
    pages={91588-91596},
    doi={10.1109/ACCESS.2022.3202639}
}