Academic-Hammer / SciTSR

Table structure recognition dataset of the paper: Complicated Table Structure Recognition
https://arxiv.org/pdf/1908.04729.pdf
MIT License

List of complicated tables for training is missing? #41

Closed: doralune closed this issue 1 year ago

doralune commented 1 year ago

I cannot find the training set of complicated tables (2,885 items) in the current download link.

doralune commented 1 year ago

The README says that only the test set is provided, so the training list may never have been published in the first place. I created the training set (2,885 items) with the Python functions below and put the download link here. The same functions also reproduce the test set (716 items), so the result is probably correct.

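import json
from pathlib import Path


def is_complicated_structure(structure_file: Path) -> bool:
    """Return True if the table has at least one non-empty merged cell."""
    with open(structure_file, "r", encoding="utf-8") as f:
        a_dict = json.load(f)
    for cell in a_dict["cells"]:
        if is_merged_cell(cell) and is_non_empty_cell(cell):
            return True
    return False


def is_merged_cell(cell) -> bool:
    """A cell is merged if it spans more than one row or more than one column."""
    if cell["start_row"] != cell["end_row"]:
        return True
    if cell["start_col"] != cell["end_col"]:
        return True
    return False


def is_non_empty_cell(cell) -> bool:
    """A cell is non-empty if its "content" field has at least one entry."""
    # Alternative check on the "tex" field instead:
    # return len(cell["tex"]) > 0
    return len(cell["content"]) > 0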
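
For anyone who wants to rebuild the list themselves, here is a rough sketch of how the check above could be run over a whole folder of structure files. It restates the same rule in one helper so the block runs on its own. The names `has_merged_nonempty_cell`, `list_complicated_tables` and `write_list` are made up for this example, and both the folder layout (one `*.json` structure file per table) and the output format (one file stem per line) are assumptions, not something the repository confirms.

```python
import json
from pathlib import Path
from typing import List


def has_merged_nonempty_cell(structure_file: Path) -> bool:
    """Same rule as above: some non-empty cell spans more than one row or column."""
    with open(structure_file, "r", encoding="utf-8") as f:
        cells = json.load(f)["cells"]
    return any(
        (c["start_row"] != c["end_row"] or c["start_col"] != c["end_col"])
        and len(c["content"]) > 0
        for c in cells
    )


def list_complicated_tables(structure_dir: Path) -> List[str]:
    """Return the sorted file stems of all complicated tables in a folder.

    Assumption: the folder holds one JSON structure file per table.
    """
    return sorted(
        p.stem
        for p in Path(structure_dir).glob("*.json")
        if has_merged_nonempty_cell(p)
    )


def write_list(ids: List[str], out_file: Path) -> None:
    """Write one ID per line (assumed output format)."""
    Path(out_file).write_text("".join(f"{i}\n" for i in ids), encoding="utf-8")
```

Pointing `list_complicated_tables` at the training structure folder and comparing `len(...)` of the result with 2,885 (or 716 for the test folder) is a quick sanity check against the counts quoted above.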