DS4SD / docling-ibm-models

MIT License

TF response to HTML #29

Open mllife opened 2 months ago

mllife commented 2 months ago

Is there any helper code in the repo for converting the TableFormer (TF) response to HTML?

I see some code that might be relevant, though it may be for dataset conversion and I'm not sure it is the right place: https://github.com/DS4SD/docling-ibm-models/blob/620ce428c66928e670d47004bbb563e1779070e4/docling_ibm_models/tableformer/data_management/tf_predictor.py#L1086

Any insight would be helpful.

maxmnemonic commented 2 months ago

TableFormer generates structure predictions in OTSL+ format (OTSL with header support). To convert an OTSL structure, represented as a list of OTSL tags, into an HTML structure (a list of HTML tags), you can use this function: otsl_to_html

The OTSL format is described in our paper, Optimized Table Tokenization for Table Structure Recognition; using it brings big benefits in quality and performance. It has a limited vocabulary:

- "ecel": empty cell
- "fcel": full cell
- "lcel": left-looking span cell
- "ucel": up-looking span cell
- "xcel": cross cell (or 2D span cell)
- "nl": new line

The paper describes the semantics and logic behind these tags in more detail.

OTSL+ is an extension of OTSL with extra tags that mark cells as:

- "ched": column headers
- "rhed": row headers
- "srow": section rows

The model predicts these tags sequentially in the tag decoder, simultaneously with bounding boxes from the bbox decoder. We can then convert the prediction to any other format, e.g. Markdown or HTML.
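To make the tag semantics above concrete, here is a minimal Python sketch of how a flat list of OTSL+ tags could be turned into HTML structure tags. It is not the repo's otsl_to_html implementation: the name otsl_to_html_sketch, the output token format, and the choice to render "ched"/"rhed" as `<th>` and "srow" as `<td>` are all assumptions made for illustration. It also assumes well-formed input in which every "lcel", "ucel", and "xcel" belongs to a spanning cell to its left or above.

```python
from typing import List

# Tags that start a real cell: OTSL cells plus the OTSL+ header tags.
CELL_TAGS = {"fcel", "ecel", "ched", "rhed", "srow"}
# Assumption for this sketch: header cells become <th>, everything else <td>.
HEADER_TAGS = {"ched", "rhed"}


def split_rows(tags: List[str]) -> List[List[str]]:
    """Split a flat OTSL tag sequence into rows at every "nl" tag."""
    rows: List[List[str]] = []
    current: List[str] = []
    for tag in tags:
        if tag == "nl":
            rows.append(current)
            current = []
        else:
            current.append(tag)
    if current:  # tolerate a missing trailing "nl"
        rows.append(current)
    return rows


def otsl_to_html_sketch(tags: List[str]) -> List[str]:
    """Hypothetical helper: convert OTSL(+) tags to a list of HTML tags."""
    rows = split_rows(tags)
    html = ["<table>"]
    for r, row in enumerate(rows):
        html.append("<tr>")
        for c, tag in enumerate(row):
            if tag not in CELL_TAGS:
                continue  # lcel/ucel/xcel are covered by a spanning cell
            # Column span: count consecutive "lcel" tags to the right.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "lcel":
                colspan += 1
            # Row span: count consecutive "ucel" tags below, same column.
            rowspan = 1
            while (
                r + rowspan < len(rows)
                and c < len(rows[r + rowspan])
                and rows[r + rowspan][c] == "ucel"
            ):
                rowspan += 1
            name = "th" if tag in HEADER_TAGS else "td"
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            html.append(f"<{name}{attrs}>")
            html.append(f"</{name}>")
        html.append("</tr>")
    html.append("</table>")
    return html


if __name__ == "__main__":
    # A 2x2 spanning cell next to two ordinary cells.
    example = ["fcel", "lcel", "fcel", "nl", "ucel", "xcel", "fcel", "nl"]
    print("".join(otsl_to_html_sketch(example)))
```

The sketch only reconstructs structure. Cell text would still have to be matched to cells separately, for example through the predicted bounding boxes.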

By the way, higher-level usage of docling-ibm-models can be seen in docling itself: https://github.com/DS4SD/docling

mllife commented 2 months ago

@maxmnemonic, can you link to the code that does this, or add it as a test or sample notebook in this repo? It would be really helpful for everyone. Thanks.

maxmnemonic commented 2 months ago

Thanks for the suggestion, @mllife. Indeed, we can add some good examples purely related to tables in this repo.

mllife commented 1 week ago

@maxmnemonic, any update on this? Can you add some sample code for it, or a test like this one: https://github.com/DS4SD/docling-ibm-models/blob/main/tests/test_tf_predictor.py