Improve OCR Accuracy for Complex Scientific Tables

taiwanhuachenyu commented 1 month ago

When recognizing complex tables containing scientific data, the current OCR system exhibits several accuracy issues. The main problems identified are:

- Inaccurate table structure recognition: The system fails to correctly identify and preserve the original table's column and row structure and their relationships.
- Column header recognition failure: Important column headers such as "Antibody", "VH Chain", "VL Chain" are not correctly recognized, resulting in loss of data context.
- Data association errors: Values are not correctly associated with their corresponding column headers, leading to confusion between data from different columns.
- Compromised data integrity: Some values (such as binding affinity KD values) are incorrectly split or combined, affecting data accuracy.
- Special character and abbreviation recognition issues: Scientific notations like "SEQ ID NO:" and units such as "nM" are not correctly recognized or preserved.

Suggested improvements:

- Enhance recognition capabilities for structured scientific data.
- Improve algorithms for column header and table header recognition.
- Increase accuracy in matching values to their corresponding columns.
- Optimize recognition of scientific notations and units.

VikParuchuri commented 1 month ago

I am unable to reproduce this using the image you provided:

By default, the table text is extracted from the PDF itself, and sometimes PDFs have bad text in them. In that case, use the "detect cell bboxes" option to re-detect the cells and re-OCR the text.

conjuncts commented 1 month ago

I don't know if this is applicable at all, but I happened to get similar-looking output when passing a table_bbox that didn't match the highres_image. When I cropped the highres_image to the same size as the table_bbox, it was fixed.
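
In case it helps anyone check for the same mismatch, here is a rough sketch of the idea. It is not tabled's API: `scale_bbox` and `bbox_matches_image` are hypothetical helpers, and the page and image sizes are made-up example values (a US Letter page in points, rendered at 300 DPI). It assumes the bbox is `(x0, y0, x1, y1)` with the origin at the top-left.

```python
def scale_bbox(bbox, src_size, dst_size):
    """Map an (x0, y0, x1, y1) box from one coordinate space to another.

    src_size and dst_size are (width, height) of the two spaces, e.g. the
    PDF page in points and the rendered high-res image in pixels.
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x0, y0, x1, y1 = bbox
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))


def bbox_matches_image(bbox, image_size, tol=1):
    """Return True if the box's width and height equal the image size (within tol pixels)."""
    x0, y0, x1, y1 = bbox
    return (abs((x1 - x0) - image_size[0]) <= tol
            and abs((y1 - y0) - image_size[1]) <= tol)


# Hypothetical example values, not taken from the issue.
page_size = (612, 792)        # page size in PDF points
image_size = (2550, 3300)     # the same page rendered at 300 DPI
table_bbox = (72, 144, 540, 432)  # table location in page coordinates

# Box to crop out of the high-res page image so that it covers only the table.
crop_box = scale_bbox(table_bbox, page_size, image_size)
crop_size = (crop_box[2] - crop_box[0], crop_box[3] - crop_box[1])

# The full-page image does not match the table box; the cropped one does.
full_page_ok = bbox_matches_image(crop_box, image_size)
cropped_ok = bbox_matches_image(crop_box, crop_size)
```

With Pillow, `crop_box` can be passed directly to `Image.crop`, which takes a `(left, upper, right, lower)` tuple. Whether tabled requires the image and bbox to line up this way is only my guess from the behavior I saw.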

VikParuchuri / tabled

Improve OCR Accuracy for Complex Scientific Tables #6