adithya-s-k / omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
https://docs.cognitivelab.in
GNU General Public License v3.0
3.96k stars 317 forks

Tables in pdf cannot be processed properly #18

Open delcompan opened 1 week ago

delcompan commented 1 week ago

The content before and after a table in a PDF file is not processed correctly: the table content and the surrounding article text get mixed together.

pdf-talbe

Methods The recommendations are graded based on a modified Delphi methodology with categorization as previously described (Table 1) [3]. The methods for this document build upon a 2001 publication Table 1 Grading system Grading recommendations

A. Supported by at least 2 level I investigations B. Supported by 1 level I investigation C. Supported by level II investigations only D. Supported by at least 1 level III investigation E. Supported by level IV or V evidence Grading of evidence I. Large, randomized trials with clearcut results; low risk of false-positive (alpha) error or false-negative (beta) error II. Small, randomized trials with uncertain results; moderate-tohigh risk of false-positive (alpha) and/or false-negative (beta) error III. Non-randomized, contemporaneous controls IV. Non-randomized, historical controls and expert opinion V. Case series, uncontrolled studies, and expert opinion sponsored by the International Sepsis Forum, and use the same method of recommendation grading [4]. The grading system was applied to the question from which each recommendation is created. The supplement submission includes background material, questions, and expanded rationale. This executive summary is targeted to be concise and user friendly for the bedside clinician.

Much of the other content is also not recognized correctly. pdf

The following is the processed output: pdf-2

delcompan commented 1 week ago

Below is the original document. The chunking strategy I selected was semantic chunking.

2004-Surviving Sepsis Campaign guidelines for management of severe sepsis and septic shock.pdf

adithya-s-k commented 1 week ago

@delcompan We are currently using SuryaOCR and Marker to parse documents, and they may have some limitations on documents with complex structures. Please refer to the following repositories for further insights: Marker and Surya

We are currently training a few models to fix edge cases like the one you mentioned above.

Thank you for your input.
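
For readers hitting the same problem, here is a minimal sketch of one plausible way table text ends up spliced into the middle of a sentence, and of a post-processing workaround. It is not how omniparse, Marker, or Surya work internally. It assumes you already have text blocks with bounding boxes and a detected table region from some layout step; the `Block` class, the `inside` and `split_table_from_body` helpers, and the coordinates are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical text block with a bounding box (page coordinates)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def inside(block, region):
    """Return True if the block's centre lies within region (x0, y0, x1, y1)."""
    cx = (block.x0 + block.x1) / 2
    cy = (block.y0 + block.y1) / 2
    rx0, ry0, rx1, ry1 = region
    return rx0 <= cx <= rx1 and ry0 <= cy <= ry1


def split_table_from_body(blocks, table_region):
    """Sort blocks top-to-bottom, then keep table blocks apart from body text."""
    ordered = sorted(blocks, key=lambda b: (b.y0, b.x0))
    body = " ".join(b.text for b in ordered if not inside(b, table_region))
    table = [b.text for b in ordered if inside(b, table_region)]
    return body, table


# Illustrative data only: a sentence interrupted by a table on the page.
blocks = [
    Block("The methods build upon a 2001 publication", 0, 10, 100, 14),
    Block("Table 1 Grading system", 0, 20, 100, 28),
    Block("A. Supported by at least 2 level I investigations", 0, 30, 100, 34),
    Block("sponsored by the International Sepsis Forum.", 0, 40, 100, 44),
]
table_region = (0, 15, 100, 35)  # assumed output of a table detector

# Naive top-to-bottom join: the table lands in the middle of the sentence.
naive = " ".join(b.text for b in sorted(blocks, key=lambda b: (b.y0, b.x0)))

# Region-aware join: the sentence stays whole and the table is kept separately.
body, table = split_table_from_body(blocks, table_region)
print(body)
print(table)
```

If your pipeline exposes block coordinates, a filter like this before chunking may keep the body text contiguous, although it depends entirely on the table region being detected correctly in the first place.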

delcompan commented 1 week ago

Thanks, looking forward to your updates!

alalulu8668 commented 4 days ago

The tool is excellent! For smaller PDF files, table and image extraction work quite well. However, with larger files, many figures are not extracted correctly, with figure names and annotations often being cut off. I'm looking forward to future updates that address these issues!