-
Progress has been made on text extraction from PDF.
It would be good to integrate a process like the ones used in https://github.com/VikParuchuri/marker and https://github.com/VikParuchuri/surya.
That wo…
-
The library is unable to fetch the text under Manufacturer/Model. However, we are able to see it via the AWS Textract console.
-
It looks like the only way to capture the output of amazon-textract is to redirect it into a file, such as:
amazon-textract --input-document "s3://somebucket/2022-04-16-0010.jpg" --pretty-print LI…
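
For comparison, a CLI's output can also be captured from Python instead of through shell redirection. This is a minimal sketch, assuming the tool writes its result to standard output. `capture_stdout` is a hypothetical helper (not part of amazon-textract), and the command in the demo is a stand-in so the sketch runs without AWS credentials.

```python
import subprocess
import sys


def capture_stdout(cmd):
    """Run cmd (a list of arguments) and return its standard output as text.

    Hypothetical helper for illustration; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Stand-in command; replace the list with the real CLI invocation.
    output = capture_stdout([sys.executable, "-c", "print('hello')"])
    print(output)
```

Passing the arguments as a list avoids shell quoting problems with values such as S3 URLs.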
-
The attached input document contains text, then a table, followed by more text. We want the output text file to match the input PDF file.
![input_page](https://github.com/user-attachments/assets/fe…
-
**Is your feature request related to a problem? Please describe.**
Tesseract does not handle the PDFs I'd like to OCR well enough.
**Describe the solution you'd like**
I want to be able to…
-
An exception occurs while running the code: Exception in thread "main" java.lang.IllegalArgumentException: Invalid option: software.amazon.awssdk.awscore.client.config.AwsClientOption@44e81672. Requi…
-
When I send a PDF with the following paragraph (which is a bit tilted, part of [this PDF file](https://www.accessdata.fda.gov/cdrh_docs/pdf/P010032A.pdf))
and use `Document.get_text()`, I get the f…
-
Was trying to get `pipeline_merge_tables` working and ended up finding a small issue. The default validation function breaks when there are no tables in the current or next page, which means that the …
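
One possible workaround is to wrap the validation function so it never runs on a page without tables. This is only a sketch under assumptions: it supposes the validation function receives the tables of the current page and of the next page as its first two arguments, which may not match the library's actual signature. `guard_empty_tables` and the argument names are hypothetical.

```python
def guard_empty_tables(validate):
    """Wrap a validation function so empty table lists return False.

    Hypothetical helper: assumes validate(current_tables, next_tables, ...)
    and treats "no tables on either page" as "nothing to merge".
    """
    def wrapper(current_tables, next_tables, *args, **kwargs):
        # Skip validation entirely when either page has no tables.
        if not current_tables or not next_tables:
            return False
        return validate(current_tables, next_tables, *args, **kwargs)

    return wrapper


if __name__ == "__main__":
    # Stand-in validator that would raise IndexError on empty input.
    def fragile(current_tables, next_tables):
        return current_tables[-1] == next_tables[0]

    safe = guard_empty_tables(fragile)
    print(safe([], ["t1"]))
    print(safe(["t1"], ["t1"]))
```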
-
The current implementation extracts the ReadingOrder from the top-level parents of all `WORD` blocks (in the order of these word blocks). This seems to be necessary for cases with `TABLE` results.
…
-
I noticed that even when testing extreme values of heuristic_line_break_threshold, heuristic_overlap_ratio, and heuristic_h_tolerance there was no change in the output. This led me to examine their us…