-
I want to distinguish dotted lines from solid lines. A sample is attached here:
[Buprenorphine.pdf](https://github.com/jsvine/pdfplumber/files/6163296/Buprenorphine.pdf)
Here I want to ignore dotted li…
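
One possible way to separate the two kinds of line is sketched below. It assumes the dotted line is drawn as many short horizontal segments at the same vertical position, which may not hold for this PDF (a line stroked once with a dash pattern would show up as a single long segment). The function name, the thresholds, and the input format are hypothetical; the `x0`, `x1`, and `top` keys mirror the ones pdfplumber reports for line objects.

```python
from collections import defaultdict


def split_dotted_and_solid(segments, max_dash_len=6.0, min_run=3, tol=1.0):
    """Split horizontal segments into (solid, dotted) lists.

    Assumption: a dotted line is a run of at least `min_run` short
    segments (each at most `max_dash_len` wide) sharing one row.
    """
    rows = defaultdict(list)
    for seg in segments:
        # Bucket segments by rounded vertical position (coarse, not exact).
        rows[round(seg["top"] / tol)].append(seg)

    solid, dotted = [], []
    for row in rows.values():
        short = [s for s in row if s["x1"] - s["x0"] <= max_dash_len]
        long_ = [s for s in row if s["x1"] - s["x0"] > max_dash_len]
        solid.extend(long_)
        # Only treat short segments as dots when enough of them line up.
        if len(short) >= min_run:
            dotted.extend(short)
        else:
            solid.extend(short)
    return solid, dotted
```

The `solid` list could then be used for table detection while `dotted` is discarded. The thresholds would need tuning against the actual document.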
-
Leading on from #2
## Proposed features
- Zenodo/Zotero/Google Sheet as faceted sources.
## Ideas
- #6
- #7
- https://github.com/digipres/publications/issues/8 and for search!
- h…
-
To properly extract certain text in PDF, it may be necessary to detect/group lines, identify tables, equations. This may either be done post-extraction of objects or before, depending on what is easi…
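
As a rough illustration of the post-extraction route, the sketch below groups already-extracted word boxes into text lines by vertical position. The input format (dicts with `text`, `x0`, and `top` keys), the function name, and the tolerance are assumptions made for this example, not part of any existing implementation.

```python
def group_into_lines(words, y_tol=3.0):
    """Group word boxes into lines, top to bottom, left to right.

    A word joins the current line when its `top` is within `y_tol`
    of the first word of that line; otherwise it starts a new line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Restore reading order inside each line.
    return [sorted(line, key=lambda w: w["x0"]) for line in lines]
```

Tables and equations would need additional heuristics on top of this, such as column alignment across consecutive lines.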
-
# table2matrix
Datasheets contain merged cells when a unit or condition applies to multiple rows. Headers might also be merged. When iterating over the data row-wise, we need to first resolve the merged ce…
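
A minimal sketch of one way to resolve vertically merged cells is shown here. It assumes the extracted table is a list of rows in which the continuation of a merged cell appears as `None`; the function name and the `fill_columns` parameter are hypothetical and do not describe the actual table2matrix code.

```python
def resolve_merged_cells(rows, fill_columns=None):
    """Fill None cells with the value from the row above.

    `fill_columns` optionally restricts filling to the given column
    indices (e.g. only the unit and condition columns).
    """
    resolved = []
    previous = None
    for row in rows:
        current = list(row)  # copy so the input is left untouched
        for i, cell in enumerate(current):
            fillable = fill_columns is None or i in fill_columns
            if cell is None and fillable and previous is not None and i < len(previous):
                current[i] = previous[i]
        resolved.append(current)
        previous = current
    return resolved
```

For example, `[["Vcc", "min", 1.8, "V"], [None, "max", 3.6, None]]` becomes two full rows that both carry `"Vcc"` and `"V"`. Genuinely empty cells are indistinguishable from merged ones under this assumption, which is why restricting the fill to specific columns may be safer.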
-
Hi Team, can someone help me modify the code to process all documents with a .pdf extension, run them through DocAI, and load the results into BQ?
I tried the code below, but when I run `python main.py`, nothing…
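
The batch part of this request can be sketched independently of Document AI and BigQuery. In the example below, `process_one` is a hypothetical placeholder for whatever function sends a single PDF to Document AI and loads the result into BigQuery; nothing here reflects the original `main.py`, which is not shown.

```python
from pathlib import Path


def process_all_pdfs(folder, process_one):
    """Call `process_one(path)` for every .pdf file directly in `folder`.

    Returns a dict mapping each file name to the callback's result.
    `process_one` is a hypothetical stand-in for the real pipeline step.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        # Match the extension case-insensitively (.pdf, .PDF, ...).
        if path.is_file() and path.suffix.lower() == ".pdf":
            results[path.name] = process_one(path)
    return results
```

Calling `process_all_pdfs("input_dir", my_docai_to_bq)` would then handle every PDF in `input_dir`, where both names are illustrative.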
-
Attached is the part of the PDF that I am trying to extract.
I am doing extraction using:
textract_json = call_textract(input_document="s3:url",
features=[Textract_Featur…
-
Hi,
I'm extracting data from PDF with native text and some rows of the table have their content shuffled, as you can see in this [live example](https://colab.research.google.com/drive/1HyAe4eWbC2gH…
-
Hello
Thanks for this great library, which has been very convenient for me.
I want to report two problems I ran into with it.
1. When the table has one cell which contains text with blue color and no backgro…
-
We can generalize the algorithm inside [the PDF plugin](https://github.com/turicas/rows/tree/feature/plugin-pdf) to receive objects from an OCR and then extract tables from images!
The tasks related …
-
**Describe the bug**
A strange one.
`IndexError: list index out of range` when OCR'ing a portion of a pdf doc, but depending on the split size, it doesn't always happen. My guess is that the firs…