-
**Bug report**
The new hOCR renderer does not escape characters that need escaping. [This PDF](https://github.com/pdfminer/pdfminer.six/files/10032060/AandP.pdf) contains the string "A&P", which sh…
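A minimal sketch of what escaping looks like for the "A&P" case, assuming the problem is markup escaping (hOCR is an HTML-based format, so `&`, `<` and `>` must be written as entities). The helper name `escape_hocr_text` is hypothetical and is not taken from OCRmyPDF; only the standard library's `html.escape` is used.

```python
from html import escape


def escape_hocr_text(text: str) -> str:
    """Escape &, < and > so the text is safe inside HTML/hOCR markup.

    Hypothetical helper for illustration; not OCRmyPDF's implementation.
    """
    # quote=False leaves " and ' untouched, which is enough for element text.
    return escape(text, quote=False)


if __name__ == "__main__":
    print(escape_hocr_text("A&P"))  # prints A&amp;P
```

A reader of the markup would then apply `html.unescape` to recover the original string.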
-
### Simple sanity checks
- [X] This is an issue with an app that uses OCRmyPDF for OCR
- [ ] I am using a recent version of the third party app
- [ ] I will include a file that reproduces the issue
…
deict updated
1 month ago
-
Hi,
I'm extracting data from a PDF with native text, and some rows of the table have their content shuffled, as you can see in this [live example](https://colab.research.google.com/drive/1HyAe4eWbC2gH…
-
### Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
### Branch name
master
### Commit ID
None
### Other environment information
_No response_
### Actual beh…
-
## Value Statement
As someone who wants a boring way to use AI
I would like to expose an image/PDF/document to the LLM
So that I can make requests and extract information, all within Ramalama
…
-
Hello,
in Nextcloud it is not possible to index PDF content from scanned documents. The reason for this is the PDF file format itself. When you scan a document and save it to PDF there is no "real …
-
**Describe the bug**
User gets a `TesseractError` when processing a particular document.
**To Reproduce**
Code was an API call with a certain image-based document.
**Expected behavior**
Docum…
qued updated
5 months ago
-
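Hello, while trying out the PDF parsing I ran into some questions:
1. Judging by the category_id classes, "category_id":1 is plain_text (body paragraph text) and the latex in "category_id":5 is table text. However, in the JSON file of the parsing results, "category_id":1 has no text; only the ocr_text entries with "category_id":15 have text. Can ocr_text be understood as, apart from…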
-
**Describe the bug**
A strange one.
`IndexError: list index out of range` when OCR'ing a portion of a PDF document, but depending on the split size, it doesn't always happen. My guess is that the firs…
cw5d updated
4 months ago
-
Hello Julian,
Great work. But it would be a whole lot better if the OCR'd PDF were displayed in the browser. Could we integrate PDF.js or a simpler integration like (https://github.com/alekswe…