-
**Bug report**
Thanks for finding the bug! To help us fix it, please make sure that you
include the following information:
- A description of the bug
- Steps to reproduce the bug. Try to mini…
-
I don't understand how the logical tags in hOCR should be used. Moreover, I see potential conflicts with other nested tags from the layout. AFAIK ocropus itself does not use any logical tags and tesse…
-
No 404; it just doesn't open.
-
```
/home/muneeb/.local/bin/hocr-pdf:134: DeprecationWarning: decodestring() is a deprecated alias since Python 3.1, use decodebytes()
uncompressed = bytearray(zlib.decompress(base64.decodestring(…
-
Using recode_pdf (internetarchivepdf 1.5.2) and tesseract (5.3.0).
I have three single-page examples, where I:
1. have tesseract make a full PDF from OCR, via eg `tesseract identifier.tiff i…
-
ALTO [supports a `@BASELINE` attribute](https://github.com/altoxml/schema/issues/32) that can define a polyline on which the text rests. [hOCR also includes support](http://kba.cloud/hocr-spec/1.2/#ba…
-
Some hOCR files can't be parsed (version 0.6.0) because their content contains characters with diacritics. For example, the characters "**ůá**" in the words **aráme, ků**
Example hOCR file:
```
…
-
**Describe the bug**
I've tried the plain `pdftotree` command-line utility on a few PDF files with tables, and found that wherever there is a table structure, the last line is usually not captured in the …
-
Previously, we used pdfalto to generate ALTO XML from the PDF and then https://github.com/filak/hOCR-to-ALTO to convert the ALTO XML to an hOCR file. With the newest release of pdfalto this does…
-
### Describe the proposed feature
Hi, I see there are a few issues on the board proposing integrations of new backends.
I wondered how difficult this would be to do naively: it turns out that'…