UB-Mannheim / ocrd_pagetopdf

OCR-D wrapper for prima-pagetopdf
Apache License 2.0

itextpdf installation does not work #4

Closed (bertsky closed 4 years ago)

bertsky commented 4 years ago

After running make install on Ubuntu 18.04 with OpenJDK 11.0.5 and running an example workflow, I get:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 (file:venv/share/ocrd_pagetopdf/ptp/lib/itextpdf-5.5.2.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of com.itextpdf.text.io.ByteBufferRandomAccessSource$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

And the PDFs generated have no text layer or visual annotation.

Is there a specific Java runtime required?

JKamlah commented 4 years ago

The prima developers don't use the current version of itextpdf, so I guess this problem could stem from that. Sorry, I am not really deep into the specifics here. Maybe someone else can help.

bertsky commented 4 years ago

Okay, so switching to OpenJDK 8 helped make that warning go away.
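A quick way to confirm which runtime is actually picked up is to parse the output of java -version. The sketch below is only an illustration: the helper name java_major is made up for this example, and treating 8 as the expected major version simply reflects what worked in this thread, not a documented requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ocrd_pagetopdf): extract the major
# version from a Java version string.
#   "1.8.0_242" -> 8   (legacy 1.x scheme)
#   "11.0.5"    -> 11
java_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}"; echo "${v%%.*}" ;;
        *)   echo "${v%%[.+-]*}" ;;
    esac
}

# Only inspect the runtime if a java binary is on PATH.
if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr; the version is the first quoted string.
    raw="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
    if [ -n "$raw" ] && [ "$(java_major "$raw")" != "8" ]; then
        echo "Java $raw found; this thread only confirms Java 8 as warning-free" >&2
    fi
fi
```

On Ubuntu, sudo update-alternatives --config java can be used to switch the default among the installed runtimes.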

But I still get no text layer! (My source file group had TextEquiv at the word, line and region level.)

How is this supposed to work?

JKamlah commented 4 years ago

Did you set the "-text-source" parameter correctly? https://github.com/JKamlah/ocrd_pagetopdf/blob/master/ocrd-tool.json#L38

JKamlah commented 4 years ago

Maybe I should set it to a default value like "T".

bertsky commented 4 years ago

> Did you set the "-text-source" parameter correctly? https://github.com/JKamlah/ocrd_pagetopdf/blob/master/ocrd-tool.json#L38

Here's what I did:

ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "T", "text-source": "T"}'

where assets is our GT test repo.

JKamlah commented 4 years ago

My bad. It works with:

ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "R", "text-source": "R"}'

bertsky commented 4 years ago

The data seems to have some negative coordinate values; with the added script it works:

ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "T", "text-source": "R", "negative2zero":true}'

No, negative2zero makes no difference. But I do get text when I set text-source to something other than T (both the word and the region level work, without negative2zero). So at least there is a bug at the textline level.

Also, I don't see outlines other than on the word level. Perhaps because the other levels have non-rectangular polygons?

bertsky commented 4 years ago

Another problem seems to be that letters like ſ are lost.

JKamlah commented 4 years ago

The lost-letter problem should be solved by using another font:

JKamlah commented 4 years ago

"font":"/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"

bertsky commented 4 years ago

> The lost-letter problem should be solved by using another font: "font":"/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"

Indeed, thanks!
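To find a font that covers a given character before passing it via the font parameter, fontconfig's fc-list can filter by charset. The following is a sketch under assumptions: fontconfig is installed, U+017F (ſ) is the character of interest, and font_param is a hypothetical helper that only formats the JSON fragment quoted above.

```shell
#!/usr/bin/env bash
# List a few font files that claim to cover U+017F (LATIN SMALL LETTER LONG S).
# Skipped if fontconfig is not installed.
if command -v fc-list >/dev/null 2>&1; then
    fc-list ':charset=17f' file | sort | head -n 5
fi

# Hypothetical helper: print the "font" JSON fragment for a readable file,
# or fail so that a typo in the path is caught before the conversion runs.
font_param() {
    local path="$1"
    if [ ! -r "$path" ]; then
        echo "font file not readable: $path" >&2
        return 1
    fi
    printf '"font":"%s"\n' "$path"
}
```

For example, font_param /usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf prints the fragment only if that file exists on the machine.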

JKamlah commented 4 years ago

I think you could be right about the polygons. I will test it with a transformation.

bertsky commented 4 years ago

> So at least there is a bug at the textline level.

The reason is simply that this has since been renamed from T to L!
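Since the line level is now selected with L rather than T, an invocation written for the old name yields no text layer. A small guard like the following could catch that. It is a sketch only: normalize_level and ptp_params are hypothetical names, and whether outlines was renamed in the same way is an assumption not confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the legacy textline code "T" to "L";
# every other value is passed through unchanged.
normalize_level() {
    case "$1" in
        T) echo "L" ;;
        *) echo "$1" ;;
    esac
}

# Hypothetical helper: build the JSON for -p from an outlines level ($1)
# and a text-source level ($2). Applying the T -> L mapping to outlines
# as well is an assumption.
ptp_params() {
    printf '{"outlines": "%s", "text-source": "%s"}\n' \
        "$(normalize_level "$1")" "$(normalize_level "$2")"
}

# Example (paths as used earlier in this thread):
# ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml \
#     -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p "$(ptp_params T T)"
```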

JKamlah commented 4 years ago

Thanks: 4933e4c07dae85680b963af97295a0df45a323b9

bertsky commented 4 years ago

> I think you could be right about the polygons. I will test it with a transformation.

Should not be the reason: the converter appears to use polygons itself.

bertsky commented 4 years ago

@JKamlah can you please document the JDK version required (with a pointer to prima-page-to-pdf, in case that should change in the future)?