Closed: bertsky closed this issue 4 years ago
The PRImA developers don't use the current version of itextpdf; I guess the problem could stem from that. Sorry, I am not really deep into the specifics here. Maybe someone else can help.
Okay, so switching to OpenJDK 8 helped make that warning go away.
But I still get no text layer! (My source file group had TextEquiv at the word, line and region level.)
How is this supposed to work?
Did you set the "-text-source" parameter correctly? https://github.com/JKamlah/ocrd_pagetopdf/blob/master/ocrd-tool.json#L38
Maybe I should set it to a default value like "T".
Here's what I did:
ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "T", "text-source": "T"}'
where assets is our GT test repo.
My bad.. It works with: ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "R", "text-source": "R"}'
The data seems to have some negative coordinate values; with the added script it works: ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "T", "text-source": "R", "negative2zero":true}'
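For readers unfamiliar with the idea behind negative2zero, here is a minimal sketch, assuming the option simply clamps negative coordinates to zero as its name suggests. The function name clamp_points is hypothetical and this is not the actual ocrd_pagetopdf script; it only illustrates the idea on a PAGE-XML style points string ("x1,y1 x2,y2 ...").

```python
def clamp_points(points: str) -> str:
    """Clamp negative x/y values in a PAGE-XML style points string to zero.

    Hypothetical helper for illustration only; not taken from ocrd_pagetopdf.
    """
    clamped = []
    for pair in points.split():
        # Each pair has the form "x,y" with integer pixel coordinates.
        x, y = (int(value) for value in pair.split(","))
        clamped.append(f"{max(x, 0)},{max(y, 0)}")
    return " ".join(clamped)


print(clamp_points("-3,10 120,-1 120,45 -3,45"))
# prints: 0,10 120,0 120,45 0,45
```

Clamping keeps the number and order of polygon points unchanged, so only coordinates that fall outside the image are moved.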
No, negative2zero makes no difference. But I do get text when I set text-source to something other than T (both word and region level work, even without negative2zero). So at least there is a bug at the textline level.
Also, I don't see outlines other than on the word level. Perhaps because the other levels have non-rectangular polygons?
Another problem seems to be that letters like ſ are lost.
The lost-letter problem should be solved by using another font: "font":"/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"
Indeed, thanks!
I think you could be right with the polygons. I will test it with a transformation..
So at least there is a bug at the textline level.
The reason is simply that the textline value has since been renamed from T to L!
Thanks: 4933e4c07dae85680b963af97295a0df45a323b9
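To make the rename concrete, here is a small sketch of how a wrapper script could build the -p argument while translating the old textline value. It assumes the T to L rename applies to both "outlines" and "text-source" (the thread only confirms it for the textline level in general); build_params and RENAMED_LEVELS are hypothetical names, not part of ocrd_pagetopdf.

```python
import json
import shlex

# Old value -> new value for the textline level, as reported in this thread.
# Assumption: the same rename applies to both parameters.
RENAMED_LEVELS = {"T": "L"}


def build_params(outlines: str, text_source: str) -> str:
    """Return the JSON parameter string, mapping the renamed textline value."""
    params = {
        "outlines": RENAMED_LEVELS.get(outlines, outlines),
        "text-source": RENAMED_LEVELS.get(text_source, text_source),
    }
    return json.dumps(params)


# shlex.quote makes the JSON safe to paste into a shell command line.
print("-p " + shlex.quote(build_params("T", "T")))
# prints: -p '{"outlines": "L", "text-source": "L"}'
```

Values other than T (for example R) pass through unchanged.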
I think you could be right with the polygons. I will test it with a transformation..
Should not be the reason: the converter appears to use polygons itself.
@JKamlah can you please document the JDK version required (with a pointer to prima-page-to-pdf, in case that should change in the future)?
After doing make install on an Ubuntu 18.04 system with OpenJDK 11.0.5 and running on an example workflow, I get:
And the PDFs generated have no text layer or visual annotation.
Is there a specific Java runtime required?