PRImA-Research-Lab / prima-page-viewer

Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR.
Apache License 2.0

show Glyphs in ALTO 4 #7

Closed bertsky closed 4 years ago

bertsky commented 4 years ago

It's great that PageViewer already supports ALTO v4. But it seems that Glyph elements are not displayed yet (as they are for PAGE). Is it planned to add that anytime soon?

(I would like to help, but I cannot even find where ALTO gets imported. Is this actually in prima-core-libs or prima-page-converter?)

chris1010010 commented 4 years ago

I'll have a look when I have time. It's in the core libs (PrimaDla): org.primaresearch.dla.page.io.xml.sax.SaxPageHandler_Alto_2_1 (it handles ALTO 2.1 upwards).
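For readers unfamiliar with SAX-based import, here is a minimal sketch of how Glyph elements could be picked up while parsing ALTO. It is not the PrimaDla code: the class AltoGlyphSketch, the Glyph holder, and readGlyphs are hypothetical names made up for illustration. It assumes the ALTO 4 attribute names CONTENT, HPOS, VPOS, WIDTH, and HEIGHT on Glyph, and treats missing coordinates as NaN.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the SaxPageHandler_Alto_2_1 implementation.
final class AltoGlyphSketch {

    /** One ALTO Glyph: its text content and bounding box. */
    static final class Glyph {
        final String content;
        final double hpos, vpos, width, height;

        Glyph(String content, double hpos, double vpos, double width, double height) {
            this.content = content;
            this.hpos = hpos;
            this.vpos = vpos;
            this.width = width;
            this.height = height;
        }
    }

    /** SAX handler that records every Glyph element it encounters. */
    static final class GlyphHandler extends DefaultHandler {
        final List<Glyph> glyphs = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // With a namespace-aware parser, localName is the tag name without prefix.
            if ("Glyph".equals(localName)) {
                glyphs.add(new Glyph(
                        atts.getValue("CONTENT"),
                        number(atts, "HPOS"),
                        number(atts, "VPOS"),
                        number(atts, "WIDTH"),
                        number(atts, "HEIGHT")));
            }
        }

        // Assumption: an absent coordinate attribute is reported as NaN.
        private static double number(Attributes atts, String name) {
            String value = atts.getValue(name);
            return value == null ? Double.NaN : Double.parseDouble(value);
        }
    }

    /** Parses an ALTO document given as a string and returns its glyphs in document order. */
    static List<Glyph> readGlyphs(String altoXml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            GlyphHandler handler = new GlyphHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(altoXml)), handler);
            return handler.glyphs;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalArgumentException("Could not parse ALTO XML", e);
        }
    }
}
```

A real handler would additionally need to attach each glyph to its parent String (word) element and convert the coordinates according to the file's MeasurementUnit; this sketch skips both.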

bertsky commented 4 years ago

Thanks!

SaxPageHandler_Alto_2_1 looks very promising and I'd like to try extending it, but I have trouble getting all the PRImA projects to build in the first place. I even got as far as manually importing the various libraries and repos into Eclipse (as existing projects, sometimes removing fixed paths such as those for GWT, or as new Java projects where no .project was present). But alas, they give me tons of error messages when I try to build. Without instructions or documentation, this is just too much effort for me.

chris1010010 commented 4 years ago

Sorry about that, I thought building would be easier. I think I'll remove the GWT stuff soon anyway. Hope that will improve things.

chris1010010 commented 4 years ago

I made an update; have a look and see if it works for you (I don't have proper examples for ALTO with glyphs)

bertsky commented 4 years ago

It works – perfectly! Thanks!

> (I don't have proper examples for ALTO with glyphs)

The above-mentioned PR will add that functionality to Tesseract. (Currently the call is tesseract -l eng -c document_title=input.tif input.tif input.alto alto, which produces an input.alto.xml file.)

mrocr commented 4 years ago

@chris1010010 why didn't you merge the update into master? I see it in the release.

bertsky commented 4 years ago

@mrocr It's a change in external library code only (PrimaDla)!