Possible small bug in `SaxPageHandler_Hocr.java`

PRImA-Research-Lab / prima-core-libs

Core libraries by the PRImA Research Lab

Apache License 2.0

Possible small bug in `SaxPageHandler_Hocr.java` #17

Open sadra-barikbin opened 8 months ago

sadra-barikbin commented 8 months ago

Hi there! I was using UB-Mannheim's ocr-fileformat to convert hOCR to PAGE XML. Internally it uses PRImA.PageConverter to do the job, which in turn depends on this Java library. It successfully converted the file, except that it put an extra quotation mark next to the image file name in the PAGE XML output, causing a mismatch between the XML file and the image. (Note the extra quotation mark below.)

I delved into the problem, and it turned out that when there is no separator in the file name, the hOCR handler extracts it incorrectly. https://github.com/PRImA-Research-Lab/prima-core-libs/blob/1bdcc5720d8805b431196c373bcc36633776dfee/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Hocr.java#L311-L325

To fix the issue, line 319 in the snippet above should become

            image = part.substring(part.indexOf(" \"")+2);

because `part.indexOf(" \"")` returns the index of the space character, not of the `"` that follows it.
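The offset arithmetic can be checked in isolation. The following is only a minimal sketch, not the handler's actual code: the class name, the two helper methods, and the sample `part` string are hypothetical, and the input assumes the closing quotation mark has already been stripped elsewhere.

```java
public class HocrImageOffsetDemo {

    // Hypothetical helper: an offset of +1 skips only the space,
    // so the opening quotation mark stays in the result.
    static String skipSpaceOnly(String part) {
        return part.substring(part.indexOf(" \"") + 1);
    }

    // Offset of +2, as proposed in this issue: skips the space
    // and the opening quotation mark.
    static String skipSpaceAndQuote(String part) {
        return part.substring(part.indexOf(" \"") + 2);
    }

    public static void main(String[] args) {
        // Assumed sample input (not taken from a real hOCR file).
        String part = "image \"page.png";

        System.out.println(part.indexOf(" \""));      // 5, the index of the space
        System.out.println(skipSpaceOnly(part));      // "page.png (leading quote kept)
        System.out.println(skipSpaceAndQuote(part));  // page.png
    }
}
```

`indexOf` returns the position where the two-character match starts, so two characters have to be skipped to land on the first character of the file name.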

@stweil @bertsky

stweil commented 7 months ago

That sounds very reasonable. @sadra-barikbin, do you want to provide a pull request with your fix for prima-core-libs?

stweil commented 7 months ago

@chris1010010, is this repository still maintained?

chris1010010 commented 7 months ago

Currently I don't have time, unfortunately. C.

stweil commented 7 months ago

@sadra-barikbin, I applied your fix in https://github.com/UB-Mannheim/prima-core-libs/releases/tag/1.5.02 with commit https://github.com/UB-Mannheim/prima-core-libs/commit/3868892a5947f27a09cc50375f7fe56fb98d5592. Based on that, I created https://github.com/UB-Mannheim/prima-page-converter/releases/tag/1.5.06 with a fixed JPageConverter.

Our latest ocr-fileformat now uses that fixed JPageConverter.

bertsky commented 7 months ago

@stweil :+1: on shipping your own build.

You may also want to consider merging #16 into your prima-core-libs build and making another release. It addresses the frequent problem that parsing fails for some reason, but that reason is not shown to the user.

stweil commented 7 months ago

Good idea. Maybe I can do that tomorrow.

stweil commented 7 months ago

Done, see new release 1.5.03. Please note that I have not tested it yet. As soon as testing is done, new releases for dependent repositories can be made.

bertsky commented 7 months ago

For testing, the simplest approach would be to create examples by "breaking" files drawn from some public GT, for instance with the two causes I described. But you can throw in other errors for good measure (like an empty ReadingOrder, an empty Unicode, a conflicting segment @id, a conflicting TextEquiv/@index, or even plain schema invalidities or XML invalidities).
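As a rough illustration of "breaking" a file, the sketch below damages a tiny XML string and reports the parser's error message. It is only an assumption of how such a test could start: the class name, the helper, and the heavily simplified sample document are hypothetical, and it uses the JDK's standard XML parser rather than the prima-core-libs reader.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class BrokenXmlDemo {

    // Returns null if the text is well-formed XML,
    // otherwise the error message reported by the parser.
    static String wellFormednessError(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, heavily simplified stand-in for a ground-truth file.
        String good = "<Page><TextRegion id=\"r1\"/></Page>";
        // "Break" it by dropping the closing tag (an XML invalidity).
        String broken = good.replace("</Page>", "");

        System.out.println(wellFormednessError(good));
        System.out.println(wellFormednessError(broken));
    }
}
```

The same pattern could be pointed at the library's own reader to check that the reason for a failure actually reaches the user.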