Open sadra-barikbin opened 8 months ago
That sounds very reasonable. @sadra-barikbin, do you want to provide a pull request with your fix for prima-core-libs?
@chris1010010, is this repository still maintained?
Currently I don't have time, unfortunately. C.
@sadra-barikbin, I applied your fix in https://github.com/UB-Mannheim/prima-core-libs/releases/tag/1.5.02 with commit https://github.com/UB-Mannheim/prima-core-libs/commit/3868892a5947f27a09cc50375f7fe56fb98d5592. Based on that, I created https://github.com/UB-Mannheim/prima-page-converter/releases/tag/1.5.06 with a fixed JPageConverter.
Our latest ocr-fileformat now uses that fixed JPageConverter.
@stweil :+1: on shipping your own build.
You may also want to consider merging #16 into your prima-core-libs build and making another release. It addresses the frequent problem that parsing fails for some reason, but that reason does not get shown to the user.
Good idea. Maybe I can do that tomorrow.
Done, see new release 1.5.03. Please note that I have not tested it yet. As soon as testing is done, new releases for dependent repositories can be made.
For testing, the simplest would be to create examples by "breaking" files drawn from some public GT, like the two causes I described. But you can throw in other errors for good measure (like empty ReadingOrder, or empty Unicode, or conflicting segment @id, or conflicting TextEquiv/@index), or even plain schema invalidities or even XML invalidities.
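The "breaking" approach described above could be sketched roughly as follows. This is only an illustration: the class and method names (BrokenPageSamples, breakPage, isWellFormed) are hypothetical and not part of prima-core-libs, and the regex-based editing assumes un-prefixed element names and simple id attributes, which will not hold for every PAGE XML file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;

/** Hypothetical helper that derives deliberately broken test files from a valid PAGE XML string. */
final class BrokenPageSamples {

    /** Returns named variants of the input, each with one kind of error injected. */
    static Map<String, String> breakPage(String xml) {
        Map<String, String> variants = new LinkedHashMap<>();
        // Empty the text content of every Unicode element.
        variants.put("empty-unicode",
                xml.replaceAll("(?s)<Unicode>.*?</Unicode>", "<Unicode></Unicode>"));
        // Force every id attribute to the same value, so segment ids conflict.
        variants.put("conflicting-id",
                xml.replaceAll(" id=\"[^\"]*\"", " id=\"r1\""));
        // Cut the document in half, so it is no longer well-formed XML.
        variants.put("not-well-formed", xml.substring(0, xml.length() / 2));
        return variants;
    }

    /** Checks well-formedness only (no schema validation) with the JDK's DOM parser. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Each variant could then be written to disk and fed to the converter to check that the failure reason actually reaches the user.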
Hi there! I was using UB-Mannheim's ocr-fileformat to convert hOCR to PAGE XML. Internally it uses PRImA.PageConverter to do the job, which itself depends on this Java library. It successfully converted the file, except that it put an extra quotation mark next to the image file name in the PAGE XML output, causing a mismatch between the XML file and the image. (Note the extra quotation mark below.)
I delved into the problem, and it turned out that when there's no separator in the file name, the hOCR handler extracts it incorrectly. https://github.com/PRImA-Research-Lab/prima-core-libs/blob/1bdcc5720d8805b431196c373bcc36633776dfee/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Hocr.java#L311-L325
Line 319 above should be changed to fix the issue, because part.indexOf(" \"") returns the index of the space character, not of the ".
@stweil @bertsky
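To illustrate the indexOf behaviour in isolation, here is a minimal sketch. It is not the code from SaxPageHandler_Hocr.java: the class HocrImageName, both methods, and the sample string are hypothetical, and the sketch assumes the stray quotation mark ends up in front of the file name.

```java
/** Hypothetical illustration of the off-by-one described above; not the library's code. */
final class HocrImageName {

    /** Treats the match position of ` "` as if it pointed at the quotation mark. */
    static String extractNaive(String part) {
        int start = part.indexOf(" \"");        // index of the SPACE, not of the quote
        int end = part.lastIndexOf('"');
        return part.substring(start + 1, end);  // skips the space only, so the quote stays
    }

    /** Skips both characters of the ` "` match before taking the file name. */
    static String extractFixed(String part) {
        int start = part.indexOf(" \"") + 2;    // step past the space and the opening quote
        int end = part.lastIndexOf('"');
        return part.substring(start, end);
    }

    public static void main(String[] args) {
        String part = "image \"page_0001.png\"";       // assumed hOCR title fragment
        System.out.println(part.indexOf(" \""));       // 5, the position of the space
        System.out.println(extractNaive(part));        // "page_0001.png (leading quote kept)
        System.out.println(extractFixed(part));        // page_0001.png
    }
}
```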