ponchofiesta closed this issue 5 years ago
The code is OK, I guess. I just looked at your XML, and it seems that the part between lines 5824 and 5826 causes the error.
OK, but AFAIK this is valid. A space character is a character too. Shouldn't this be converted to SP in ALTO? Or is the problem that this is the only char in this line/paragraph? I think the library should catch this then(?)
EDIT: OK, XML ignores white space characters. Maybe there is a way to enforce white-space chars.
EDIT: I don't think this is the problem. There are several chars that contain only a whitespace character. They are all converted to CONTENT=" ". But when a single whitespace char is the only content of a line and paragraph, it crashes.
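As a side check on the whitespace question, here is a minimal sketch using only the JDK's DOM parser. It assumes a made-up element name (`char`) and says nothing about how the converter itself reads the file; it only suggests that a plain, non-validating parse keeps a whitespace-only text node, so the parser alone is unlikely to drop the space.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustration only: the class and element names are hypothetical.
final class WhitespaceDemo {
    // Parses the given XML string and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The brackets make the preserved space visible in the output.
        System.out.println("[" + rootText("<char> </char>") + "]");
    }
}
```

If the space survives parsing like this, the crash is more likely to come from how a line containing only that space is handled later on.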
Hm, what I can do is ignore it. That would be a quick fix. Something like this (I would make it a bit prettier, because I'm not sure what happens if there are multiple spaces but no other content):
int contentSize = altoLine.getStringAndSP().size();
if (contentSize == 0 || (contentSize <= 1 && altoLine.getStringAndSP().get(0) instanceof SP)) {
    textBlock.getTextLine().remove(altoLine);
}
That would leave an empty TextBlock. To get rid of this I would also add an additional "if":
if (!paragraphBlock.getTextLine().isEmpty()) {
    composedBlock.getContent().add(paragraphBlock);
}
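Putting the two checks together, a self-contained sketch might look like the following. The classes here are hypothetical stand-ins for the generated ALTO types; only the accessor names `getStringAndSP()` and `getTextLine()` come from the snippets above. It treats a line as removable when it holds nothing but SP elements, which would also cover the "multiple spaces but no other content" case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the generated ALTO classes, for illustration only.
final class AltoSketch {
    static final class SP {}

    static final class Word {
        final String content;
        Word(String content) { this.content = content; }
    }

    static final class TextLine {
        private final List<Object> stringAndSP = new ArrayList<>();
        List<Object> getStringAndSP() { return stringAndSP; }
    }

    static final class TextBlock {
        private final List<TextLine> textLine = new ArrayList<>();
        List<TextLine> getTextLine() { return textLine; }
    }

    // True if the line is empty or holds nothing but SP elements,
    // which also covers several spaces in a row with no other content.
    static boolean isWhitespaceOnly(TextLine line) {
        return line.getStringAndSP().stream().allMatch(item -> item instanceof SP);
    }

    // Drops whitespace-only lines; removeIf avoids removing elements
    // from the list while looping over it by hand.
    static void pruneWhitespaceLines(TextBlock block) {
        block.getTextLine().removeIf(AltoSketch::isWhitespaceOnly);
    }

    // Adds the block to the target only if it still contains a line.
    static boolean addIfNotEmpty(List<TextBlock> target, TextBlock block) {
        if (block.getTextLine().isEmpty()) {
            return false;
        }
        return target.add(block);
    }
}
```

Using `removeIf` rather than calling `remove` inside a loop is a design choice of this sketch, not something the thread confirms about the library's actual fix.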
For me that would be fine, and I can easily add it. And I don't think you really lose any information. What do you think?
In my code it would be OK. My ALTO reader skips SP too. In my workflow the OCR data is written to PDFs with the scanned images, so I don't need the whitespace. Not sure whether anyone else would need that data. But I don't want to modify our "raw" XMLs. If it throws exceptions, nobody needs it :)
I hope not :). Should be fixed now.
I'll test it on Monday. Thank you!
I tested it with one problematic file and it worked.
As this exception is thrown in a Java XML class, I assume it's hard to fix, but maybe it is caused by your library. When trying to convert this file, a NullPointerException is thrown.
Using OpenJDK-8 on Ubuntu 16.04
This is the code I used to convert the file: