Handling of Invalid sofa indexes

hatzel commented 3 years ago

Is your feature request related to a problem? Please describe.

I had to load data with invalid sofa indexes; don't ask me how they got in there. They are comically out of bounds (in the hundreds of thousands when the document length is in the tens of thousands).

Describe the solution you'd like

I fixed this for myself with a hack that completely discards the indexes in such a case: https://github.com/hatzel/dkpro-cassis/commit/8765c42d5bc9fe6520b27475fbda2c6d34746816

Doesn't feel great, let me know if you want to take something like this on board. Otherwise feel free to close this issue.

Let me know how/if you would like to handle this, I can provide a minimal example and potentially a better fix if you are interested.

Describe alternatives you've considered

You could just emit a warning, but I am unsure whether that would really be a great solution.

jcklie commented 3 years ago

Thank you for the report. Do you have lots of Unicode in your documents? Is it possible to create a minimal example of what does not work? How did you create the documents in the first place? I fear that the index mapping code is not 100% reliable and want to understand the error. I would add an optional flag to ignore index errors; the problem then is just that writing back will not work.
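To illustrate why Unicode matters here, below is a minimal sketch of an offset mapping with an optional lenient mode. It assumes that offsets in the XMI count UTF-16 code units (as Java-based UIMA does), while Python strings are indexed by code point. The names `build_utf16_to_codepoint_map`, `map_offset`, and the `lenient` flag are hypothetical and are not the cassis API; the actual mapping code in cassis may work differently.

```python
import warnings
from typing import Dict, Optional


def build_utf16_to_codepoint_map(text: str) -> Dict[int, int]:
    """Map UTF-16 code unit offsets (at character boundaries) to Python indexes."""
    mapping: Dict[int, int] = {}
    utf16_pos = 0
    for index, char in enumerate(text):
        mapping[utf16_pos] = index
        # Characters outside the Basic Multilingual Plane need two code units.
        utf16_pos += 2 if ord(char) > 0xFFFF else 1
    # The end-of-text offset is valid as well, since an annotation may end there.
    mapping[utf16_pos] = len(text)
    return mapping


def map_offset(mapping: Dict[int, int], offset: int, lenient: bool = False) -> Optional[int]:
    """Look up an offset; in lenient mode, warn and return None instead of raising."""
    try:
        return mapping[offset]
    except KeyError:
        if not lenient:
            raise
        warnings.warn(f"Offset {offset} is not a valid position in the sofa string")
        return None


text = "a\U0001F600b"  # the emoji occupies two UTF-16 code units
offsets = build_utf16_to_codepoint_map(text)
```

With a dictionary-based mapping like this, any offset past the end of the text (or in the middle of a surrogate pair) surfaces as a `KeyError`, which is the kind of error reported in this issue. In lenient mode the annotation would be loaded without usable offsets, which is why writing it back would not round-trip.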

hatzel commented 3 years ago

Alright, I suspect this may, at least in part, be an error in the documents. I tried to minimize the examples a bit but didn't really get anywhere. Since the files are open source, I'll just link them here; maybe you can make sense of it.

The typesystem can be found here but I had to extend it manually with a few missing annotations:

    import cassis

    # typesystem_path points to the typesystem XML mentioned above
    with open(typesystem_path, "rb") as f:
        typesystem = cassis.load_typesystem(f)
        # We have to add some types that apparently are not in the XML
        ep = typesystem.create_type("de.uniwue.mk.kall.Erzaehlpassage")
        sa = typesystem.create_type("de.uniwue.mk.kall.Sprechakt")
        typesystem.add_feature(type_=sa, name="Aufbau", rangeTypeName="uima.cas.String")
        typesystem.create_type("de.uniwue.mk.kall.SprechaktText")
        dialogue = typesystem.create_type("de.uniwue.mk.kall.Dialog")
        typesystem.add_feature(type_=dialogue, name="Sprechakte", rangeTypeName="uima.cas.Integer")

I looked at these two errors specifically:

KeyError: 447683, from the tag <type:AlreadyHandled xmi:id="15560" sofa="222088" begin="447683" end="447688"/>. This one is way out of bounds relative to the sofa length.
KeyError: 18369, from the tag <type:temp5 xmi:id="161223" sofa="1" begin="17917" end="18369"/>. If I am not mistaken, this one is out of bounds by only a single byte, so it may be the more interesting example.

These two files also gave me errors: Arnim,-Bettina-von__Die Günderode.xmi.xmi.xmi and Lewald,-Fanny__Jenny.xml.xmi.xmi.txt.xmi.xmi.xmi.

This may all be down to the files being malformed, I don't know. The logic for determining byte offsets does seem straightforward enough.
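One way to check whether the files themselves are malformed, independent of cassis, is to scan the raw XMI for annotations whose offsets do not fit their sofa. The following is a rough sketch using only the standard library. It assumes the usual UIMA XMI layout (a `cas:Sofa` element carrying `sofaString`, and annotations referencing it through a `sofa` attribute holding the sofa's `xmi:id`) and that offsets count UTF-16 code units. The helper names and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

XMI_ID = "{http://www.omg.org/XMI}id"
CAS_SOFA = "{http:///uima/cas.ecore}Sofa"


def utf16_length(text: str) -> int:
    """Length of the text in UTF-16 code units."""
    return len(text.encode("utf-16-le")) // 2


def find_out_of_bounds(xmi_text: str) -> List[Tuple[str, str]]:
    """Return (xmi:id, reason) for annotations whose offsets do not fit their sofa."""
    root = ET.fromstring(xmi_text)
    # Collect the length of every sofa, keyed by its xmi:id.
    sofa_lengths = {
        sofa.get(XMI_ID): utf16_length(sofa.get("sofaString", ""))
        for sofa in root.iter(CAS_SOFA)
    }
    problems = []
    for elem in root:
        sofa_id, begin, end = elem.get("sofa"), elem.get("begin"), elem.get("end")
        if sofa_id is None or begin is None or end is None:
            continue  # not an annotation with offsets
        length = sofa_lengths.get(sofa_id)
        if length is None:
            problems.append((elem.get(XMI_ID), f"unknown sofa {sofa_id}"))
        elif not 0 <= int(begin) <= int(end) <= length:
            problems.append((elem.get(XMI_ID), f"{begin}-{end} exceeds length {length}"))
    return problems


# Made-up sample document, not taken from the linked files.
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore"
    xmlns:type="http:///example/type.ecore" xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
      mimeType="text" sofaString="Hello world"/>
  <type:Ok xmi:id="10" sofa="1" begin="0" end="5"/>
  <type:TooLong xmi:id="11" sofa="1" begin="6" end="12"/>
  <type:Orphan xmi:id="12" sofa="99" begin="0" end="1"/>
</xmi:XMI>"""

problems = find_out_of_bounds(SAMPLE)
```

Running something like this over the failing files would show whether a reported annotation points past the end of its sofa or references a sofa id that is not present in the document at all.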

reckart commented 3 years ago

What needs to be done here?

dkpro / dkpro-cassis

Handling of Invalid sofa indexes #162