dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0
84 stars 22 forks source link

Problem with Name String Array primitive array type during deserialization #257

Closed simonmeoni closed 2 years ago

simonmeoni commented 2 years ago

I had a problem when I parse XMI and typesystem containing String Array primitive type. The parser gave me the following error :

self = <cassis.xmi.CasXmiDeserializer object at 0x7ff109435430>
type_ = Type(name=uima.cas.StringArray), value = ''

    def _parse_primitive_array(self, type_: Type, value: Union[str, List[str]]) -> List:
        """Primitive collections are serialized as white space separated primitive values"""

        if value is None:
            return None

        # TODO: Use type name global variable here instead of hardcoded string literal
        elements = value.split() if isinstance(value, str) else value

        type_name = type_.name
        if type_name in [TYPE_NAME_FLOAT_ARRAY, TYPE_NAME_DOUBLE_ARRAY]:
            return [float(e) for e in elements] if value else []
        elif type_name in [TYPE_NAME_INTEGER_ARRAY, TYPE_NAME_SHORT_ARRAY, TYPE_NAME_LONG_ARRAY]:
            return [int(e) for e in elements] if value else []
        elif type_name == TYPE_NAME_BOOLEAN_ARRAY:
            return [self._parse_bool(e) for e in elements] if value else []
        elif type_name == TYPE_NAME_BYTE_ARRAY:
            return list(bytearray.fromhex(value)) if value else []
        else:
>           raise ValueError(f"Not a primitive collection type: {type_name}")
E           ValueError: Not a primitive collection type: uima.cas.StringArray

../../../../miniconda/envs/fhir-lighter/lib/python3.9/site-packages/cassis/xmi.py:445: ValueError

I modified the code to accept and parse String Array type as the all Primitive Array type inside the method. Is it the right place to get able to process String Array type, or is it against the UIMA CAS format philosophy (I'm not an expert)? Thanks in advance 🤗

reckart commented 2 years ago

Could you please provide a minimal sample XMI file that cassis fails to parse? Could you maybe add a unit test for the case that fails to your PR?

reckart commented 2 years ago

Also, which version of cassis did you encounter the problem in?

simonmeoni commented 2 years ago

Also, which version of cassis did you encounter the problem in?

I tested on these both versions : dkpro-cassis-0.7.1 and dkpro-cassis-0.8.0.dev0. Is there another version that could be probably working for my case?

simonmeoni commented 2 years ago

Could you please provide a minimal sample XMI file that cassis fails to parse? Could you maybe add a unit test for the case that fails to your PR?

I provide you an example who didn't work for me without my change. It concerns the typesystem.xml above all (I suppose)

reckart commented 2 years ago

I have not looked at it in detail yet, but if I remember correctly, the place where you added the string array handling is about inlined arrays in the form <XXX arrayFeature="true false true true"/> where the array values are space separated. In that code, we only see a single XML element and its attributes. Now, for a string array this kind of inlining is not possible because a string can contain spaces. So for string arrays, the form

<XXX>
  <arrayFeature>value 1</arrayFeature>
  <arrayFeature>value 2</arrayFeature>
  <arrayFeature>value 3</arrayFeature>
</XXX>

must be used - which is done in a slightly different part of the code where we walk the XML element tree.

simonmeoni commented 2 years ago

The typesystem I provided you was generated by Inception. When I keep the string array element inside the typesystem, the parser crashes even if my xmi file doesn't have any string array annotation. If I remove it, the parser works. Is it a problem from Inception who generates an unreadable typesystem for cassis?

reckart commented 2 years ago

So the problem here seems to be that Assertion is a StringArray feature but in the XMI file it appears as an attribute with an empty value:

<custom:Span xmi:id="298" sofa="1" begin="13" end="19" Assertion="" label="LABEL1"/>

... another of the XMI oddities ... I need to check how UIMA treats this in the Java implementation to decide how to treat it in cassis.

simonmeoni commented 2 years ago

I tried to use a previous version of cassis. And version 0.5.3 seems to work with string array. The only difference is type checking :

Maybe it lacks some additional checking for primitive arrays?

reckart commented 2 years ago

We don't have a case of an emtpy string array in our test suite so far. So they need to be updated to include such a case.

As for reading: your change works, but IMHO it goes to far. It works because it splits an empty attribute value and then creates an empty array from that. The only thing we should support there is if the attribute is empty, then create an empty string. If the attribute is not empty, then we have a bug because that case should be handled elsewhere.

And then we need to check if writing empty string arrays using cassis works properly...

reckart commented 2 years ago

We have an issue for this now: Empty string arrays are not supported #258