dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0
85 stars 22 forks source link

__eq__ on Type objects causes RecursionError #321

Open Tahnan opened 3 weeks ago

Tahnan commented 3 weeks ago

Description

If I have two different Types with at least some similarities (e.g. the same name), checking whether type_1 == type_2 leads to a RecursionError. To the best of my understanding:

  1. To see if the types are equal, it needs to check whether their supertypes are equal. To check that requires checking their supertypes, until eventually it hits uima.cas.TOP.
  2. The two Type(name=uima.cas.TOP) objects don't have supertypes, so there's no further recursion there. However, __eq__ also has to check that their children are equal, and their children are dictionaries from names to types.
  3. Unfortunately, to check whether those are equal, it has to check whether the values in the dictionary are equal, and since those are Type objects, it has to return to Step One above...

Of course, I'm working by a little bit of guesswork, so don't take that as definitive.

To Reproduce

Simply run the following commands:

one = cassis.load_dkpro_core_typesystem()
two = cassis.load_dkpro_core_typesystem()
one.get_type('uima.cas.Boolean') == two.get_type('uima.cas.Boolean')

and behold: RecursionError!

one.get_type('uima.cas.Boolean') == two.get_type('uima.cas.Float') returns False immediately, presumably because the names are different. Oddly, one.get_type('uima.cas.Boolean') == one.get_type('uima.cas.Boolean') returns True immediately, without a recursion depth error (does Python skip __eq__ in a == b if a is b? I don't think it does, so I'd expect this to have the same error...?).

Expected behavior Obviously, not a recursion error. Since the __eq__ method is generated by @attr.s, as far as I can tell, the only way around it might be to write a custom method, and I'm not sure what it should check. Is it enough to ask if they have the same .name and .typesystem? In this case I was trying to tell if they had the same features (and in fact they do not, as it happens), and it looks like Feature has its own non-@attr-defined equality method, so maybe that should be part of it too? Honestly I'm not sure; I don't spend much time working with two different typesystems at once.

Technical info

reckart commented 3 weeks ago

If the two type instances come from the same type system, it should be sufficient to simply compare their names. Only they from from two different typesystems, a more thorough test might be required to check if the definitions are compatible.

In which case do you compare two types to each other?

Tahnan commented 3 weeks ago

If the two type instances come from the same type system, it should be sufficient to simply compare their names.

So it's interesting, because if you look at the example I gave to recreate it, both typesystems are the result of cassis.load_dkpro_core_typesystem(). So they're different objects (one is two is of course False), but they have identical contents. Nevertheless, the RecursionError triggers.

In this case, I was actually comparing two different typesystems: one was the system you get from the load command, and the other was the typesystem on a document downloaded from INCEpTION in UIMA CAS JSON format. The typesystems in this case are definitely not equivalent, but I was pretty sure one was a subset of the other.


There's a related...not bug, but feature request, maybe. It's not important enough to me to write it up as an actual feature request, because as I said I'm only very rarely trying to work with two different typesystems. But for posterity, in case anyone else needs to know, here's what happened:

I downloaded a document from INCEpTION and tried to load it with cassis.load_document_from_json(doc, typesystem=my_typesystem) (where my_typesystem is cassis.load_dkpro_core_typesystem() plus some INCEpTION layers like custom.Span and so forth). This produced an error, even with merge_typesystems=True, because the conflict was not some types being missing. The problem was that both typesystems--the one in the document, and my_typesystem--had the type de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.morph.MorphologicalFeatures, but the feature number was defined differently on them.

OK, that's a reasonable error to get! If they define Features differently, then yes, trying to merge them should fail.

Except...the difference wasn't at all functional. Both of them had rangeTypeName = uima.cas.String. The only difference between the two features was that the one defined in this repo in dkpro-core-types.xml has a description on the feature ("Singular/plural"), and the one defined in the UIMA CAS document I was trying to load had no description. In fact, as far as I could tell, that was true all the way through it: the types had the same name, they had the same features on them, and all of those features were identical except that in one system they had descriptions and in the other they didn't.

As a programmer, I can totally understand "these differ, and I'm going to raise an Exception rather than try to guess why". As a user, it was so obvious how to reconcile them ("the description does not matter to me! just...leave it off everything!") that I really wanted the merge_typesystems option to figure it out. I can imagine a feature request that would do this (maybe a "merge_strategy" option that behaves as it currently does if specified as "strict", but specifying it as "ignore_descriptions"....?). But, as I said, I don't personally need it enough to request it, and I'm sure you've got higher priorities.

reckart commented 3 weeks ago

The recursion error is definitely a bug. I'm just asking to get a better idea of what an appropriate fix would be. Considering what you said, I think it should be ok to code eq method such that it returns True if both types have the same name and come from the same typesystem and otherwise return False.

I would say that merging fails if the difference is in the description of a feature, would also seem like a bug. Of course the description could be different and one could argue that if the descriptions differ, the semantics of the feature may actually be different. At least the case where one descriptions is empty and the other not should not trigger a failure. But probably even if both descriptions are set to different values, it should not fail.

Would you like to contribute a test/fix for the eq issue?