dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Wrong entities are removed when iterating through CAS #259

Status: Closed (giuliabaldini closed this issue 1 year ago)

giuliabaldini commented 1 year ago

Describe the bug

Hey there, thank you for the great package! I am currently using it for my annotations, and I have noticed unexpected behavior when trying to remove multiple entities: the wrong entities are removed.

To Reproduce

Steps to reproduce the behavior:

import cassis

typesystem = cassis.TypeSystem()
ner_type = typesystem.create_type(
    name="NamedEntity", supertypeName="uima.tcas.Annotation"
)
typesystem.create_feature(
    domainType=ner_type, name="source", rangeType=cassis.typesystem.TYPE_NAME_STRING
)

cas = cassis.Cas(typesystem)
for i in range(100):
    if i % 2:
        cas.add(ner_type(source="spacy"))
    else:
        cas.add(ner_type(source="user"))

print("Possible values annotation.source", set(entity.source for entity in cas.select("NamedEntity")))
print(
    "Number of annotations where source is not user",
    sum(1 for entity in cas.select("NamedEntity") if entity.source != "user"),
)
found_entities = 0
for entity in cas.select("NamedEntity"):
    if entity.source != "user":
        found_entities += 1
        cas.remove(entity)
print("Found and removed", found_entities, "entities")
print("Possible values annotation.source", set(entity.source for entity in cas.select("NamedEntity")))
print(
    "Number of annotations where source is not user after removal",
    sum(1 for entity in cas.select("NamedEntity") if entity.source != "user"),
)

Output:

Possible values annotation.source {'spacy', 'user'}
Number of annotations where source is not user 50
Found and removed 50 entities
Possible values annotation.source {'spacy', 'user'}
Number of annotations where source is not user after removal 25

Expected behavior

I would expect the loop to remove every entity whose source is not user. Instead, 50 entities are removed but 25 entities with a different source remain, so annotations whose source is user must also have been removed.

Thank you very much in advance! Best, Giulia

reckart commented 1 year ago

I think the problem may be that cas.select("NamedEntity") returns a "live view" of the data, so when you remove stuff from the CAS, that loop gets confused. I think you can write something like list(cas.select("NamedEntity")) to avoid the problem.
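For illustration, the general pitfall described here can be reproduced with a plain Python list. This is only a sketch of the "modify while iterating" problem in general; it does not use cassis, and the variable names `live` and `snapshot_source` are made up for the example.

```python
# Removing from a list while iterating over that same list skips elements,
# because the iterator's position no longer lines up with the shifted items.
live = list(range(6))
for n in live:
    live.remove(n)

# Iterating over a copy (a snapshot) while removing from the original
# visits every element exactly once.
snapshot_source = list(range(6))
for n in list(snapshot_source):
    snapshot_source.remove(n)

print("mutated while iterating:", live)
print("iterated over a snapshot:", snapshot_source)
```

In the first loop every second element survives; in the second loop the list ends up empty.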

giuliabaldini commented 1 year ago

> I think the problem may be that cas.select("NamedEntity") returns a "live view" of the data, so when you remove stuff from the CAS, that loop gets confused. I think you can write something like list(cas.select("NamedEntity")) to avoid the problem.

Hey, thank you very much for your answer!

cas.select("NamedEntity") already returns a list; I checked that with print(type(cas.select("NamedEntity"))). I have also run it as you suggested, but the result is the same.

I have also tried doing the following:

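# Collect the entities to delete first, then remove them in a second pass,
# so the CAS is not modified while it is being iterated.
found_entities = 0
to_remove = []
for entity in list(cas.select("NamedEntity")):
    if entity.source != "user":
        found_entities += 1
        to_remove.append(entity)
for e in to_remove:
    cas.remove(e)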

which also results in the same output.

reckart commented 1 year ago

Found the bug. Annotations were considered equal when their positions were equal, which caused the undesired behavior. In fact, I wonder why any annotations remained in the index. That said, I think I have a fix, and a PR is already open.
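To make the effect of position-only equality concrete, here is a simplified model in plain Python. It is an assumption-laden sketch, not cassis's actual code: `PositionEqualSpan` is a hypothetical class, and the index is modeled as an ordinary list, whereas the real cassis index may be implemented differently. As in the reproduction script, no offsets are set, so all annotations share the same position.

```python
from dataclasses import dataclass, field


@dataclass
class PositionEqualSpan:
    """Hypothetical annotation whose equality only compares begin and end."""

    begin: int
    end: int
    source: str = field(compare=False)  # ignored by ==, like in the bug


# 100 annotations at the same position, alternating "user" and "spacy".
index = [PositionEqualSpan(0, 0, "spacy" if i % 2 else "user") for i in range(100)]

removed = 0
for span in list(index):  # iterate over a snapshot, as suggested earlier
    if span.source != "user":
        # list.remove deletes the FIRST element that compares equal to span.
        # Since every element is "equal", that is usually a different object.
        index.remove(span)
        removed += 1

remaining_sources = [span.source for span in index]
print("removed:", removed)
print("user left:", remaining_sources.count("user"))
print("spacy left:", remaining_sources.count("spacy"))
```

In this model, 50 removals leave 25 "user" and 25 "spacy" elements, even though the loop iterates over a snapshot. The culprit is the equality check used for removal, not the iteration itself.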

reckart commented 1 year ago

@giuliabaldini you can try running your code against the main branch

giuliabaldini commented 1 year ago

@reckart thank you very much, that was super fast! From a quick test, everything seems to be all right. I'll get back to you if I run into any other problems. Have a nice weekend!