dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Unexpected behavior .select_covered( ... ) #151

Closed. ArneDefauw closed this issue 3 years ago

ArneDefauw commented 3 years ago

Describe the bug

The cas.select_covered( ... , .. ) method does not return all covered elements in some situations.

To Reproduce

Small example to reproduce the behavior, using small_typesystem.xml and small_cas.xmi from https://github.com/dkpro/dkpro-cassis/tree/master/tests/test_files:

from cassis import load_typesystem, load_cas_from_xmi

with open('small_typesystem.xml', 'rb') as f: 
    typesystem = load_typesystem(f)

with open('small_cas.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

Token = typesystem.get_type('cassis.Token')
Sentence = typesystem.get_type('cassis.Sentence')
cas.add_annotation(Token(begin=0, end=10, id='11', pos='NNP') )

list( cas.select( 'cassis.Sentence' ) ) is equal to:

[cassis_Sentence(xmiID=14, id='0', begin=0, end=26, type='cassis.Sentence'),
 cassis_Sentence(xmiID=15, id='1', begin=27, end=47, type='cassis.Sentence')]

and

list( cas.select( 'cassis.Token') ) is:

[cassis_Token(xmiID=3, id='0', pos='NNP', begin=0, end=3, type='cassis.Token'),
 cassis_Token(xmiID=16, id='11', pos='NNP', begin=0, end=10, type='cassis.Token'),
 cassis_Token(xmiID=4, id='1', pos='VBD', begin=4, end=10, type='cassis.Token'),
 cassis_Token(xmiID=5, id='2', pos='IN', begin=11, end=14, type='cassis.Token'),
 cassis_Token(xmiID=6, id='3', pos='DT', begin=15, end=18, type='cassis.Token'),
 cassis_Token(xmiID=7, id='4', pos='NN', begin=19, end=24, type='cassis.Token'),
 cassis_Token(xmiID=8, id='5', pos='.', begin=25, end=26, type='cassis.Token'),
 cassis_Token(xmiID=9, id='6', pos='DT', begin=27, end=30, type='cassis.Token'),
 cassis_Token(xmiID=10, id='7', pos='NN', begin=31, end=36, type='cassis.Token'),
 cassis_Token(xmiID=11, id='8', pos='VBD', begin=37, end=40, type='cassis.Token'),
 cassis_Token(xmiID=12, id='9', pos='JJ', begin=41, end=45, type='cassis.Token'),
 cassis_Token(xmiID=13, id='10', pos='.', begin=46, end=47, type='cassis.Token')]

while

list( cas.select_covered('cassis.Token', list( cas.select( 'cassis.Sentence' ))[0] ) ) is:

[cassis_Token(xmiID=16, id='11', pos='NNP', begin=0, end=10, type='cassis.Token'),
 cassis_Token(xmiID=4, id='1', pos='VBD', begin=4, end=10, type='cassis.Token'),
 cassis_Token(xmiID=5, id='2', pos='IN', begin=11, end=14, type='cassis.Token'),
 cassis_Token(xmiID=6, id='3', pos='DT', begin=15, end=18, type='cassis.Token'),
 cassis_Token(xmiID=7, id='4', pos='NN', begin=19, end=24, type='cassis.Token'),
 cassis_Token(xmiID=8, id='5', pos='.', begin=25, end=26, type='cassis.Token')]

Expected behavior

list( cas.select_covered('cassis.Token', list( cas.select( 'cassis.Sentence' ))[0] ) ) should also contain

Token(begin=0, end=3, id='0', pos='NNP')

This problem only seems to occur if the begin index of overlapping Tokens (here the Tokens with id=11 and id=0) coincides with the begin index of a Sentence.

I was annotating multi-word expressions, where such a situation (overlapping (multi-)Tokens) is not uncommon.
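The expected result can be illustrated without cassis. The following is a minimal sketch of the "covered" relation as the report understands it: a token counts as covered when it lies fully inside the sentence span. The `Span` class and `covered_by` function are hypothetical names introduced for illustration only; this is not how cassis implements `select_covered`, and the offsets are copied from the output above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation; not a cassis class."""
    id: str
    begin: int
    end: int


def covered_by(spans: Iterable[Span], cover: Span) -> List[Span]:
    """Return every span that lies fully inside the covering span."""
    return [s for s in spans if cover.begin <= s.begin and s.end <= cover.end]


# First sentence and a subset of the tokens from the report.
sentence = Span(id="0", begin=0, end=26)
tokens = [
    Span("0", 0, 3),    # the token missing from the reported result
    Span("11", 0, 10),  # the added multi-word token, same begin as the sentence
    Span("1", 4, 10),
    Span("2", 11, 14),
    Span("3", 15, 18),
    Span("4", 19, 24),
    Span("5", 25, 26),
    Span("6", 27, 30),  # belongs to the second sentence
]

covered_ids = [t.id for t in covered_by(tokens, sentence)]
print(covered_ids)
```

Under this definition both tokens that start at offset 0 are covered by the first sentence, which is the behavior the report expects.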

reckart commented 3 years ago

Just to mention: I am working over at the Apache UIMA Java SDK on a test suite for the select API that we have there (part of that work is in this PR). I think it would also be very helpful for cassis to have such a suite.

Basically, what I do in the test suite is:

jcklie commented 3 years ago

@ArneDefauw Does that happen in master or in the last release? I changed it a bit over the weekend, so I wonder whether that change is a fix or the reason for bad things happening now.

ArneDefauw commented 3 years ago

It happens both in the latest release (0.4.0) and in 0.3.0.

jcklie commented 3 years ago

Then I will check later whether it is still an issue in master. Thanks for reporting! See also #144

ArneDefauw commented 3 years ago

I checked, and #144 fixes the issue. Thanks!

LaurentBie commented 3 years ago

It would be nice to publish a new release on PyPI with bug #144 fixed. I'm working on some packages that have dkpro-cassis as a dependency and, as far as I know, a Python package cannot declare a dependency on a GitHub repository.

jcklie commented 3 years ago

It is possible: you can also put that into your requirements.txt or setup.py, e.g.

https://stackoverflow.com/questions/32688688/how-to-write-setup-py-to-include-a-git-repo-as-a-dependency
https://adamj.eu/tech/2019/03/11/pip-install-from-a-git-repository/

I will release this weekend, though.
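As a rough sketch of what such a declaration can look like, the snippet below builds a PEP 508 direct reference of the form `name @ git+<url>@<ref>`, which pip can install from a Git repository. The helper `git_requirement` is a hypothetical name used only for this illustration, and the branch name is the one mentioned in this thread, which may no longer exist.

```python
def git_requirement(name: str, repo_url: str, ref: str) -> str:
    """Build a PEP 508 direct reference: '<name> @ git+<repo_url>@<ref>'."""
    return f"{name} @ git+{repo_url}@{ref}"


# Assumption: the branch from this thread; replace it with a tag or commit as needed.
requirement = git_requirement(
    "dkpro-cassis",
    "https://github.com/dkpro/dkpro-cassis.git",
    "bugfix/144-overlapping-select-covered",
)
print(requirement)

# The resulting string could be listed as a line in requirements.txt
# or as an entry of install_requires=[...] in setup.py.
```

As far as I know, PyPI does not accept uploads of packages that declare direct URL dependencies, so this approach fits private or internal packages better than ones meant for publication.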

ArneDefauw commented 3 years ago

pip install -e git://github.com/dkpro/dkpro-cassis.git@bugfix/144-overlapping-select-covered#egg=dkpro-cassis worked for me

LaurentBie commented 3 years ago

Ok thanks for the link.

On Tue, 24 Nov 2020 at 11:12, ArneD notifications@github.com wrote:

It would be nice to publish a new release (in PYPI ) with the bug #144 https://github.com/dkpro/dkpro-cassis/issues/144 corrected. I'm working on some packages that have dkpro cassis as a dependency and as far as I know, it's not possible in a python package to declare dependency from Github.

pip install -e git://github.com/dkpro/dkpro-cassis.git@bugfix/144-overlapping-select-covered#egg=dkpro-cassis worked for me
