dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0
85 stars 22 forks source link

Possible inconsistency in test fixtures #297

Closed zesch closed 9 months ago

zesch commented 9 months ago

The tokens() and sentences() fixtures internally create a Cas but it is never returned. Thus the sofa_string is not set. Some test that rely on get_covered_text then individually set the sofa_string, but this could introduce inconsistencies.

Suggested improvement: create a fixture that directly returns a Cas and consistently sets the sofa_string to the actual tokens.

reckart commented 9 months ago

I removed two instances where a CAS was redundantly created and then not used in tests that used these fixtures.

But there seem to be more instances where these fixtures are used, e.g.:

def test_select(small_typesystem_xml, tokens, sentences):
    ts = load_typesystem(small_typesystem_xml)
    cas = Cas(typesystem=ts)
    cas.add_all(tokens + sentences)

That is a bit of a quirky case because it means that the tokens/sentences that were created on some CAS originally are now transplanted to a new one (in this case an empty one that does not even contain any text) and that makes them a bit of zombie annotations. For testing, it seems ok though and there can be cases where such a transplantation can make sense.

Any suggestion what to do here - if anything?

zesch commented 9 months ago

How about not creating the Cas in the test, but as part of the fixture? That way there is only one place where the annotations are bound to the Cas, so less room for using this wrong. Along those lines:

@pytest.fixture
def cas_simple(typesystem_xml, tokens, sentences):
    ts = load_typesystem(typesystem_xml)
    cas = Cas(ts)
    cas.sofa_string = CAS_TEXT

    cas.add_all(tokens)
    cas.add_all(sentences)

    return cas
reckart commented 9 months ago

If I understand your example correctly, then this is a new cas_simple fixture which gets injected with some pre-defined tokens and sentences which are then added to the CAS. But since the tokens and sentences do not know the CAS text, how could they have sensible offsets?

I think this would only work if the tokens/sentences would not be injected but rather generated using tokenizer/sentence splitter on the CAS text.

zesch commented 9 months ago

I wanted to stay as closely to the existing fixture structure, but you are right, this does not solve the underlying issue.

There seems to be some confusion about the relationship between annotations and Cas in the fixtures, e.g. here https://github.com/dkpro/dkpro-cassis/blob/49a66ee108b66a8f139ec4a5d2207b6884a96a60/tests/fixtures.py#L441-L465 AFAICT, adding the annotations to the Cas does not do anything, as only the tokens are returned later. (the same for sentences fixture)

reckart commented 9 months ago

Adding annotations to the CAS sets the sofa property inside the annotation. That is used in particular by annotation.get_covered_text() to retrieve the annotated text. If an annotation has not been added to a CAS, that method is defunct.

zesch commented 9 months ago

ok I see. That makes sense. Thanks for correcting my understanding :)