JohnSnowLabs / langtest

Deliver safe & effective language models
http://langtest.org/
Apache License 2.0

clarify inconsistency in model handler predictions #119

Closed alierenak closed 1 year ago

alierenak commented 1 year ago

There is an inconsistency between the model handler outputs. I think we should define a standard set of kwargs and a standard output format so that all handlers can be used in the same way.

        # spaCy returns grouped entities only if group_entities is True
        if kwargs.get("group_entities"):
            return NEROutput(
                predictions=[
                    NERPrediction.from_span(
                        entity=ent.label_,
                        word=ent.text,
                        start=ent.start_char,
                        end=ent.end_char
                    ) for ent in doc.ents
                ]
            )

# With group_entities=True this returns, for example:
# Apple -> ORG
# San Fransisco -> LOC
# otherwise it returns None
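
For comparison, here is a minimal sketch of what a token-level IOB view of the same spaCy doc could look like, using token.ent_iob_ and token.ent_type_; the pipeline name is just an example and the exact labels depend on the loaded model:

import spacy

# Sketch only: token-level IOB tags from a spaCy doc, i.e. roughly what the
# handler could return when group_entities is not set.
nlp = spacy.load("en_core_web_sm")  # example pipeline, not necessarily what langtest loads
doc = nlp("Apple is a technology company founded in San Fransisco")

iob_tags = [
    f"{token.ent_iob_}-{token.ent_type_}" if token.ent_type_ else "O"
    for token in doc
]
print(list(zip([token.text for token in doc], iob_tags)))
# e.g. [('Apple', 'B-ORG'), ('is', 'O'), ..., ('San', 'B-GPE'), ('Fransisco', 'I-GPE')]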

There is no such option in SparkNLP, and transformers returns grouped entities by default. The problem really comes from the underlying libraries, because spaCy pipelines return their outputs in chunk format.

On the other hand, the transformers pipeline returns predictions after tokenization, so the output tokens are not the same as the input tokens:

# transformers_pipeline here is the handler's Hugging Face NER pipeline (no entity grouping)
prediction = transformers_pipeline('Apple is a technology company founded in San Fransisco')
print(" ".join([pred.get('entity_group', pred.get('entity', None)) for pred in prediction]))
# B-ORG B-LOC I-LOC I-LOC I-LOC
print(" ".join([pred.get('word') for pred in prediction]))
# Apple San Fr ##ans ##isco
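
As a point of comparison (not what the handler currently does), the Hugging Face pipeline can merge those word pieces itself when aggregation_strategy is set; the checkpoint below is only an example:

from transformers import pipeline

# Sketch: letting transformers group subword pieces back into whole words.
# "dslim/bert-base-NER" is just an example checkpoint, not what langtest ships with.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

prediction = ner("Apple is a technology company founded in San Fransisco")
print([(pred["word"], pred["entity_group"]) for pred in prediction])
# roughly: [('Apple', 'ORG'), ('San Fransisco', 'LOC')]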

All three model handlers should return the same output format and take the same kwargs, I guess. I think all of them could return IOB format, and there could be a helper function to convert IOB to chunks. Once we have the same output format, we can use the same functions to manipulate the outputs.
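
A minimal sketch of such a helper, assuming the handler can give us parallel lists of tokens and IOB tags (the function name and signature are hypothetical, not existing langtest API):

def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (label, chunk_text) pairs.

    tokens: list of token strings
    tags:   list of IOB tags like 'B-ORG', 'I-ORG', 'O'
    """
    chunks, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and current_label != tag[2:]):
            # close the previous chunk and open a new one
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # an 'O' tag closes any open chunk
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks


print(iob_to_chunks(
    ["Apple", "is", "founded", "in", "San", "Fransisco"],
    ["B-ORG", "O", "O", "O", "B-LOC", "I-LOC"],
))
# [('ORG', 'Apple'), ('LOC', 'San Fransisco')]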

What do you think @JulesBelveze @ArshaanNazir?

alierenak commented 1 year ago

Mentioning @luca-martial so we can decide when to handle this issue.