There is an inconsistency between the model handler outputs. I think we should agree on a standard set of kwargs inputs and a standard output format, so that every model handler can be treated the same way.
# spaCy handler: entities are returned only if group_entities is True
if kwargs.get("group_entities"):
    return NEROutput(
        predictions=[
            NERPrediction.from_span(
                entity=ent.label_,
                word=ent.text,
                start=ent.start_char,
                end=ent.end_char
            ) for ent in doc.ents
        ]
    )
# With group_entities=True this returns:
# Apple -> ORG
# San Fransisco -> LOC
# Otherwise it returns None
There is no such option in SparkNLP, and transformers returns grouped entities by default. The problem really comes from the libraries' own results: spaCy pipelines return their output in chunk format.
On the other hand, the transformers pipeline returns its predictions after tokenization, so the output tokens are not the same as the input tokens:
prediction = transformers_pipeline('Apple is a technology company founded in San Fransisco')
print(" ".join([pred.get('entity_group', pred.get('entity', None)) for pred in prediction]))
# B-ORG B-LOC I-LOC I-LOC I-LOC
print(" ".join([pred.get('word') for pred in prediction]))
# Apple San Fr ##ans ##isco
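To get from the sub-word pieces back to something closer to the input words, one option is to glue the `##` continuation pieces onto the previous word. This is only a rough sketch: `merge_wordpieces` is a hypothetical helper, not an existing function in any of the libraries, and it assumes WordPiece-style `##` prefixes and that a merged word keeps the label of its first piece. It also only sees the tokens the pipeline returned, so words without an entity are still missing.

```python
from typing import Dict, List


def merge_wordpieces(predictions: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Hypothetical helper: merge '##' continuation pieces into the previous word."""
    merged: List[Dict[str, str]] = []
    for pred in predictions:
        word = pred["word"]
        # Same fallback as in the snippet above: grouped or ungrouped label key.
        label = pred.get("entity_group", pred.get("entity"))
        if word.startswith("##") and merged:
            # Continuation piece: append to the previous word, keep its label.
            merged[-1]["word"] += word[2:]
        else:
            merged.append({"word": word, "entity": label})
    return merged


# Hand-written sample mirroring the output shown above (not a real pipeline call).
sample = [
    {"word": "Apple", "entity": "B-ORG"},
    {"word": "San", "entity": "B-LOC"},
    {"word": "Fr", "entity": "I-LOC"},
    {"word": "##ans", "entity": "I-LOC"},
    {"word": "##isco", "entity": "I-LOC"},
]
print(" ".join(p["word"] for p in merge_wordpieces(sample)))
# Apple San Fransisco
```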
All three model handlers should return the same output format and accept the same kwargs, I think. My suggestion is that all of them return IOB format, with a helper function to convert IOB to chunks. Once the output format is the same, we can use the same functions to manipulate the output of any handler.
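The IOB-to-chunk helper could look roughly like the sketch below. `iob_to_chunks` is a hypothetical name, and the sketch assumes word-level tokens with one `B-`/`I-`/`O` tag each; a stray `I-` tag without a preceding `B-` is treated as the start of a new chunk.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical helper: group IOB-tagged tokens into (text, label) chunks."""
    chunks: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk being built, if there is one.
        nonlocal current_words, current_label
        if current_words:
            chunks.append((" ".join(current_words), current_label))
        current_words, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, label = tag.partition("-")
        # A B- tag or a label change always starts a new chunk.
        if prefix == "B" or label != current_label:
            flush()
            current_label = label
        current_words.append(token)
    flush()
    return chunks


tokens = "Apple is a technology company founded in San Fransisco".split()
tags = ["B-ORG", "O", "O", "O", "O", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_chunks(tokens, tags))
# [('Apple', 'ORG'), ('San Fransisco', 'LOC')]
```

With something like this, the spaCy-style chunk output becomes a view over the common IOB output instead of a separate code path.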
What do you think, @JulesBelveze @ArshaanNazir?