gandersen101 / spaczz

Fuzzy matching and more functionality for spaCy.
MIT License

Handling the same token in different categories #45

Closed brunobg closed 3 years ago

brunobg commented 3 years ago

This is not a bug report, but a question. If I have the same token with two different labels, how will spaczz handle it? I ask because spaCy seems to pick the label unpredictably: https://github.com/explosion/spaCy/discussions/6752

Questions:

a) Is it possible to get both matches somehow? Sometimes I'm interested in getting a list of all matches of a LABEL, and in other cases only the "best ones", for some definition of BEST :)

b) If I can't get both, is it possible to get a callback so I can decide myself what to do?

gandersen101 commented 3 years ago

Hey @brunobg, putting me through the paces! Lol. I am going to focus on your other issue #41 first but then we can play around with this.

The SpaczzRuler more or less copies spaCy's EntityRuler, just building onto it where needed, so I believe its label choices will be similarly unpredictable.

However, as I showed in #41, the FuzzyMatcher (along with the RegexMatcher and the new TokenMatcher) will return all viable matches, sorted by start, then length. Using callbacks with these matches or incorporating them into your own spaCy-compatible pipeline component is an option for now.
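
As a minimal sketch of working with such a match list, the plain-Python snippet below filters matches by label and groups labels by span. It assumes each match is a `(label, start, end, ratio)` tuple; that shape, the sample data, and the helper names `matches_for_label` and `labels_by_span` are illustrative assumptions, so check the tuple layout against your spaczz version.

```python
from collections import defaultdict

# Hypothetical matcher output: (label, start, end, ratio) tuples.
# The tuple shape is an assumption; check it against your spaczz version.
matches = [
    ("CITY", 0, 1, 100),
    ("TEAM", 0, 1, 100),
    ("CITY", 4, 6, 92),
]


def matches_for_label(matches, label):
    """Return every match that carries the given label."""
    return [m for m in matches if m[0] == label]


def labels_by_span(matches):
    """Map each (start, end) span to the list of labels that matched it."""
    grouped = defaultdict(list)
    for label, start, end, _ratio in matches:
        grouped[(start, end)].append(label)
    return dict(grouped)


city_matches = matches_for_label(matches, "CITY")
span_labels = labels_by_span(matches)
```

Here `span_labels` keeps both `CITY` and `TEAM` for the span `(0, 1)`, so the application can decide later which one counts as "best".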

As some of my bandwidth gets freed up I will think about whether there is potential to add this to spaczz itself.

gandersen101 commented 3 years ago

Hi @brunobg. As I mentioned and showed, the Matchers (fuzzy, regex, and token) will all return every available match, even if one match belongs to multiple categories. However, as with spaCy's EntityRuler, there is no easy way to provide configurable logic for this in the SpaczzRuler. The matches go through at least one set and sort operation (descending length, ascending start position, then, for spaczz, descending match quality). If an entity belongs to multiple labels, there is no predictable way to know which one will be added.
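
To illustrate why the label is unpredictable, here is a rough sketch of that set-then-sort step in plain Python. This is not spaczz's actual code: the `(label, start, end, ratio)` tuple shape, the sample data, and both function names are assumptions made for illustration. Two matches on the same span with the same ratio tie on every sort key, so their relative order falls back to set iteration order, which can vary between runs. Adding an explicit label rank as a last key is one way to make the order deterministic.

```python
# Hypothetical (label, start, end, ratio) tuples; two labels share one span.
matches = {
    ("TEAM", 0, 1, 100),
    ("CITY", 0, 1, 100),
    ("CITY", 0, 3, 90),
}


def ruler_like_sort(matches):
    """Sort by descending length, ascending start, then descending ratio."""
    return sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1], -m[3]))


def deterministic_sort(matches, label_priority):
    """Same ordering, with an explicit label rank as the final tie-breaker."""
    return sorted(
        matches,
        key=lambda m: (
            -(m[2] - m[1]),            # longer spans first
            m[1],                      # then earlier start
            -m[3],                     # then higher match ratio
            label_priority.index(m[0]),  # then the caller's label ranking
        ),
    )


# With ruler_like_sort, the order of the two tied (0, 1) matches is not fixed.
ordered = deterministic_sort(matches, ["CITY", "TEAM"])
```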

I can try to help you write a callback for your specific use-case if needed, but it is hard to decide generically how to handle those situations when the logic probably varies from use-case to use-case.

brunobg commented 3 years ago

I understand. I imagine the only way to deal with this would be to handle the categories separately. If I create two SpaczzRulers, could I get the results separately? This way the application could process the data however it sees fit. This is actually a problem with spaCy as well, and the only way to handle it is by changing the ruler to store data in the _ metadata. Other than this question I think you can close the issue.

gandersen101 commented 3 years ago

Hmm, I'm not sure adding another SpaczzRuler to the pipeline would help, because spaCy tokens/spans can only belong to a single entity. If the first ruler adds entities of one label to the doc, when the second ruler tries to add the same entities (even if they have a different label) they will be ignored (or overwritten with the new label if overwriting is enabled). Like you mentioned, this is a spaCy design decision, and spaczz is just adhering to it because it ultimately uses spaCy underneath.

Two SpaczzRulers (or EntityRulers) would only give you separate results if they were in two separate pipelines and you were creating two Doc objects from each text source. This is completely viable but it means dealing with multiple doc objects for each text source and potentially figuring out how to merge their entities according to whatever logic you need.

There may be options I haven't thought of but below are the handful I can think of to try to address this issue:

  1. Use the ent_id field as a second label. However, this limits you to two labels max and means you forgo using ent_ids for anything else.
  2. Write a callback for whichever matcher you are using to decide which label you want to assign based on whatever logic matters to you. If you want to retain additional labels these would have to go in the _ metadata.
  3. Create a custom pipeline component that accomplishes this. This is essentially the same as the option above but is wrapping the matcher in a pipeline component instead of writing a callback.
  4. As mentioned above, process your text with multiple pipelines (one for each labeling scheme). You'll end up with multiple docs and then you can decide how to pick/merge their labels together after.

I could try to help with 2, 3, or the doc merging aspect of 4. Maybe they have a place in spaczz, but I would need more specifics about the use-case you are dealing with that requires multiple labels.
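
As a rough sketch of the merging step in option 4, the snippet below combines entity lists taken from two separate pipelines. It works on plain `(start, end, label)` tuples rather than spaCy objects; the sample data, the `merge_entities` helper, and the `extra_labels` key are all hypothetical. The first list wins as the primary label, and any other label found on exactly the same span is kept as an extra, which is the information that would otherwise go in the _ metadata.

```python
# Hypothetical entity lists, one per pipeline, as (start, end, label) tuples.
ents_a = [(0, 1, "CITY"), (4, 6, "CITY")]
ents_b = [(0, 1, "TEAM"), (8, 9, "TEAM")]


def merge_entities(primary, secondary):
    """Keep the first label seen per span; record other labels as extras."""
    merged = {}
    for start, end, label in primary + secondary:
        entry = merged.setdefault(
            (start, end), {"label": label, "extra_labels": []}
        )
        if label != entry["label"] and label not in entry["extra_labels"]:
            entry["extra_labels"].append(label)
    return merged


merged = merge_entities(ents_a, ents_b)
```

This only merges spans with identical boundaries. Partially overlapping spans would need their own rule (for example, prefer the longer span), which depends on the use-case.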

brunobg commented 3 years ago

I think this is a big problem in spaCy and that you shouldn't try to do better than they do. (2) would be nice to have since it makes the usual way to handle it (separate fields in _) easier, so if that is not complicated to implement, go for it.