flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

[Feature]: Correct misspelled entities #3280

Closed mirix closed 1 year ago

mirix commented 1 year ago

Problem statement

I am working with transcribed text. The general quality of the transcription is excellent but it contains a large number of misspelled entity names that are crucial for me.

For instance, "Swisscode" or "Swiss Gold" instead of the correct "Swissquote", or "Alex Hormital" instead of the correct "ArcelorMittal".

Solution

Before reinventing the wheel, I was wondering whether anyone is aware of an existing NER tool that can correct such mistakes.

Otherwise, any tips on the best approach for implementing such a solution would be greatly appreciated.


helpmefindaname commented 1 year ago

Hi @mirix, I think you'll be better off first fixing the spelling errors and then applying NER. For the spelling step, you can follow any of the many spelling-correction blog posts and tutorials on the internet.

As an alternative, you could look into Named Entity Linking, where you try to map each detected entity to its entry in a database. That way you could just replace the text with the name given in your database.

mirix commented 1 year ago

Yes, I am following the second approach.

  1. NER with flair/ner-english-ontonotes-large (I need the large model in order to correctly identify all the organisations, otherwise some are recognised as persons).

  2. Compute distances to entries in a database. Right now I am considering a combination of (Damerau-)Levenshtein distance and something phonetic like the Match Rating algorithm.

  3. If certain criteria are met, replace.
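Steps 2 and 3 could be sketched roughly as follows, using only the edit-distance part. This is a minimal illustration, not the code actually used in this thread: the function names, the normalisation by the longer string, and the 0.6 threshold are all assumptions, and the distance is the "optimal string alignment" variant of Damerau-Levenshtein. The mentions are assumed to come from the NER step.

```python
from typing import Iterable, Optional


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


def best_match(mention: str, known_names: Iterable[str],
               threshold: float = 0.6) -> Optional[str]:
    """Return the closest database name, or None if nothing is close enough.

    The threshold is a hypothetical value that would need tuning on real data.
    """
    scored = [(similarity(mention, name), name) for name in known_names]
    if not scored:
        return None
    score, name = max(scored)
    return name if score >= threshold else None


if __name__ == "__main__":
    database = ["Swissquote", "ArcelorMittal"]  # illustrative entries only
    for mention in ["Swisscode", "Swiss Gold", "Alex Hormital"]:
        print(mention, "->", best_match(mention, database))
```

Returning None when no candidate clears the threshold keeps the original text untouched, which is probably the safer default for entities that are simply missing from the database.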

This seems to be working well but I have only a few examples.

mirix commented 1 year ago

It seems to work fairly well. Interestingly, the edit-distance approach alone (Levenshtein) did not work well enough, and neither did the phonetic approach alone (Double Metaphone). I had to combine both (with a slightly lower weight for the edit distance) and also add extra penalties for spaces in certain cases.
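For readers who want to try something similar, here is one way such a combined score might look. It is only a sketch under several assumptions: a simplified Soundex stands in for Double Metaphone so that the example needs no third-party package, the 0.45 edit-distance weight and the 0.1 penalty are made-up values, and the space rule (penalise when only one of the two strings contains a space) is a guess, since the exact criteria are not given above.

```python
# Consonant groups used by Soundex; vowels, h, w and y carry no code.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(word: str) -> str:
    """Simplified Soundex code (stand-in for Double Metaphone)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def phonetic_similarity(a: str, b: str) -> float:
    """Fraction of matching positions in the two Soundex codes."""
    code_a, code_b = soundex(a), soundex(b)
    if not code_a or not code_b:
        return 0.0
    return sum(x == y for x, y in zip(code_a, code_b)) / 4


def combined_score(mention: str, candidate: str,
                   edit_weight: float = 0.45,
                   space_penalty: float = 0.1) -> float:
    """Weighted mix of edit and phonetic similarity (hypothetical weights)."""
    a, b = mention.lower(), candidate.lower()
    longest = max(len(a), len(b)) or 1
    edit_sim = 1.0 - levenshtein(a, b) / longest
    score = edit_weight * edit_sim + (1.0 - edit_weight) * phonetic_similarity(a, b)
    # Hypothetical space rule: penalise when only one string contains a space.
    if (" " in a) != (" " in b):
        score -= space_penalty
    return max(score, 0.0)


if __name__ == "__main__":
    for mention in ["Swisscode", "Swiss Gold", "Alex Hormital"]:
        for candidate in ["Swissquote", "ArcelorMittal"]:
            print(f"{mention!r} vs {candidate!r}: {combined_score(mention, candidate):.3f}")
```

Soundex is much coarser than Double Metaphone, so scores from this sketch will not match what a real phonetic library produces; the point is only the shape of the weighting and the penalty.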