Closed Srivats1991 closed 5 years ago
If you want to implement this, I think something like a bloom filter may be a good fit for quickly determining whether a word is a real English word. If there is a way to generate the filter ahead of time and then embed it in the package, that would probably be the best way to do it.
Edit 1: The only catch is that a bloom filter can return false positives (a non-word reported as a word), but it will never return a false negative (a real word reported as missing).
Edit 2: You could use https://github.com/dwyl/english-words/blob/master/words_alpha.txt as the word list to load in.
Edit 3: If you wanted a filter that matches all purely alphabetic words in that list with a false positive rate of 1 in 9994092, you would need these values: https://hur.st/bloomfilter/?n=370103&p=1.0E-7&m=&k=
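To make the suggestion above a bit more concrete, here is a minimal Python sketch of the idea. It is not taken from this package: the names `bloom_parameters`, `false_positive_rate`, `BloomFilter`, and `build_filter` are hypothetical, and deriving the hash positions from a single SHA-256 digest via double hashing is just one assumed choice among many. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. For n = 370103 and p = 1.0E-7 that should come out to roughly 12.4 million bits (around 1.5 MB) and 23 hash functions.

```python
import hashlib
import math


def bloom_parameters(n, p):
    """Return (bits, hash_count) for n items at a target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def false_positive_rate(n, m, k):
    """Approximate false positive rate once n items are stored in m bits with k hashes."""
    return (1 - math.exp(-k * n / m)) ** k


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Assumption: double hashing (h1 + i * h2) built from one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


def build_filter(words, p=1e-7):
    """Build a filter sized for the given words at target false positive rate p."""
    words = list(words)
    m, k = bloom_parameters(max(1, len(words)), p)
    bloom = BloomFilter(m, k)
    for word in words:
        bloom.add(word)
    return bloom
```

Assuming the word list has one word per line, `build_filter(line.strip() for line in open("words_alpha.txt"))` would build the filter, and `"hello" in bloom` would then do the lookup. Because the hash count has to be rounded to a whole number, the rate actually achieved lands slightly off the 1.0E-7 target, which is consistent with the 1 in 9994092 figure above. The resulting `bits` array is what you would serialize and embed in the package.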
Anybody working on this feature? @Srivats1991 @dapryor
Maybe this would interest some NLP folks?
@cyucelen I am not. Not sure about @Srivats1991. Whoever works on it needs to make sure that whatever method is used is fast enough. That is why I was suggesting a bloom filter. I am not very familiar with NLP methods, so I am not sure how fast they would be.
Closing this since nobody is interested.
Given a string, we need to match all valid English words that do not contain numerals. One use case I can think of is that this feature would be useful for matching all misspelled English words. For example: