cyucelen / marker

🖍️ Marker is the easiest way to match and mark strings for colorful terminal outputs!
MIT License

Match invalid English words #21

Closed Srivats1991 closed 5 years ago

Srivats1991 commented 5 years ago

Given a string, we need to match all words that are not valid English words and that do not contain numerals. One use case I can think of is matching all misspelled English words. For example:

"This word has a singel error" Output: singel

"This word is not a word1" No words matched here

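A minimal sketch of the requested behavior, not part of marker itself. The function `misspelledWords`, the helper `isAlpha`, and the tiny in-memory `dict` are hypothetical names used only for illustration; a real implementation would load a full English word list.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isAlpha reports whether w consists only of letters.
func isAlpha(w string) bool {
	for _, r := range w {
		if !unicode.IsLetter(r) {
			return false
		}
	}
	return true
}

// misspelledWords returns the purely alphabetic words in s that are not in dict.
// Tokens containing digits (such as "word1") are skipped, as the issue asks.
// dict is an assumed stand-in for a real word list; its keys must be lowercase.
func misspelledWords(s string, dict map[string]struct{}) []string {
	var out []string
	for _, tok := range strings.Fields(s) {
		// Drop punctuation around the token, e.g. a trailing period or comma.
		word := strings.TrimFunc(tok, unicode.IsPunct)
		if word == "" || !isAlpha(word) {
			continue
		}
		if _, ok := dict[strings.ToLower(word)]; !ok {
			out = append(out, word)
		}
	}
	return out
}

func main() {
	dict := map[string]struct{}{}
	for _, w := range []string{"this", "word", "has", "a", "error", "is", "not"} {
		dict[w] = struct{}{}
	}
	fmt.Println(misspelledWords("This word has a singel error", dict)) // [singel]
	fmt.Println(misspelledWords("This word is not a word1", dict))     // []
}
```

The returned words could then be handed to a matcher so that they get colored in the output.
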
dapryor commented 5 years ago

If you want to implement this, I think something like a Bloom filter may be a good fit for quickly determining whether a word is a real English word. If there is a way to generate the filter ahead of time and then embed it in the package, that would probably be the best way to do it.

Edit 1: The only catch is that false positives are possible, so a misspelled word may occasionally pass as valid, but at least you will never get a false negative: a word that is in the list is never reported as missing.
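To make the false positive / false negative behavior concrete, here is a rough Bloom filter sketch. The type `bloom`, its methods, and the FNV-based double hashing are assumptions chosen for brevity, not anything marker ships; a production filter would likely use a stronger hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size bit set probed by k derived hash positions.
type bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of probes per word
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes derives two base hashes from s using FNV-1a.
// Probe i uses (h1 + i*h2) mod m, a common double-hashing trick.
func hashes(s string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(s))
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // feed one more byte to get a second value
	h2 := h.Sum64() | 1   // force odd so the step is never zero
	return h1, h2
}

// Add sets the k bits that belong to s.
func (b *bloom) Add(s string) {
	h1, h2 := hashes(s)
	for i := uint64(0); i < b.k; i++ {
		idx := (h1 + i*h2) % b.m
		b.bits[idx/64] |= uint64(1) << (idx % 64)
	}
}

// Has returns false only if s was definitely never added.
// A true result can be a false positive.
func (b *bloom) Has(s string) bool {
	h1, h2 := hashes(s)
	for i := uint64(0); i < b.k; i++ {
		idx := (h1 + i*h2) % b.m
		if b.bits[idx/64]&(uint64(1)<<(idx%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 3)
	for _, w := range []string{"this", "word", "has", "a", "single", "error"} {
		f.Add(w)
	}
	fmt.Println(f.Has("single")) // true: added words are always found
	fmt.Println(f.Has("singel")) // usually false, but a false positive is possible
}
```

Every added word sets all of its bits, so a lookup for it can never fail; an unknown word is only accepted when all of its bits happen to have been set by other words.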

Edit 2: You could take this https://github.com/dwyl/english-words/blob/master/words_alpha.txt as the data to load in.

Edit 3: If you wanted a filter to match all purely alphabetic words with a false positive rate of about 1 in 9994092, you would need these values: https://hur.st/bloomfilter/?n=370103&p=1.0E-7&m=&k=
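For reference, the sizing can be reproduced with the standard Bloom filter formulas m = -n·ln(p) / (ln 2)² and k = (m/n)·ln 2. The helper name `bloomParams` is hypothetical, and n = 370103 and p = 1e-7 are taken from the query string of the link above. This is a sketch of the arithmetic, not output copied from the calculator.

```go
package main

import (
	"fmt"
	"math"
)

// bloomParams returns the bit count m and hash count k for n items
// at a target false positive rate p, using the standard formulas:
//   m = -n * ln(p) / (ln 2)^2
//   k = (m / n) * ln 2
func bloomParams(n uint64, p float64) (m, k uint64) {
	mf := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	kf := math.Round(mf / float64(n) * math.Ln2)
	return uint64(mf), uint64(kf)
}

func main() {
	m, k := bloomParams(370103, 1e-7)
	fmt.Printf("bits=%d (about %.2f MiB), hashes=%d\n",
		m, float64(m)/8/1024/1024, k)
}
```

With these inputs the filter comes out at roughly 12.4 million bits (about 1.5 MiB) and 23 hash functions, which is small enough to embed in a package.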

cyucelen commented 5 years ago

Anybody working on this feature? @Srivats1991 @dapryor

Maybe this would interest some NLP folks?

dapryor commented 5 years ago

@cyucelen I am not. Not sure about @Srivats1991. Whoever works on it needs to make sure that whatever method is used is fast enough. That is why I was suggesting a Bloom filter. I am not very familiar with NLP methods, so I am not sure how fast they are.

cyucelen commented 5 years ago

Closing this since nobody is interested.