aarondandy / WeCantSpell.Hunspell

A port of Hunspell v1 for .NET and .NET Standard
https://www.nuget.org/packages/WeCantSpell.Hunspell/

Parsing text for individual words #77

Closed by sparticus1701 1 month ago

sparticus1701 commented 1 year ago

This is more of a question, but I'd like to use this in a project I'm working on. From what I can tell WordList.Check is designed to check single words.

Are there any recommendations for a tool I can use to break sentences up into the words that should be checked? A naive way would be to just use string.Split(), but I'd like to see if there's a tool that can automatically handle numbers, currency, and sentence punctuation. I've been looking at some NLP tools, but I'm wondering if you've used anything in particular.

aarondandy commented 1 year ago

I have not. The reason I made this port was to make a spell checker for Roslyn, but I got bored before that could ever happen. Splitting identifiers into words is much easier than NLP though 😁. I haven't yet done anything with human grammars, so I don't think I can point you in the right direction for that.

funex commented 1 year ago

@sparticus1701 Have a look at closed issue #75; it answers your question:

"The text boundary positions are found according to the rules described in Unicode Standard Annex 29, Text Boundaries, and Unicode Standard Annex 14, Line Breaking Properties. These are available at http://www.unicode.org/reports/tr14/ and http://www.unicode.org/reports/tr29/."
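For anyone who wants something lighter than a full Unicode Annex 29 segmenter, here is a minimal sketch of the idea in C#. It is a regular-expression approximation, not an implementation of the annex rules: it treats a run of letters (optionally joined by internal apostrophes) as a word, so numbers, currency amounts, and sentence punctuation never reach the spell checker. The names `WordExtractor`, `ExtractWords`, and `FindMisspellings` are hypothetical, and the `isCorrect` delegate is an assumed stand-in for whatever single-word check you use, such as `WordList.Check`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of WeCantSpell.Hunspell.
public static class WordExtractor
{
    // A "word" here is one or more letters or combining marks, optionally
    // followed by apostrophe-joined letter runs ("isn't", "o'clock").
    // Digits, currency symbols, and punctuation are never matched.
    private static readonly Regex WordPattern = new Regex(
        @"[\p{L}\p{M}]+(?:['’][\p{L}\p{M}]+)*",
        RegexOptions.Compiled);

    // Yields each candidate word in the order it appears in the text.
    public static IEnumerable<string> ExtractWords(string text)
    {
        foreach (Match match in WordPattern.Matches(text))
        {
            yield return match.Value;
        }
    }

    // Returns the words for which the supplied single-word check fails.
    // isCorrect is an assumption: pass any Func<string, bool>, for
    // example a method group such as wordList.Check.
    public static List<string> FindMisspellings(
        string text, Func<string, bool> isCorrect)
    {
        return ExtractWords(text).Where(word => !isCorrect(word)).ToList();
    }
}
```

With a loaded dictionary, the call would look something like `WordExtractor.FindMisspellings(text, wordList.Check)`. One known gap in this sketch: letters attached to digits (the "st" in "1st") are still extracted as words, so a real tokenizer following the annex rules will behave better on mixed alphanumeric text.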

funex commented 1 year ago

@aarondandy I would close this one.