NaturalNode / natural

general natural language facilities for node
MIT License

Ngrams destroy punctuation #444

Open giorgio79 opened 6 years ago

giorgio79 commented 6 years ago

It would be nice to have an option that preserves punctuation. Example:

console.log(natural.NGrams.bigrams('Some, words here!!'));
[ [ 'Some', 'words' ], [ 'words', 'here' ] ]

I would have liked to see [ [ 'Some,', 'words' ], [ 'words', 'here!!' ] ]

If chaining commands is eventually implemented (see https://github.com/NaturalNode/natural/issues/439), then one could just strip punctuation beforehand, or pass the text to a tokenizer first.

giorgio79 commented 6 years ago

Also, tokenizers already split the text in various ways, so I would just keep the splitting logic in the tokenizers and let the n-gram functions work on whatever tokens they produce.
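
To illustrate the idea of leaving the splitting to a tokenizer, here is a minimal, library-free sketch. The helper names `whitespaceTokens` and `ngrams` are hypothetical and are not part of natural's API; the sketch assumes that "preserving punctuation" simply means splitting on whitespace and leaving everything else attached to the word.

```javascript
// Hypothetical helper: split on runs of whitespace only, so punctuation
// stays attached to the neighbouring word.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// Hypothetical helper: build n-grams from an already-tokenized array.
// It does no splitting of its own, so the tokenizer decides what a token is.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

const bigrams = ngrams(whitespaceTokens('Some, words here!!'), 2);
console.log(bigrams);
// [ [ 'Some,', 'words' ], [ 'words', 'here!!' ] ]
```

If `NGrams.bigrams` accepts an already-tokenized array (the README suggests it does, but that is not verified here), the output of `whitespaceTokens` could probably be passed to it directly instead of the hand-written `ngrams` helper.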