JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Function to remove numbers from corpus? #89

Closed havardl closed 3 years ago

havardl commented 3 years ago

First off, thanks for a great package!

I was wondering if there is something similar to the remove_terms() function that would filter out all numbers? I could of course generate a list of every number between 0 and n and feed that to the function, but I wanted to check whether there is already specific support for this.

JasonKessler commented 3 years ago

There's not an explicit function, but running something along the lines of

import re
corpus_no_numbers = corpus.remove_terms([t for t in corpus.get_terms() if re.match(r'^\d+$', t)])

should do the trick

JasonKessler commented 3 years ago

To close the loop on this: coming up with a good definition of a number is tricky (do ordinals count? written numbers? what about "one" used as a determiner or idiomatically?), so adding a "remove_numbers" method is probably more trouble than it's worth, especially given how concisely any particular approach can be written.
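
For anyone landing here later, here is a rough sketch of how the "what counts as a number" decision could be kept in one small predicate and then handed to remove_terms(). The helper name is_number and its two flags are made up for illustration and are not part of scattertext; only get_terms() and remove_terms() come from the thread above. It runs on a plain list of terms so it can be tried without a corpus.

```python
import re

# Hypothetical patterns, one per possible definition of "number".
DIGITS = re.compile(r'\d+')                  # 2020
DECIMAL = re.compile(r'\d[\d,]*(?:\.\d+)?')  # 3.14 or 1,000
ORDINAL = re.compile(r'\d+(?:st|nd|rd|th)')  # 2nd, 10th


def is_number(term, allow_decimals=True, allow_ordinals=False):
    """Return True if every whitespace-separated token in `term` looks numeric.

    Splitting on whitespace lets the check cover multi-word terms
    such as bigrams ("10 20") as well as single tokens.
    """
    tokens = term.split()
    if not tokens:
        return False

    def token_is_number(tok):
        if DIGITS.fullmatch(tok):
            return True
        if allow_decimals and DECIMAL.fullmatch(tok):
            return True
        if allow_ordinals and ORDINAL.fullmatch(tok):
            return True
        return False

    return all(token_is_number(tok) for tok in tokens)


# Stand-in for corpus.get_terms(); these example terms are invented.
terms = ['2020', '3.14', '1,000', '2nd', 'one', 'covid19', '10 20']
print([t for t in terms if is_number(t)])
# prints ['2020', '3.14', '1,000', '10 20']

# With a real corpus, the same predicate would plug in like this:
# corpus_no_numbers = corpus.remove_terms(
#     [t for t in corpus.get_terms() if is_number(t)])
```

Written-out numbers ("one", "twenty") are deliberately left alone here, since they run straight into the determiner and idiom problem mentioned above. Passing allow_ordinals=True would also drop terms like "2nd".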