davidmogar / cucco

Text normalization library for Python
MIT License

Features added: #44

Open martinarrieta opened 7 years ago

martinarrieta commented 7 years ago

Hello,

Great project. I was trying to build something similar until I found yours. To cover all my needs I had to add a few things. Let me know what you think.

Unfortunately I couldn't write test cases for these changes because I don't have experience with pytest, but I'm adding examples of how they work.

Things that I have added:

  1. A custom_stop_words option for the remove_stop_words method
  2. Added the method "replace_numbers" to remove standalone numbers (digits inside words are kept)
  3. Added the method "replace_custom_regex" to remove every match of a custom regex from the text.
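
For readers who want to try the behaviour without this branch checked out, here is a minimal standalone sketch of what the three additions are meant to do. It does not use cucco and is not the code from this change: the function bodies, the `replacement` parameter, and the required `stop_words` argument are assumptions made for illustration, chosen so the results agree with the console output further down in this comment.

```python
import re

# Hypothetical stand-ins for the three additions. They do not use cucco and
# are not the implementation from this change.

# A run of digits that forms a whole whitespace-delimited token.
_STANDALONE_NUMBER = re.compile(r"(?<!\S)\d+(?!\S)")


def replace_numbers(text, replacement=""):
    """Replace standalone numbers, leaving digits inside words untouched."""
    replaced = _STANDALONE_NUMBER.sub(replacement, text)
    # Collapse the gaps left behind and trim the ends.
    return " ".join(replaced.split())


def replace_custom_regex(text, regex, replacement=""):
    """Replace every match of a compiled pattern; whitespace is left as is."""
    return regex.sub(replacement, text)


def remove_stop_words(text, stop_words):
    """Lowercase the text and drop tokens listed in stop_words."""
    stop_words = {word.lower() for word in stop_words}
    return " ".join(t for t in text.lower().split() if t not in stop_words)


print(replace_numbers("this 3333 i3s a text with the number 2"))
# this i3s a text with the number
print(remove_stop_words("Test to remove stop words", ["test", "to"]))
# remove stop words
```

The lookarounds `(?<!\S)` and `(?!\S)` are what keep `i3s` intact: a digit run only matches when it is bounded by whitespace or by the start or end of the string.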

Sample code to test my changes:

from cucco.config import Config
from cucco.cucco import Cucco

cucco_config = Config(language='en')
c = Cucco(config=cucco_config)

# Remove standalone numbers only; digits embedded in words (like "i3s") are kept.
c.replace_numbers("this 3333 i3s a text with the number 2")

# Remove every match of a custom regex, for example all #foo and @bar
import re
regex = re.compile(r"[#@]\w+", re.IGNORECASE)
c.replace_custom_regex(regex=regex, text="Test a string #foo to replace @bar")

# Removing custom stop words

# Default behavior, using the built-in stop word list:
c.remove_stop_words("Test to remove stop words")

# With a custom set of stop words (in case you want to supply your own):
c.remove_stop_words("Test to remove stop words", custom_stop_words=['test', 'to'])

Sample code with the console output:

In [1]: from cucco.cucco import Cucco
   ...:
   ...: from cucco.config import Config
   ...:
   ...: cucco_config = Config(language='en')
   ...:
   ...: c = Cucco(config=cucco_config)
   ...:

In [3]: # Replace numbers but those that are only numbers, not numbers between letters.
   ...: c.replace_numbers("this 3333 i3s a text with the number 2")
Out[3]: 'this i3s a text with the number'

In [4]: # Removing custom regex, for example all #foo and @bar
   ...: import re
   ...: regex = re.compile(r"[#@]\w+", re.IGNORECASE)
   ...: c.replace_custom_regex(regex=regex, text= "Test a string #foo to replace @bar")
   ...:
Out[4]: 'Test a string  to replace '

In [5]: # This is the default one, with all stop words:
   ...: c.remove_stop_words("Test to remove stop words")
   ...: 'test remove stop words'
   ...:
Out[5]: 'test remove stop words'

In [6]: # This is with a custom set of stop words (in case that you want to use your own set):
   ...: c.remove_stop_words("Test to remove stop words", custom_stop_words=['test', 'to'])
   ...: 'remove stop words'
   ...:
Out[6]: 'remove stop words'

davidmogar commented 7 years ago

Thank you for your contribution. I really appreciate it.

It will take me some time to review it. As you can see, this is still a one-person project and I'm a bit busy these days. But I will definitely review it and try to incorporate your great suggestions.

Cheers.