geekusa / nlp-text-analytics

Add support to supported NLTK languages #4

Closed pa1007 closed 2 years ago

pa1007 commented 2 years ago

Hi,

I was working with this app and wanted to add other languages. I added support for other languages in the cleantext command and wanted to share it here. Feel free to give me feedback on what you think of it.

Thanks

Glarell commented 2 years ago

Very useful functionality! Go for it :) 👍

pa1007 commented 2 years ago

@geekusa what do you think of it?

geekusa commented 2 years ago

Sorry, I've been terribly busy; I'll take a look. It sounds like a great idea, and it looks like you've covered your bases. My only immediate concern is the size it adds to the package (as Splunk has limits). We should also decide whether to include any example texts and whether to list the supported languages in the README. Sounds like we will also need a CONTRIBUTERS text now too (which is great!)

pa1007 commented 2 years ago

Thank you for the response. I know that file size can be a concern so we could remove some languages that could be less useful to the user and propose a way to guide them through adding more languages with a README file located in the bin/nltk_data/ folder or an explanation in the main README.

For the example data, I can start to look around if you want ;)

geekusa commented 2 years ago

Looks like, when compressed, the changes as-is make the package around 13 MB bigger (37 MB total); the Splunk upload limit for an app is 50 MB. That is not too egregious, and I could be convinced otherwise, but I do like your idea of removing languages that are less likely to be used, keeping a standard set, and adding some instructions to the README. I was thinking maybe Dutch, French, German, Italian, Portuguese, Russian, and Spanish? Also, instead of a text input for the examples in the dashboards, maybe we can have a drop-down so users don't mistype. If you agree, I could take the pull request as-is and make the changes afterwards, or you could make the changes first?
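For anyone who wants to repeat the size check above, here is a minimal sketch, not taken from the repository. It assumes the app is packaged as a gzipped tarball and uses the 50 MB limit quoted in this thread; the function names `compressed_size_mb` and `fits_upload_limit` are hypothetical.

```python
import io
import os
import tarfile

# Upload limit quoted in this thread; treat it as an assumption and
# check current Splunk documentation before relying on it.
SPLUNK_UPLOAD_LIMIT_MB = 50.0


def compressed_size_mb(app_dir: str) -> float:
    """Pack app_dir into an in-memory .tar.gz and return its size in MB."""
    buffer = io.BytesIO()
    arcname = os.path.basename(os.path.normpath(app_dir))
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        archive.add(app_dir, arcname=arcname)
    # The archive is fully flushed once the tarfile context has closed.
    return len(buffer.getvalue()) / (1024 * 1024)


def fits_upload_limit(size_mb: float, limit_mb: float = SPLUNK_UPLOAD_LIMIT_MB) -> bool:
    """Return True when a package of size_mb is within the upload limit."""
    return size_mb <= limit_mb
```

With the figures mentioned above, `fits_upload_limit(37.0)` is true, leaving roughly 13 MB of headroom under the quoted limit.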

I was trying to find data on the most popular literary languages, which is tough to find. I did see this, though: https://www.statista.com/statistics/262946/share-of-the-most-common-languages-on-the-internet/

Example data probably isn't so important right now, as I wouldn't be able to tell if the examples are working anyhow.

pa1007 commented 2 years ago

I'm okay with your language selection. I found another source for your data ( https://en.wikipedia.org/wiki/Languages_used_on_the_Internet ), so I guess I can remove the others.

I made the changes in the next commits (removed the languages that were not selected and added a drop-down in the dashboard). I decided to stick with lowercase labels in the drop-down selector for the languages so as not to confuse users who want to use the command and its arguments in the search tab.
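To illustrate the lowercase convention, here is a small sketch of how a language argument could be normalized and checked against the agreed set. This is not the code from the pull request: `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names, and including `english` as the pre-existing default is an assumption.

```python
# Languages agreed on in this thread, as lowercase labels.
# "english" is assumed to be the default that was already supported.
SUPPORTED_LANGUAGES = frozenset(
    {
        "english",
        "dutch",
        "french",
        "german",
        "italian",
        "portuguese",
        "russian",
        "spanish",
    }
)


def normalize_language(value: str) -> str:
    """Lowercase and trim a language name, rejecting unsupported ones."""
    language = value.strip().lower()
    if language not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {value!r}; expected one of: {supported}")
    return language
```

Keeping the drop-down labels identical to the accepted argument values means a value copied from the dashboard works unchanged in a search.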

I added a section near the end of the main readme of the project on how to add more languages but I will let you arrange and modify it as you wish!

pa1007 commented 2 years ago

@geekusa did you have time to look at the commits?

geekusa commented 2 years ago

Looks good, merged! I'll get going on a CONTRIBUTERS file

pa1007 commented 2 years ago

@geekusa thanks a lot! Do you know when it will be available to download on Splunkbase?

geekusa commented 2 years ago

I'm working on another issue right now. When that is complete, I just need to do testing for Splunk 9.x and then I will submit it.

geekusa commented 2 years ago

It is available on Splunkbase now: https://splunkbase.splunk.com/app/4066/release/1.1.4/download
It can take a while to get Splunk Cloud approval; when that happens, I'll make it the default.

geekusa commented 2 years ago

Actually, that was fast: it has been approved, and version 1.1.4 is now the default.