geekusa / nlp-text-analytics

Cyrillic text issues #2

Closed vani0vani0 closed 3 years ago

vani0vani0 commented 3 years ago

Hi, I am doing some experiments with this add-on and have tried to perform some sentiment analysis with the "vader" command. When I perform this test on English sentences it works flawlessly, but when I replace the sentence with Bulgarian, then it doesn't. No errors are found in the splunkd.log or the mlspl.log files.

Here are the two searches:

| makeresults count=1 | eval text="This one works very well." | vader textfield=text

^ This one works fine and returns the expected results.

| makeresults count=1 | eval text="Нещо не работи като хората и това не ми харесва." | vader textfield=text

^ This one does not work and returns no results (actually, the search never finishes).

The same issue occurs with other commands from the NLP add-on, e.g. cleantext. Normal SPL searches (in both English and Bulgarian) work fine in this environment. The environment is an all-in-one Splunk Enterprise 8.1.3 server on 64-bit Kali Linux. Python is v2.7.18.

Any ideas?

geekusa commented 3 years ago

Hi @vani0vani0, can you try replacing the folder nlp-text-analytics/bin/splunklib with splunklib from Splunk's latest SDK (https://github.com/splunk/splunk-sdk-python/tree/master/splunklib) and see if that fixes it?

vani0vani0 commented 3 years ago

Thank you, @geekusa. This actually fixed it. The commands now work fine, but there is another issue with the sentiment analysis command "vader": the sentiment value is always 0 when I use Cyrillic/Bulgarian text. If I add an English sentence to the text, the sentiment returned is nonzero.

Do you have any suggestion where to look at next?

geekusa commented 3 years ago

Glad to hear it, @vani0vani0. I went ahead and pushed the latest SDK into the repo. For the non-English sentiment question, the answer is murkier, but it leans towards "no", depending on how much work you want to put in. VADER is a rule-based sentiment analyzer (which means you can't train it), and its lexicon ($SPLUNK_HOME/nlp-text-analytics/bin/nltk_data/sentiment/vader_lexicon/vader_lexicon.txt) is strictly English. But if you look at this Stack Overflow question and its answers (https://stackoverflow.com/questions/45275166/is-vader-sentimentintensityanalyzer-multilingual), there might be a way to use a web translator or build your own lexicon in the chosen language.
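
To make the "always 0" result and the "build your own lexicon" idea concrete, here is a minimal, self-contained sketch of a lexicon-based scorer. It is not VADER itself and does not touch the add-on: the lexicon entries and their scores are made up for illustration, the negation rule is a crude simplification, and names such as `score` and `BULGARIAN_LEXICON` are hypothetical. The only part borrowed from VADER is the shape of the final normalization, `x / sqrt(x^2 + 15)`. The sketch assumes Python 3, where `\w` matches Cyrillic letters by default.

```python
import math
import re

# Illustrative English entries; the scores are made up for this sketch and
# are NOT copied from vader_lexicon.txt.
ENGLISH_LEXICON = {"well": 1.1, "good": 1.9, "bad": -2.5}

# Hypothetical hand-built Bulgarian entries (scores are also made up).
BULGARIAN_LEXICON = {"харесва": 1.8, "лошо": -2.0}

# Simplified negation words; an assumption of this sketch, not VADER's list.
NEGATIONS = {"not", "не"}


def tokenize(text):
    # In Python 3, \w is Unicode-aware for str patterns, so Cyrillic words match.
    return re.findall(r"\w+", text.lower())


def score(text, lexicon, alpha=15.0):
    """Sum lexicon valences and squash the total into the range (-1, 1)."""
    tokens = tokenize(text)
    total = 0.0
    for i, tok in enumerate(tokens):
        valence = lexicon.get(tok, 0.0)  # words missing from the lexicon add 0
        # Crude negation: flip the sign if a negation word appears among the
        # three preceding tokens.
        if valence and any(t in NEGATIONS for t in tokens[max(0, i - 3):i]):
            valence = -valence
        total += valence
    return total / math.sqrt(total * total + alpha)


bulgarian = "Нещо не работи като хората и това не ми харесва."
merged = {**ENGLISH_LEXICON, **BULGARIAN_LEXICON}

print(score(bulgarian, ENGLISH_LEXICON))  # no Bulgarian word is in the lexicon
print(score(bulgarian, merged))           # the negated "харесва" now contributes
```

With the English-only lexicon, none of the Bulgarian tokens is found, so every word contributes 0 and the total stays at 0, which matches the behaviour reported above. Once entries for the target language exist, the same sentence produces a negative score. A real custom lexicon would need far more entries and carefully rated scores than this toy dictionary.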