geekusa / nlp-text-analytics

ERROR unary operator expected #7

Open lindonm opened 4 days ago

lindonm commented 4 days ago

Potentially related to a recent update to Splunk_SA_Scientific_Python_linux_x86_64. We are attempting to downgrade that app, but because we are on Splunk Cloud and the app is larger than 500 MB, we cannot do so ourselves and are waiting on the support team.

The following search fails with an error:

sourcetype="o365:management:activity" (Operation="New-InboxRule" OR Operation="Set-InboxRule") 
| eval textcheck="My Text Here"    
| fields textcheck
| cleantext textfield=textcheck keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false

Streamed search execute failed because: Error in 'cleantext' command: External search command exited unexpectedly with non-zero error code 1..

Error log details

10-27-2024 22:42:08.255 INFO  UnifiedSearch [205064 localCollectorThread] - Initialization of search data structures took 829 ms
10-27-2024 22:42:08.255 INFO  UnifiedSearch [205064 localCollectorThread] - Processed search targeting arguments
10-27-2024 22:42:08.255 INFO  ServerConfig [205064 localCollectorThread] - Will add app jailing prefix /opt/splunk/bin/nsjail-wrapper for nlp-text-analytics
10-27-2024 22:42:08.255 INFO  ChunkedExternProcessor [205064 localCollectorThread] - Running process: /opt/splunk/bin/nsjail-wrapper /opt/splunk/bin/python3.7m /opt/splunk/etc/apps/nlp-text-analytics/bin/cleantext.py
10-27-2024 22:42:08.375 INFO  ReducePhaseExecutor [205041 StatusEnforcerThread] - ReducePhaseExecutor=1 action=PREVIEW
10-27-2024 22:42:08.503 ERROR ChunkedExternProcessor [207076 ChunkedExternProcessorStderrLogger] - stderr:  Failed to run splunk as SPLUNK_OS_USER. This command can only be run by bootstart user.
10-27-2024 22:42:08.503 ERROR ChunkedExternProcessor [207076 ChunkedExternProcessorStderrLogger] - stderr: /opt/splunk/etc/apps/Splunk_SA_Scientific_Python_linux_x86_64/bin/linux_x86_64/bin/python: line 5: [: ==: unary operator expected
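
For context on the last stderr line: bash prints `[: ==: unary operator expected` when a `[ ... ]` test uses an unquoted variable that expands to nothing, leaving `==` with no left-hand operand. The wrapper's actual line 5 is not shown in this thread, so the sketch below only reproduces the class of error; the variable `mode`, the string "expected", and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: `mode` is a hypothetical variable, not taken from the real wrapper.

# Unquoted expansion: an empty $1 vanishes, so the test becomes `[ == "expected" ]`.
check_unquoted() {
    [ $1 == "expected" ]
}

# Quoted expansion: an empty $1 is still passed to `[` as an empty string argument.
check_quoted() {
    [ "$1" == "expected" ]
}

mode=""   # empty, as if a helper command had printed nothing

# Capture stderr and exit status of each variant without aborting under `set -e`.
unquoted_rc=0
unquoted_err="$(check_unquoted "$mode" 2>&1)" || unquoted_rc=$?

quoted_rc=0
quoted_err="$(check_quoted "$mode" 2>&1)" || quoted_rc=$?

printf 'unquoted: rc=%s stderr=%s\n' "$unquoted_rc" "$unquoted_err"
printf 'quoted:   rc=%s stderr=%s\n' "$quoted_rc" "$quoted_err"
```

In bash, the unquoted form reports a test error on stderr, while the quoted form simply evaluates to false and prints nothing. If the wrapper follows this pattern, the message would indicate that some value it expected to be set was empty inside the jailed process, which would fit the preceding "Failed to run splunk as SPLUNK_OS_USER" line, though that link is an assumption.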

If, however, I run this search, it runs as expected with no errors:

| makeresults
| eval textcheck="My Text Here"    
| fields textcheck
| cleantext textfield=textcheck keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false

This search also works, apparently just by limiting the results:

sourcetype="o365:management:activity" (Operation="New-InboxRule" OR Operation="Set-InboxRule") 
| head 1
| eval textcheck="My Text Here"    
| fields textcheck
| cleantext textfield=textcheck keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false

I have experimented with limits from "| head 1" to "| head 10000". They all work, but as soon as I remove the head command the search fails. Note that my selected time period contains only 24 entries, so even "| head 1000" does not drop any results, yet the search succeeds with it and fails with the error without it.

Splunk Cloud Version: 9.2.2406.107 (Victoria)

nlp-text-analytics v1.2.0
Splunk_SA_Scientific_Python_linux_x86_64 v4.2.1
Splunk_ML_Toolkit v5.4.2

geekusa commented 3 days ago

@lindonm unfortunately I don't have Splunk Cloud. Testing with the versions of the other apps you listed, I have not been able to recreate the issue so far. I am wondering whether there are any strange characters in the events that it may be choking on. Does this only occur with sourcetype="o365:management:activity"? Can you try a search like this one, which uses the built-in lookups to produce a result of over 10000 events?

| inputlookup moby_dick.csv
| append 
  [ inputlookup peter_pan.csv]
| cleantext textfield=sentence keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false 

lindonm commented 16 hours ago

Thanks @geekusa,

I ran that query and it succeeds with no errors in the UI:

This search has completed and has returned 12,750 results by scanning 0 events in 25.786 seconds
The following messages were returned by the search subsystem:
info : [subsearch]: Successfully read lookup file '/opt/splunk/etc/apps/nlp-text-analytics/lookups/peter_pan.csv'.

I did note some of the same/similar errors in the search log, so maybe those are a red herring.

10-31-2024 23:21:59.314 INFO  SearchParser [3300457 searchOrchestrator] - PARSING: | inputlookup moby_dick.csv\n| append \n  [ inputlookup peter_pan.csv]\n| cleantext textfield=sentence keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false
10-31-2024 23:21:59.318 INFO  ServerConfig [3300457 searchOrchestrator] - Will add app jailing prefix /opt/splunk/bin/nsjail-wrapper for nlp-text-analytics
10-31-2024 23:21:59.318 INFO  ChunkedExternProcessor [3300457 searchOrchestrator] - Running process: /opt/splunk/bin/nsjail-wrapper /opt/splunk/bin/python3.7m /opt/splunk/etc/apps/nlp-text-analytics/bin/cleantext.py
10-31-2024 23:21:59.382 ERROR ChunkedExternProcessor [3300462 ChunkedExternProcessorStderrLogger] - stderr:  Failed to run splunk as SPLUNK_OS_USER. This command can only be run by bootstart user.
10-31-2024 23:21:59.382 ERROR ChunkedExternProcessor [3300462 ChunkedExternProcessorStderrLogger] - stderr: /opt/splunk/etc/apps/Splunk_SA_Scientific_Python_linux_x86_64/bin/linux_x86_64/bin/python: line 5: [: ==: unary operator expected
10-31-2024 23:22:00.690 INFO  SearchParser [3300457 searchOrchestrator] - PARSING:  inputlookup peter_pan.csv
10-31-2024 23:22:00.690 INFO  AstOptimizer [3300457 searchOrchestrator] - SrchOptMetrics optimize_toJson=1.373341992
10-31-2024 23:22:00.690 INFO  SearchParser [3300457 searchOrchestrator] - PARSING: | inputlookup "moby_dick.csv" | append [| inputlookup "peter_pan.csv"] | cleantext textfield=sentence keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false
10-31-2024 23:22:00.690 INFO  SearchParser [3300457 searchOrchestrator] - PARSING: | inputlookup "moby_dick.csv" | append [| inputlookup "peter_pan.csv"] | cleantext textfield=sentence keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false
10-31-2024 23:22:00.690 INFO  ServerConfig [3300457 searchOrchestrator] - Will add app jailing prefix /opt/splunk/bin/nsjail-wrapper for nlp-text-analytics
10-31-2024 23:22:00.690 INFO  ChunkedExternProcessor [3300457 searchOrchestrator] - Running process: /opt/splunk/bin/nsjail-wrapper /opt/splunk/bin/python3.7m /opt/splunk/etc/apps/nlp-text-analytics/bin/cleantext.py
10-31-2024 23:22:00.746 ERROR ChunkedExternProcessor [3300527 ChunkedExternProcessorStderrLogger] - stderr:  Failed to run splunk as SPLUNK_OS_USER. This command can only be run by bootstart user.
10-31-2024 23:22:00.746 ERROR ChunkedExternProcessor [3300527 ChunkedExternProcessorStderrLogger] - stderr: /opt/splunk/etc/apps/Splunk_SA_Scientific_Python_linux_x86_64/bin/linux_x86_64/bin/python: line 5: [: ==: unary operator expected
10-31-2024 23:22:01.406 INFO  SearchParser [3300457 searchOrchestrator] - PARSING: | inputlookup "peter_pan.csv"
10-31-2024 23:22:01.410 INFO  AstOptimizer [3300457 searchOrchestrator] - SrchOptMetrics optimize_toJson=0.717763582

lindonm commented 15 hours ago

Further to this, I experimented to determine how much impact the actual source data has:

Works:

| makeresults
| eval textcheck="My Text Here"    
| fields textcheck
| cleantext textfield=textcheck keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false
| append 
    [search sourcetype="o365:management:activity" (Operation="New-InboxRule" OR Operation="Set-InboxRule") 
    | head 1]

Works:

| makeresults
| eval textcheck="My Text Here"    
| fields textcheck
| cleantext textfield=textcheck keep_orig=true base_word=true remove_stopwords=false force_nltk_tokenize=true base_type="lemma_pos" term_min_len=1 ngram_mix=false
| append 
    [search sourcetype="o365:management:activity" (Operation="New-InboxRule" OR Operation="Set-InboxRule") 
    ]