geekusa / nlp-text-analytics


NLP Text Analytics Splunk App

The intent of this app is to provide a simple interface for analyzing text in Splunk using Python natural language processing libraries (currently just NLTK 3.9.1). The app provides custom commands and dashboards that show how to use them.

Available at: Github

Splunk App Available at: https://splunkbase.splunk.com/app/4066/

Version: 1.2.0

Author: Nathan Worsham

Created for MSDS692 Data Science Practicum I at Regis University, 2018
See associated blog for detailed information on the project creation.

Update: Additional content (combined features algorithms) created for MSDS696 Data Science Practicum II at Regis University, 2018
See associated blog for detailed information on the project creation. This app was part of the basis for a breakout session I was lucky enough to present at Splunk Conf18--Extending Splunk MLTK using GitHub Community. See the Session Slides and Session Recording.

Contributors: Since release to open source, the project now has contributors! See CONTRIBUTORS

Description and Use-cases

Have you ever wanted to perform advanced text analytics inside Splunk? Splunk has some ways to handle text but also lacks some more advanced features that NLP libraries can offer. This can also benefit use-cases that involve using Splunk’s ML Toolkit.

Note About Visualization Deprecation

Many of the visualizations that the dashboards in this app use are either deprecated or will reach end of support on Dec 21, 2024 (see https://lantern.splunk.com/@go/page/7824). This app's original intention was to provide custom commands and extend the algorithms from sklearn that MLTK does not implement; the dashboards were only meant as examples of how to use them. Some of these visualizations can see extended life, as shown in https://community.splunk.com/t5/All-Apps-and-Add-ons/Failed-to-load-source-for-Wordcloud-visualization/m-p/665787#M79891, but that only delays the inevitable. Work has begun on migrating the dashboards to Dashboard Studio versions, but as of Splunk 9.3.0 there are still many features of SimpleXML that do not have an equivalent in Dashboard Studio.

Requirements

Splunk ML Toolkit 3.2 or greater https://splunkbase.splunk.com/app/2890/
Python for Scientific Computing (download appropriate version for platform being used)
Wordcloud Custom Visualization https://splunkbase.splunk.com/app/3212/ (preferred) OR Splunk Dashboard Examples https://splunkbase.splunk.com/app/1603/
Parallel Coordinates Custom Visualization https://splunkbase.splunk.com/app/3137/
Force Directed App For Splunk https://splunkbase.splunk.com/app/3767/
Halo - Custom Visualization https://splunkbase.splunk.com/app/3514/
Sankey Diagram - Custom Visualization https://splunkbase.splunk.com/app/3112/

How to use

Install

Normal app installation steps can be followed from https://docs.splunk.com/Documentation/AddOns/released/Overview/AboutSplunkadd-ons. Essentially, download the app and install it from the Web UI, or extract the file into the $SPLUNK_HOME/etc/apps folder.

Example Texts

The app comes with example Gutenberg texts formatted as CSV lookups, along with the popular "20 newsgroups" dataset. Load them with the syntax | inputlookup <filename.csv>

Text Names

20newsgroups.csv
moby_dick.csv
peter_pan.csv
pride_prejudice.csv

Custom Commands

bs4

Description

A wrapper script that brings some of BeautifulSoup4's functionality to Splunk for extracting html/xml tags and their text. The default is to get the text and send it to a new field 'get_text'; otherwise the selection is returned in a field named 'soup'. The default parser is 'lxml', though you can specify others ('html5lib' is not currently included). The find methods can be used in conjunction; their order of operation is find > find_all > find_child > find_children. Each option has a similarly named option appended with '_attrs' that accepts inner and outer quoted key:value pairs for more precise selections.
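
Conceptually, the get_text behavior strips the markup and keeps only the text content. This is not the app's actual implementation (which uses BeautifulSoup4); it is a rough sketch of the idea using only the Python standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, ignoring all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

def get_text(html):
    """Return the tag-free text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

print(get_text("<div><a href='x'>link text</a> and more</div>"))  # link text and more
```

BeautifulSoup's real get_text also handles malformed markup and entity decoding; this sketch only illustrates the tag-stripping concept.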

Syntax

* | bs4 textfield=<field> [get_text=<bool>] [get_text_label=<string>] [get_attr=<attribute>] [parser=<parser>] [find=<tag>] [find_attrs=<quoted_key:value_pairs>] [find_all=<tag(s)>] [find_all_attrs=<quoted_key:value_pairs>] [find_child=<tag>] [find_child_attrs=<quoted_key:value_pairs>] [find_children=<tag(s)>] [find_children_attrs=<quoted_key:value_pairs>]

Required Arguments

textfield
Syntax: textfield=<field>
Description: The search field that contains the text that is the target.
Usage: Option only takes a single field

Optional Arguments

get_text
Syntax: get_text=<bool>
Description: If true, returns the text minus html/xml formatting for the given selection and places it in the field get_text; otherwise returns the selection in a field called soup.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: True

get_text_label
Syntax: get_text_label=<string>
Description: If get_text is set, sets the label for the return field.
Usage: String value
Default: get_text

get_attr
Syntax: get_attr=<attribute>
Description: If set, returns attribute value for given selection and places in field of the same name.
Usage: String value

parser
Syntax: parser=<parser>
Description: Corresponds to parsers listed here (currently html5lib not packaged with so not an option).
Usage: Possible values are html.parser, lxml, lxml-xml, or xml
Default: lxml

find
Syntax: find=<tag>
Description: Corresponds to the name attribute of BeautifulSoup's find method.
Usage: HTML or XML element name

find_attrs
Syntax: find_attrs=<quoted_key:value_pairs>
Description: Corresponds to the attrs attribute of BeautifulSoup's find method. Expects inner and outer quoted key:value pairs, comma-separated but contained in outer quotes.
Usage: "'key1':'value1','key2':'value2'"

find_all
Syntax: find_all=<tag(s)>
Description: Corresponds to the name attribute of BeautifulSoup's find_all method. Order of operation is find > find_all > find_child > find_children so can be used in conjunction. Can find one or more tags by comma separating tags (also quote entire option) i.e. find_all="div, a".
Usage: HTML or XML element name

find_all_attrs
Syntax: find_all_attrs=<quoted_key:value_pairs>
Description: Corresponds to the attrs attribute of BeautifulSoup's find_all method. Expects inner and outer quoted key:value pairs, comma-separated but contained in outer quotes.
Usage: "'key1':'value1','key2':'value2'"

find_child
Syntax: find_child=<tag>
Description: Corresponds to the name attribute of BeautifulSoup's find_child method. Order of operation is find > find_all > find_child > find_children so can be used in conjunction.
Usage: HTML or XML element name

find_child_attrs
Syntax: find_child_attrs=<quoted_key:value_pairs>
Description: Corresponds to the attrs attribute of BeautifulSoup's find_child method. Expects inner and outer quoted key:value pairs, comma-separated but contained in outer quotes.
Usage: "'key1':'value1','key2':'value2'"

find_children
Syntax: find_children=<tag(s)>
Description: Corresponds to the name attribute of BeautifulSoup's find_children method. Order of operation is find > find_all > find_child > find_children so can be used in conjunction.
Usage: HTML or XML element name

find_children_attrs
Syntax: find_children_attrs=<quoted_key:value_pairs>
Description: Corresponds to the attrs attribute of BeautifulSoup's find_children method. Expects inner and outer quoted key:value pairs, comma-separated but contained in outer quotes.
Usage: "'key1':'value1','key2':'value2'"

cleantext

Description

Tokenize and normalize text (remove punctuation and digits, and reduce words to their base form). The options trade cleaning quality for speed: base_type="lemma_pos" is the slowest option, while base_type="lemma" assumes every word is a noun, which is faster but still results in decent lemmatization. Most fields have a default already set; textfield is the only required field. By default the command returns a multi-valued field that is ready for use with stats count by. Optionally it can return special fields for analysis--pos_tags and ngrams.
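
The default cleaning steps (lowercase, strip punctuation and digits, tokenize, remove stopwords) can be sketched in plain Python. This is not the command's actual code, and the stopword list here is a tiny illustrative subset of the English list NLTK provides:

```python
import re

# Illustrative subset only; the real command uses NLTK's English stopword list.
STOPWORDS = {"the", "i", "a", "an", "and", "of", "to", "in", "is", "it"}

def clean_text(text, remove_stopwords=True):
    """Rough sketch of default_clean: lowercase, drop punctuation/digits, tokenize."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and digits become spaces
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens  # multi-valued result, ready for stats count by

print(clean_text("The whale, in Chapter 32, is white!"))  # ['whale', 'chapter', 'white']
```

Lemmatization and stemming (the base_word options) would be applied to each token after this stage.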

Syntax

* | cleantext textfield=<field> [keep_orig=<bool>] [default_clean=<bool>] [remove_urls=<bool>] [remove_stopwords=<bool>] [base_word=<bool>] [base_type=<lemma|lemma_pos|stem>] [mv=<bool>] [force_nltk_tokenize=<bool>] [pos_tagset=<None|universal>] [custom_stopwords=<comma_separated_string_list>] [term_min_len=<int>] [ngram_range=<int>-<int>] [ngram_mix=<bool>]

Required Arguments

textfield
Syntax: textfield=<field>
Description: The search field that contains the text that is the target of the analysis.
Usage: Option only takes a single field

Optional Arguments

keep_orig
Syntax: keep_orig=<bool>
Description: Maintain a copy of the original text, for comparison or searching, in a field called orig_text.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: False

default_clean
Syntax: default_clean=<bool>
Description: Perform basic text cleaning--lowercase, remove punctuation and digits, and tokenization.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: True

remove_urls
Syntax: remove_urls=<bool>
Description: Before cleaning remove html links.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: True

remove_stopwords
Syntax: remove_stopwords=<bool>
Description: Remove stopwords (i.e. common words like "the" and "I"); currently only English is supported.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: True

base_word
Syntax: base_word=<bool>
Description: Turns on lemmatization or stemming, dependent on the value of base_type.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: True

base_type
Syntax: base_type=<lemma|lemma_pos|stem>
Description: Sets the type of word base to use, dependent on base_word being set to True. Lemmatization without POS tagging (option lemma) assumes every word is a noun but produces comparable output faster. Lemmatization with POS tagging (lemma_pos) is slower but more precise, and also adds a new field pos_tag. When set to lemma_pos, this automatically sets the force_nltk_tokenize argument to true. The Porter Stemmer is used when the option is set to stem.
Usage: Possible values are lemma, lemma_pos, stem
Default: lemma

mv
Syntax: mv=<bool>
Description: Returns the output as a multi-value field (ready for use with stats count), otherwise returns a space-separated string.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: True

pos_tagset
Syntax: pos_tagset=<None|universal>
Description: Sets the option for the tagset used--Advanced Perceptron tagger (None) or universal.
Usage: None or universal
Default: None

term_min_len
Syntax: term_min_len=<int>
Description: Only terms with a length greater than or equal to this number will be returned.
Usage: Integer value of the minimum length of terms to return
Default: 0

ngram_range
Syntax: ngram_range=<int>-<int>
Description: Returns new ngram column with range of ngrams specified if max is greater than 1.
Usage: Generally values like 1-2 (same as 2-2), 2-3, 2-4 are used, ngrams above 4 may not provide much value
Default: 1-1
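
The ngram column can be pictured as every run of n adjacent tokens for each n in the requested range. A minimal sketch of that expansion (not the command's actual code):

```python
def ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max, joined with spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

# ngram_range=1-2 over three tokens yields the unigrams plus the bigrams.
print(ngrams(["white", "whale", "ahab"], 1, 2))
# ['white', 'whale', 'ahab', 'white whale', 'whale ahab']
```

With ngram_mix=false each n would land in its own column; this sketch simply combines them in one list.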

ngram_mix
Syntax: ngram_mix=<bool>
Description: Determines whether ngram output is combined into a single column or returned as separate columns. Defaults to false, which results in separate columns.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: False

similarity

Description

A wrapper for NLTK distance metrics for comparing text in Splunk. Similarity (and distance) metrics can be used to tell how far apart two pieces of text are, and some algorithms also return the number of steps needed to make the texts the same. These do not extract meaning, but are often used in text analytics to discover plagiarism, conduct fuzzy searching, spell checking, and more. Defaults to the Levenshtein distance algorithm but includes several others, including some set-based algorithms. Can handle multi-valued comparisons, with an option to limit to a given number of top matches. Multi-valued output can be zipped together or returned separately.

Syntax

* | similarity textfield=<field> comparefield=<field> [algo=<algorithm>] [limit=<int>] [mvzip=<bool>]

Required Arguments

textfield
Syntax: textfield=<field>
Description: Name of the field that contains the source text for the comparison. Field can be multi-valued.
Usage: Option only takes a single field

comparefield
Syntax: comparefield=<field>
Description: Name of the field that will contain the target text to compare against. Field can be multi-valued.
Usage: Option only takes a single field

Optional Arguments

algo
Syntax: algo=<algorithm>
Description: Algorithm used for determining text similarity. Options are levenshtein, damerau, jaro, jaro_winkler, jaccard, and masi. Defaults to levenshtein. See included dashboard for explanation of each algorithm
Usage: Algorithm name, options are levenshtein, damerau, jaro, jaro_winkler, jaccard, and masi.
Default: levenshtein
Algorithm Explanations:
levenshtein = Levenshtein Distance - Also known as edit distance, this algorithm measures how many steps (or operations) it takes to turn one string into another. The steps include insertions, deletions, and substitutions.
damerau = Damerau-Levenshtein Distance - Also known as edit distance with transposition, it differs from the traditional Levenshtein distance by also allowing transpositions (of two neighboring characters) as one of the edits. This can result in fewer steps for some comparisons; for example, 'brain' and 'brian' would be 2 steps in the traditional Levenshtein algorithm but 1 step in Damerau-Levenshtein.
jaro = Jaro Similarity - A similarity algorithm that takes into account the length of the text comparisons, the number of characters that match (within a certain number of positions based on length), as well as the number of transpositions.
jaro_winkler = Jaro-Winkler Similarity - Like the Jaro similarity algorithm, this algorithm takes into account the length of the text comparisons, the number of matching characters, and the number of transpositions; however, it also gives higher precedence to matching a quantity of the beginning characters.
jaccard = Jaccard Distance - A set-based distance algorithm that measures shared members of each set. Because it is set based, an example such as 'brain' and 'brian' would match completely, since the sets do not differentiate order. However, set-based algorithms can do well with sentences, as any space-separated words will be compared at the word level rather than the character level (a good place to use the cleantext command first with lemmatization).
masi = MASI Distance - A set-based distance algorithm whose name means Measuring Agreement on Set-Valued Items (MASI) for Semantic and Pragmatic Annotation. This algorithm is an implementation of the Jaccard Distance but gives weight to "monotonicity" (essentially repeating members). Like jaccard, it is set based, so it compares set membership rather than character order, and it can do well with sentences compared at the word level.
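
To make the character-level vs. set-based contrast concrete, here are minimal stdlib sketches of Levenshtein and Jaccard distance (the app itself uses NLTK's implementations, not these):

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def jaccard_distance(a, b):
    """Set-based distance: 1 - |A ∩ B| / |A ∪ B| over the character sets."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)

print(levenshtein("brain", "brian"))       # 2: the swap costs two substitutions
print(jaccard_distance("brain", "brian"))  # 0.0: identical character sets
```

This shows why 'brain' vs. 'brian' costs 2 under Levenshtein (no transposition edit) yet 0 under Jaccard (order is ignored entirely).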

limit
Syntax: limit=<int>
Description: When using multi-valued comparisons, this value limits the number of top matches returned.
Usage: Integer value of the maximum number of top matches to return
Default: 10

mvzip
Syntax: mvzip=<bool>
Description: When using multi-valued comparisons and this option is true, the output is similar to using Splunk's mvzip option. Output is value:top_match_target for a single-valued to multi-valued comparison and value:top_match_source>top_match_target for a multi-valued to multi-valued comparison.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: False

vader

Description

Sentiment analysis using the Valence Aware Dictionary and sEntiment Reasoner (VADER). Using the option full_output will return scores for neutral, positive, and negative, which are the scores that make up the compound score (returned as the field "sentiment"). It is best to feed in uncleaned data, as VADER takes capitalization and punctuation into account.
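
The compound score is a squashed version of the summed word valences. To my understanding, VADER normalizes the raw sum into [-1, 1] roughly as sketched below; treat the formula and alpha value as an illustration of the idea, not the app's code:

```python
import math

def normalize(score_sum, alpha=15):
    """Squash a raw valence sum into the open interval (-1, 1), VADER-style."""
    return score_sum / math.sqrt(score_sum * score_sum + alpha)

# A strongly positive raw sum approaches +1; zero stays neutral.
print(normalize(3.4))
print(normalize(0))   # 0.0
```

This is why "sentiment" (the compound score) is bounded while the neutral/positive/negative fields from full_output are proportions of the text.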

Syntax

* | vader textfield=<field> [full_output=<bool>]

Required Arguments

textfield
Syntax: textfield=<field>
Description: The search field that contains the text that is the target of the analysis.
Usage: Option only takes a single field

Optional Arguments

full_output
Syntax: full_output=<bool>
Description: Return scores for neutral, positive, and negative which are the scores that make up the compound score.
Usage: Boolean value. True or False; true or false, t or f, 0 or 1
Default: False

ML Algorithms

TruncatedSVD

Description

From sklearn. Used for dimension reduction (especially on a TFIDF). This is also known in text analytics as Latent Semantic Analysis or LSA. Returns fields prepended with "SVD_". See http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

Syntax

fit TruncatedSVD <fields> [into <model name>] k=<int>

The k option sets the number of components to reduce the data to. It is important that the value is less than the number of features or documents. The documentation on the algorithm recommends that it be set to at least 100 for LSA.

LatentDirichletAllocation

Description

From sklearn. Used for dimension reduction. This is also known as LDA. Returns fields prepended with "LDA_". See http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

Syntax

fit LatentDirichletAllocation <fields> [into <model name>] k=<int>

The k option sets the number of components (topics) to change the data into. It is important that the value is less than the number of features or documents.

NMF

Description

From sklearn. Used for dimension reduction. This is also known as Non-Negative Matrix Factorization. Returns fields prepended with "NMF_". See http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

Syntax

fit NMF <fields> [into <model name>] [k=<int>]

The k option sets the number of components (topics) to change the data into. It is important that the value is less than the number of features or documents.

TFBinary

Description

A modified implementation of TfidfVectorizer from sklearn. The current MLTK version has TfidfVectorizer, but it does not allow the option of turning off IDF or setting binary to True. The purpose here is to create a document-term matrix recording whether or not each document contains a given term. See http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Syntax

fit TFBinary <fields> [into <model name>] [max_features=<int>] [max_df=<int>] [min_df=<int>] [ngram_range=<int>-<int>] [analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english] [use_idf=<true|false>] [binary=<true|false>]

In this implementation, the following settings are already set in order to create a binary output: use_idf is set to False, binary has been set to True, and norm has been set to None. The rest of the settings and options are exactly like the current MLTK (3.4) implementation.
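
The binary output described above can be illustrated without sklearn: each row is a document, each column a vocabulary term, and each cell is 1 if the term appears at all. A minimal sketch (hypothetical helper, not the algorithm's code):

```python
def binary_dtm(docs):
    """Binary document-term matrix: 1 if the term appears in the doc, else 0."""
    vocab = sorted({t for d in docs for t in d.split()})
    matrix = [[1 if t in d.split() else 0 for t in vocab] for d in docs]
    return vocab, matrix

vocab, m = binary_dtm(["whale white whale", "white sail"])
print(vocab)  # ['sail', 'whale', 'white']
print(m)      # [[0, 1, 1], [1, 0, 1]]
```

Note the repeated "whale" still yields a 1, not a 2: with binary=True, presence is all that is recorded, which is exactly what TFBinary enforces by fixing use_idf=False, binary=True, and norm=None.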

MinMaxScaler

Description

From sklearn. Transforms each feature to a given range. Returns fields prepended with "MMS_". See http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Syntax

fit MinMaxScaler <fields> [into <model name>] [copy=<true|false>] [feature_range=<int>-<int>]

Default feature_range=0-1 copy=true.
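
The transform itself is a simple linear rescale of each feature onto the target range. A stdlib sketch of the formula for a single feature (sklearn applies it per column, with edge-case handling this sketch only hints at):

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Scale values linearly so min(values) -> lo and max(values) -> hi."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin or 1  # avoid division by zero for constant input
    return [lo + (v - vmin) * (hi - lo) / span for v in values]

print(min_max_scale([2, 4, 10]))  # [0.0, 0.25, 1.0]
```

This makes features with very different magnitudes (e.g. term counts vs. document lengths) comparable before clustering or classification.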

LinearSVC

Description

From sklearn. Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. See http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

Syntax

fit LinearSVC <fields> [into <model name>] [gamma=<int>] [C=<int>] [tol=<int>] [intercept_scaling=<int>] [random_state=<int>] [max_iter=<int>] [penalty=<l1|l2>] [loss=<hinge|squared_hinge>] [multi_class=<ovr|crammer_singer>] [dual=<true|false>] [fit_intercept=<true|false>]

The C option sets the penalty parameter of the error term.

ExtraTreesClassifier

Description

From sklearn. This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

Syntax

fit ExtraTreesClassifier <fields> [into <model name>] [random_state=<int>] [n_estimators=<int>] [max_depth=<int>] [max_leaf_nodes=<int>] [max_features=<int|auto|sqrt|log|None>] [criterion=<gini|entropy>]

The n_estimators option sets the number of trees in the forest, defaults to 10.

Support

Support will be provided through Splunkbase (click on Contact Developer), Splunk Answers, or by submitting an issue on Github. Response times will depend on the issue and on time available, but every attempt will be made to fix issues within 2 weeks.

Documentation

This README file constitutes the documentation for the app and will be kept up to date on Github as well as on the Splunkbase page.

Known Issues

Splunk version 7.0.0 introduced an issue that causes errors in the ML Toolkit when using a free or developer's license; see https://answers.splunk.com/answers/654411/splunk-710-upgrade-of-free-version-finalizes-searc.html. Fixed as of 7.1.2. The Splunk SDK crashes with a buffer error when too much data is sent through it; see https://github.com/splunk/splunk-sdk-python/issues/150. A workaround is to use the sample command to down-sample the data until it works.

How to add more languages

You can find other models directly on the NLTK data website https://www.nltk.org/nltk_data/ (look for id: punkt), or any models compatible with NLTK, and add them to the directories bin/nltk_data/tokenizers/punkt and bin/nltk_data/tokenizers/punkt/PY3

Release Notes

Added GMeans as a Clustering Algorithm option in the Clustering dashboard.
Updated splunklib from 1.6.16 to 2.0.2.
Updated the nltk library from 3.4.5 to 3.9.1 (which now also requires libraries from the Python for Scientific Computing app).
Added Dashboard Studio versions of the Sentiment and Named Entity dashboards.