
= fabric8-analytics-tagger

image:https://ci.centos.org/view/Devtools/job/devtools-fabric8-analytics-tagger-fabric8-analytics/badge/icon[Build status, link="https://ci.centos.org/view/Devtools/job/devtools-fabric8-analytics-tagger-fabric8-analytics/"]

image:https://codecov.io/gh/fabric8-analytics/fabric8-analytics-tagger/branch/master/graph/badge.svg[Code coverage, link="https://codecov.io/gh/fabric8-analytics/fabric8-analytics-tagger"]

Keyword extractor and tagger for fabric8-analytics.

== Usage

To list all available commands, issue:

$ f8a_tagger_cli.py --help
Usage: f8a_tagger_cli.py [OPTIONS] COMMAND [ARGS]...

  Tagger for fabric8-analytics.

Options:
  -v, --verbose  Level of verbosity, can be applied multiple times.
  --help         Show this message and exit.

Commands:
  aggregate  Aggregate keywords to a single file.
  collect    Collect keywords from external resources.
  diff       Compute diff on keyword files.
  lookup     Perform keywords lookup.
  reckon     Compute keywords and stopwords based on stemmer and lemmatizer configuration.

To run a command in verbose mode (adds additional messages), run:

$ f8a_tagger_cli.py -vvvv lookup /path/to/tree/or/file

Verbose output will give you additional insight into the steps that are performed during execution (debug mode).

== Installation using pip

$ git clone https://github.com/fabric8-analytics/fabric8-analytics-tagger && cd fabric8-analytics-tagger
$ python3 setup.py install  # or make install

== Tagging workflow

=== Collecting keywords - collect

The prerequisite for tagging is to collect keywords that are actually used by developers. This also means that the tagger works with keywords that developers themselves consider interesting.

The collection is done by collectors (available in f8a_tagger/collectors). These collectors gather keywords and count the number of occurrences for each gathered keyword. Collectors do not perform any additional post-processing; they gather raw keywords that are then post-processed by the aggregate command (see below).

An example of raw keywords is link:https://github.com/fabric8-analytics/fabric8-analytics-tags/blob/master/raw/pypi_tags.yaml[the following YAML] file, which keeps keywords gathered from the PyPI ecosystem.

=== Aggregating keywords - aggregate

If you take a look at the raw keywords gathered by the collect command explained above, you can easily spot many keywords that are written in a wrong way (broken encoding, multi-line keywords, numerical values, one-letter keywords, ...). These keywords should be removed, the remaining keywords should be normalized, and, where possible, obvious synonyms can be computed so they are available during the keyword lookup phase.

The aggregate command handles this removal, normalization and synonym computation.

The output of the aggregate command is a single configuration file (JSON or YAML) that keeps the aggregated keyword entries.

An example of keyword entries produced by the aggregate command:

machine-learning:
  occurrence_count: 56
  synonyms:
    - machine learning
    - machinelearning
    - machine-learning
    - machine_learning
django:
  occurrence_count: 2654
  regexp:
    - '.*django.*'
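
To give a rough idea of the kind of clean-up aggregation performs, here is a minimal, hypothetical sketch (not the actual implementation in f8a_tagger): it drops obviously broken raw keywords and derives simple spelling variants as synonyms.

[source,python]
----
# A hypothetical sketch of aggregation-style post-processing; the real logic lives in f8a_tagger.
import re

def normalize(raw_keyword):
    """Return a canonical form of a raw keyword, or None if it should be dropped."""
    keyword = raw_keyword.strip().lower()
    # drop one-letter keywords, purely numerical values and multi-line entries
    if len(keyword) < 2 or "\n" in keyword or keyword.isdigit():
        return None
    # normalize whitespace and underscores to dashes
    return re.sub(r"[\s_]+", "-", keyword)

def derive_synonyms(keyword):
    """Derive obvious spelling variants for a normalized multi-word keyword."""
    if "-" not in keyword:
        return []
    return [keyword.replace("-", " "),
            keyword.replace("-", ""),
            keyword.replace("-", "_")]

print(normalize("Machine Learning"))        # machine-learning
print(derive_synonyms("machine-learning"))  # ['machine learning', 'machinelearning', 'machine_learning']
----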

The keywords.yaml file can, of course, be further edited manually as desired.

An example of an automatically aggregated keywords.yaml can be found in the link:https://github.com/fabric8-analytics/fabric8-analytics-tags/blob/master/pypi_tags.yaml[fabric8-analytics-tags] repository. This keywords.yaml file was computed from the link:https://github.com/fabric8-analytics/fabric8-analytics-tags/blob/master/raw/pypi_tags.yaml[raw keywords collected from PyPI] mentioned above.

=== Keywords lookup - lookup

The overall outcome of the steps above is a single keywords.yaml file. This file, together with a stopwords.txt file holding stopwords, is the input for the lookup command.

The lookup command does the heavy computation needed for keyword extraction. It utilizes link:http://www.nltk.org/[NLTK] for many of the natural language processing tasks.

The overall high-level overview of the lookup command can be described in the following steps:

  1. The first step is pre-processing of input files. Input files can be written in different formats. Besides plaintext, text files using different markup formats (such as Markdown, AsciiDoc, and so on) can also be used.

  2. After input pre-processing, plaintext without any markup formatting is available. This text is then split into sentences. The actual split is done in a smart way (so "This Mr. Baron e.g. Mr. Foo." will be one sentence, not just a split on dots).

  3. Sentences are tokenized into words. This tokenization is done, again, in a smart way ("e.g. isn't" is split into three tokens - "e.g", "is" and "n't").

  4. link:https://en.wikipedia.org/wiki/Lemmatisation[Lemmatization] - all words (tokens) are replaced with their representative form, since words appear in several inflected forms. Lemmatization uses NLTK's WordNet corpus (a large lexical database of English words).

  5. After lemmatization, link:https://en.wikipedia.org/wiki/Stemming[stemming] is performed. Stemming ensures that different words are mapped to their word stem (e.g. "licensing" and "license" map to the same stem). Different stemmers are available; check lookup --help for a listing of all of them. See link:https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html[Stanford's NLP book] for more insights on lemmatization and stemming.

  6. Unwanted tokens are removed - tokens are checked against the stopwords file and matching tokens are dropped. This step makes the lookup faster and removes obviously wrong words that should not be marked as keywords (words with high entropy).

  7. Ngrams are calculated for multi-word keywords by systematically concatenating tokens (e.g. the tokens ["this", "is", "machine", "learning"] with an ngram size of 2 create the following tokens: ["this", "is", "machine", "learning", "this is", "is machine", "machine learning"]). This step allows lookup of multi-word keywords (such as "machine learning"). The actual ngram size (bigrams, trigrams) is determined by the keywords.yaml configuration file (based on synonyms), but can be stated explicitly using the --ngram-size option. A small sketch illustrating steps 3-7 follows this list.

  8. The actual lookup against the keywords.yaml configuration file. The constructed array of tokens (including ngrams) is checked against the keywords.yaml file. The output of this step is an array of keywords found during keyword mining.

  9. The last step scores the found keywords based on their relevance in the system (based on the occurrence count of the found keyword and its occurrence count in the text).
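
The following is a minimal sketch of steps 3-7 using NLTK; it is not the tagger's actual code. The stopword set, the choice of the Porter stemmer and the helper name tokens_with_ngrams are assumptions made for illustration only.

[source,python]
----
# A minimal sketch of steps 3-7 (tokenize, lemmatize, stem, drop stopwords, build ngrams).
# Assumes the required NLTK data (e.g. punkt and wordnet) has been downloaded via nltk.download().
from nltk import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

# Stands in for stopwords.txt; entries are already lemmatized and stemmed,
# which is the form the reckon command computes for you.
STOPWORDS = {"thi", "is"}

def tokens_with_ngrams(sentence, ngram_size=2):
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()

    # steps 3-5: tokenize, then lemmatize and stem every token
    base = [stemmer.stem(lemmatizer.lemmatize(token.lower()))
            for token in word_tokenize(sentence)]
    # step 6: drop stopwords
    base = [token for token in base if token not in STOPWORDS]
    # step 7: add ngrams up to the configured size
    result = list(base)
    for n in range(2, ngram_size + 1):
        result += [" ".join(gram) for gram in ngrams(base, n)]
    return result

print(tokens_with_ngrams("This is machine learning"))
# ['machin', 'learn', 'machin learn'] -- a stemmed form that a stemmed
# "machine learning" keyword entry would match
----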

You can check the output of all steps by running the tagger in debug mode, i.e. by supplying multiple --verbose command line options. In that case the tagger reports which steps are performed, together with their input and outcome. This can also help you debug what is going on when using the tagger.

=== Working with keywords.yaml and stopwords

A few commands are available to make your life easier when working with the keywords database.

==== Using reckon command

This command applies lemmatization and stemming to your keywords.yaml and stopwords.txt files. The output is then printed so you can check the form of keywords and stopwords that will be used during lookup (with respect to lemmatization and stemming).

Check reckon --help for more info on available options.

==== Using diff command

The diff command gives you an overview of what has changed in a keywords.yaml file. It simply prints added synonyms and regular expressions that differ between keywords.yaml files. Missing/added keywords are also reported to help you see changes in your configuration files.
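
As a rough illustration only (not the tagger's implementation), a diff over two parsed keywords.yaml files could look like the sketch below, where old and new are the dictionaries loaded from the two files.

[source,python]
----
# A rough sketch of diffing two parsed keywords.yaml dictionaries.
def diff_keywords(old, new):
    report = {
        "added_keywords": sorted(set(new) - set(old)),
        "missing_keywords": sorted(set(old) - set(new)),
        "changed": {},
    }
    for keyword in set(old) & set(new):
        changes = {}
        for field in ("synonyms", "regexp"):
            before = set(old[keyword].get(field) or [])
            after = set(new[keyword].get(field) or [])
            if before != after:
                changes[field] = {"added": sorted(after - before),
                                  "removed": sorted(before - after)}
        if changes:
            report["changed"][keyword] = changes
    return report
----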

== Configuration files

=== keywords.yaml

The keywords.yaml file keeps all keywords in the following form:

keyword:
  occurrence_count: 42
  synonyms:
    - list
    - of
    - synonyms
  regexp:
    - 'list.*'
    - 'o{1}f{1}'
    - 'regular[ _-]expressions?'

A keyword is a key to a dictionary containing the additional fields occurrence_count, synonyms and regexp, as shown above.

For example, if you would like to define a keyword django that matches all words containing "django", just define:

django:
  occurrence_count: 1339
  regexp:
    - '.*django.*'

Another example demonstrates synonyms. To define IP, IPv4 and IPv6 as synonyms of networking, just define the following entry:

networking:
  synonyms:
    - ip
    - ipv4
    - ipv6

Regular expressions conform to link:https://docs.python.org/3/library/re.html[Python regular expressions].
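
As an illustration of how such an entry could be interpreted, here is a small, hypothetical sketch of matching tokens against a keyword entry. Whether the real lookup uses full or partial regular expression matching is an assumption here, not something taken from the tagger's code.

[source,python]
----
# A hypothetical sketch of matching tokens against a keywords.yaml entry;
# the tagger's real lookup logic lives in f8a_tagger.
import re
import yaml

ENTRIES = yaml.safe_load("""
networking:
  occurrence_count: 17
  synonyms:
    - ip
    - ipv4
    - ipv6
  regexp:
    - '.*network.*'
""")

def matches(keyword, entry, token):
    """Return True if the token names the keyword directly, via a synonym, or via a regexp."""
    if token == keyword or token in (entry.get("synonyms") or []):
        return True
    # full matching is assumed here; partial matching would use re.search instead
    return any(re.fullmatch(pattern, token) for pattern in (entry.get("regexp") or []))

for token in ("ipv6", "networked", "django"):
    for keyword, entry in ENTRIES.items():
        print(token, "->", keyword if matches(keyword, entry, token) else "no match")
----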

=== stopwords.txt

This file contains all stopwords (words that should be left out from text analysis) in raw/plaintext and regular expression format. All stopwords are listed one per line.

An example of a stopwords file containing the stopwords "would", "should" and "are":

would
should
are

Regular expressions that describe stopwords can also be specified.

An example of regular expression stopwords:

re: [0-9][0-9]*
re: https?://[a-zA-Z0-9][a-zA-Z0-9.]*.[a-z]{2,3}

The example above lists two regular expressions that define stopwords. The first one matches stopwords that consist purely of digits (any integer number will be dropped from the textual analysis). The second one filters out URLs (the regexp is simplified).

Regular expressions conform to link:https://docs.python.org/3/library/re.html[Python regular expressions].
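
For illustration, a stopwords file in this format could be parsed roughly as follows. This is a hedged sketch (the function name load_stopwords and the use of full regexp matching are assumptions), not the tagger's actual parser.

[source,python]
----
# A rough sketch of reading a stopwords file in which plain lines are literal
# stopwords and lines starting with "re:" are regular expressions.
import re

def load_stopwords(path="stopwords.txt"):
    words, patterns = set(), []
    with open(path) as stopwords_file:
        for line in stopwords_file:
            line = line.strip()
            if not line:
                continue
            if line.startswith("re:"):
                patterns.append(re.compile(line[len("re:"):].strip()))
            else:
                words.add(line)
    return words, patterns

def is_stopword(token, words, patterns):
    # full matching of regexp stopwords is assumed here
    return token in words or any(pattern.fullmatch(token) for pattern in patterns)
----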

== Development environment

If you would like to set up a virtualenv for your environment, just use the prepared venv Make target:

$ make venv

After this command, a virtual environment should be available; it can be activated using:

$ source venv/bin/activate

And exited using:

$ deactivate

To run checks, issue make check command:

$ make check

The check Make target runs a set of linters provided by link:https://coala.io/[Coala]; pylint and pydocstyle are run as well. To execute only the desired linter, run the appropriate Make target:

$ make coala
$ make pylint
$ make pydocstyle

== Evaluating accuracy

The tagger does not use any machine learning technique to gather keywords. All steps correspond to data mining techniques, so there is no "accuracy" that could be evaluated. The tagger simply checks for important, key words that are relevant (low entropy). The overall quality of the keywords found is therefore determined by the quality of the keywords.yaml file.

== Practices

== README.json

README.json is a format introduced by one task (GitReadmeCollectorTask) present in fabric8-analytics-worker. The structure of the document is described by a single JSON file containing two keys.

== Parsers

Parsers are used to transform README.json files into plaintext files. Their main goal is to remove any markup-specific annotations and provide just plaintext that can be directly used for further text processing.

You can see implementation of parsers in the f8a_tagger/parsers directory.

== Collectors

A set of collectors is also present that collects keywords/topics/tags from various external resources such as PyPI, Maven Central and so on. These collectors produce a list of keywords with their occurrence counts that can later be used for keyword extraction.

All collectors are present under f8a_tagger/collectors package.

== Check for all possible issues

The script named check-all.sh is to be used to check the sources for all detectable errors and issues. This script can be run w/o any arguments:


./check-all.sh

Expected script output:


Running all tests and checkers
Check all BASH scripts OK
Check documentation strings in all Python source file OK
Detect common errors in all Python source file OK
Detect dead code in all Python source file OK
Run Python linter for Python source file OK
Unit tests for this project OK
Done

Overal result OK

An example of script output when one error is detected:


Running all tests and checkers
Check all BASH scripts Error: please look into files check-bashscripts.log and check-bashscripts.err for possible causes
Check documentation strings in all Python source file OK
Detect common errors in all Python source file OK
Detect dead code in all Python source file OK
Run Python linter for Python source file OK
Unit tests for this project OK
Done

Overal result One error detected!

== Coding standards

You can use scripts run-linter.sh and check-docstyle.sh to check if the code follows https://www.python.org/dev/peps/pep-0008/[PEP 8] and https://www.python.org/dev/peps/pep-0257/[PEP 257] coding standards. These scripts can be run w/o any arguments:


./run-linter.sh
./check-docstyle.sh

The first script checks indentation, line lengths, variable names, white space around operators etc. The second script checks all documentation strings - their presence and format. Please fix any warnings and errors reported by these scripts.

The list of directories containing source code that needs to be checked is stored in the file directories.txt.

== Code complexity measurement

The scripts measure-cyclomatic-complexity.sh and measure-maintainability-index.sh are used to measure code complexity. These scripts can be run w/o any arguments:


./measure-cyclomatic-complexity.sh
./measure-maintainability-index.sh

The first script measures cyclomatic complexity of all Python sources found in the repository. Please see https://radon.readthedocs.io/en/latest/commandline.html#the-cc-command[this table] for further explanation how to comprehend the results.

The second script measures maintainability index of all Python sources found in the repository. Please see https://radon.readthedocs.io/en/latest/commandline.html#the-mi-command[the following link] with explanation of this measurement.

You can specify the command line option --fail-on-error if you need to check and use the exit code in your workflow. In this case the script returns 0 when no failures have been found and a non-zero value otherwise.

== Dead code detection

The script detect-dead-code.sh can be used to detect dead code in the repository. This script can be run w/o any arguments:


./detect-dead-code.sh

Please note that due to Python's dynamic nature, static code analyzers are likely to miss some dead code. Also, code that is only called implicitly may be reported as unused.

Because of these potential problems, only code detected with more than 90% confidence is reported.

The list of directories containing source code that needs to be checked is stored in the file directories.txt.

== Common issues detection

The script detect-common-errors.sh can be used to detect common errors in the repository. This script can be run w/o any arguments:


./detect-common-errors.sh

Please note that only semantic problems are reported.

The list of directories containing source code that needs to be checked is stored in the file directories.txt.

== Check for scripts written in BASH

The script named check-bashscripts.sh can be used to check all BASH scripts (in fact: all files with the .sh extension) for various possible issues, incompatibilities, and caveats. This script can be run w/o any arguments:


./check-bashscripts.sh

Please see https://github.com/koalaman/shellcheck[ShellCheck] for further explanation of how ShellCheck works and which issues it can detect.