chaoss / grimoirelab-elk

GNU General Public License v3.0

Welcome to GrimoireELK

GrimoireELK is the component of GrimoireLab that interacts with the ElasticSearch database. Its goal is twofold: first, it offers a convenient way to store the data coming from Perceval; second, it processes and enriches that data into a format that can be consumed by Kibiter.

The Perceval data is stored in ElasticSearch indexes as raw documents (one per item extracted by Perceval). Those raw documents, referred to as "raw data" in this documentation, include all the information coming from the original data source, which allows the platform to perform multiple analyses without downloading the same data over and over again. Once the raw data is retrieved, a new phase starts in which the data is enriched according to the data source it was collected from and stored in ElasticSearch indexes. The enrichment removes information not needed by Kibiter and adds information that is not directly available in the raw data, for instance pair programming information for Git data, time to solve (i.e., close or merge) issues and pull requests for GitHub data, and identity and organization information coming from SortingHat. The enriched data is stored as JSON documents, which embed information linked to the corresponding raw documents to ease debugging and guarantee traceability.
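The raw-to-enriched step described above can be pictured with a minimal sketch. It is not GrimoireELK's actual enricher: the function name enrich_issue, the field names inside "data", and the output field time_to_close_days are all hypothetical and chosen only for illustration. It drops a field the dashboard would not need, computes a value that is not directly present in the raw document, and keeps the uuid and origin so the enriched document can be traced back to its raw counterpart.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400


def enrich_issue(raw_item):
    """Return a trimmed, enriched document built from a raw one.

    Illustrative only: the keys read from raw_item["data"] are
    hypothetical and do not mirror any real data source schema.
    """
    data = raw_item["data"]
    created = datetime.fromisoformat(data["created_at"])
    closed_at = data.get("closed_at")
    closed = datetime.fromisoformat(closed_at) if closed_at else None

    enriched = {
        # Kept so the enriched document can be linked to the raw one.
        "uuid": raw_item["uuid"],
        "origin": raw_item["origin"],
        "title": data["title"],
        "state": "closed" if closed else "open",
        # Derived value that the raw document does not contain.
        "time_to_close_days": None,
    }
    if closed:
        elapsed = (closed - created).total_seconds()
        enriched["time_to_close_days"] = round(elapsed / SECONDS_PER_DAY, 2)
    return enriched


# A made-up raw document; "body" stands for information that the
# enriched document does not need and therefore leaves out.
sample_raw = {
    "uuid": "0000-example",
    "origin": "https://example.com/repo",
    "data": {
        "title": "Example issue",
        "body": "A long description that is dropped during enrichment.",
        "created_at": "2024-01-01T00:00:00+00:00",
        "closed_at": "2024-01-03T12:00:00+00:00",
    },
}

print(enrich_issue(sample_raw))
```

For the sample above, the issue stays open for two and a half days, so time_to_close_days is 2.5, and the "body" field does not appear in the result.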

Raw data

Each raw document stored in an ElasticSearch index contains a set of common first-level fields, regardless of the data source:

Enriched data

Each enriched index includes one or more types of documents, which are summarized below.

Fields

Each enriched document contains a set of fields, which can be (i) common to all data sources (e.g., metadata fields, time field), (ii) specific to the data source, (iii) related to the contributor’s profile information (i.e., identity fields), or (iv) related to the project listed in the Mordred projects.json (i.e., project fields).

Metadata fields

Identity fields

Project fields

Time field:

Demography fields:

Extra fields:

Data source specific fields

Details of the fields of each data source are available in the Schema folder.

Installation

There are several ways to install GrimoireELK on your system: from packages, or from the source code using Poetry or pip.

PyPI

GrimoireELK can be installed using pip, a tool for installing Python packages. To do so, run the following command:

$ pip install grimoire-elk
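To confirm that the installation succeeded, one option is to ask Python for the installed version of the distribution. The sketch below uses only the standard library (importlib.metadata, available since Python 3.8); the helper name installed_version is hypothetical, and the distribution name is the one used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="grimoire-elk"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
if found:
    print(f"grimoire-elk {found} is installed")
else:
    print("grimoire-elk is not installed in this environment")
```

Running it inside the same environment used for pip (or inside the Poetry shell) avoids false negatives caused by checking a different interpreter.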

Source code

To install from the source code you will need to clone the repository first:

$ git clone https://github.com/chaoss/grimoirelab-elk
$ cd grimoirelab-elk

Then use pip or Poetry to install the package along with its dependencies.

Pip

To install the package from the local directory, run the following command:

$ pip install .

In case you are a developer, you should install GrimoireELK in editable mode:

$ pip install -e .

Poetry

We use Poetry for dependency management and packaging. You can install it by following its documentation. Once it is installed, you can install GrimoireELK and its dependencies in a project-isolated environment using:

$ poetry install

To spawn a new shell within the virtual environment, use:

$ poetry shell

Running tests

Tests are located in the tests folder. To run them, you need instances (or Docker containers) of ElasticSearch and MySQL running on your machine.
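Before launching the tests, it can help to check that both services accept connections. The following is a minimal sketch that assumes ElasticSearch and MySQL listen on localhost at their usual default ports (9200 and 3306); adjust the hypothetical SERVICES mapping if your setup differs. It only verifies that a TCP connection can be opened, not that the services are correctly configured.

```python
import socket

# Assumed locations: default ports on the local machine.
SERVICES = {
    "ElasticSearch": ("localhost", 9200),
    "MySQL": ("localhost", 3306),
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


for name, (host, port) in SERVICES.items():
    status = "reachable" if is_port_open(host, port) else "NOT reachable"
    print(f"{name} at {host}:{port} is {status}")
```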

Then you need to:

The full battery of tests can be executed with run_tests.py. However, it is also possible to execute a subset of tests by running individual test files (the test_* files in the tests folder).

The tests can be run in combination with the Python package coverage. The steps below show how to do it:

$ pip3 install coveralls
$ cd <path-to-ELK>/tests
$ python3 -m coverage run --source=grimoire_elk run_tests.py

Coverage will generate a file .coverage in the tests folder, which can be inspected with the following command:

$ cd <path-to-ELK>/tests
$ python3 -m coverage report -m

The output will be similar to the following one:

Name                                                                                                                Stmts   Miss  Cover   Missing
--------------------------------------------------------------------------------------------------------------------------------------------------
.../ELK/grimoire_elk/__init__.py                                                                                       4      0   100%
.../ELK/grimoire_elk/_version.py                                                                                       1      0   100%