GSoC Idea: Boosting data processing in GrimoireLab

valeriocos commented 4 years ago

GrimoireLab produces analytics from data extracted from more than 30 tools used for contributing to Open Source development, such as version control systems, issue trackers and forums. A typical execution of GrimoireLab consists of collecting data from a given repository, processing and enriching the data obtained, and finally visualizing it on dynamic Web dashboards. At the core of this process there is a component called ELK, which is in charge of integrating the data finally shown on the dashboards.

The evolution of GrimoireLab now requires reshaping some of the functionalities provided by ELK to improve its maintainability. This project idea is about refactoring and redesigning the core of ELK using popular libraries for data management and processing such as elasticsearch-py and pandas.

The aims of the project are as follows:

Learning about refactoring software code to improve its functionality.
Understanding the GrimoireLab components (Perceval, ELK, Mordred) and the corresponding tool-chain.
Replacing low-level libraries (e.g., requests) with popular ones used to interact with ElasticSearch.
Enabling the correct working of ELK for different versions of ElasticSearch (>= 6.1) and Open Distro for ElasticSearch (>= 0.9.0).
Reorganizing part of the ELK logic into coherent packages.
Improving the processing of Perceval data.

The aims will require working with Python, ELK and the ElasticSearch database.
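To make the version requirement above concrete, here is a minimal sketch of a version check. The helper names (`parse_version`, `is_supported`, `cluster_version`) are hypothetical and not part of ELK. It also assumes that an ODFE cluster reports the version of the Elasticsearch it bundles in the same `version.number` field, which should be verified against a real ODFE instance.

```python
MIN_SUPPORTED = (6, 1)  # lower bound taken from the aims above


def parse_version(number):
    """Turn a string such as '6.8.1' into a (major, minor) tuple."""
    return tuple(int(part) for part in number.split(".")[:2])


def is_supported(number):
    """Return True when the reported version is at least 6.1."""
    return parse_version(number) >= MIN_SUPPORTED


def cluster_version(es_url):
    """Ask a running cluster for its version (needs elasticsearch-py)."""
    from elasticsearch import Elasticsearch  # imported lazily: optional dependency

    client = Elasticsearch([es_url])
    return client.info()["version"]["number"]


for candidate in ("5.6.16", "6.8.1", "7.2.0"):
    print(candidate, is_supported(candidate))
```

With a running cluster, `is_supported(cluster_version("http://localhost:9200"))` would perform the same check against the live version; the URL is only an example.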

Difficulty: Medium
Requirements: Python programming. Interest in software analytics. Willingness to understand GrimoireLab internals.
Recommended: Experience with ElasticSearch and pandas would be helpful, but both can be learned during the project.
Mentors: @inishchith, @Polaris000, @sduenas, @valeriocos, @zhquan

Microtasks

For becoming familiar with GrimoireLab, you can start by reading some documentation. You can find useful information at:

GrimoireLab Tutorial
Perceval: Software Project Data at Your Will
The GitHub repositories hosting the tools (https://github.com/chaoss/grimoirelab-perceval, https://github.com/chaoss/grimoirelab-sirmordred, https://github.com/chaoss/grimoirelab-...)

Once you're familiar with Grimoirelab, you can have a look at the following microtasks.

Microtask 0: Download PyCharm and get familiar with it (for instance, you can follow this tutorial).
Microtask 1: Set up Perceval to be executed from PyCharm.
Microtask 2: Create a Python script to execute Perceval via its Python interface using the Git and GitHub backends. Feel free to select any target repository.
Microtask 3: Based on the JSON documents produced by Perceval and its source code, try to answer the following questions:
- What is the meaning of the JSON attribute timestamp?
- What is the meaning of the JSON attribute updated_on?
- What is the meaning of the JSON attribute origin?
- What is the meaning of the JSON attribute category?
- How many categories do the Git and GitHub backends have?
- What is the meaning of the JSON attribute uuid?
- What is the meaning of the JSON attribute search_fields?
- What is stored in the attribute data of each JSON document produced by Perceval?
- Identify the code in charge of dealing with remote APIs and explain its logic.
- Which is the folder that stores the archives generated by Perceval?
Microtask 4: Set up a dev environment to work on GrimoireLab. Have a look at https://github.com/chaoss/grimoirelab-sirmordred#setting-up-a-pycharm-dev-environment.
Microtask 5: Execute micro-mordred to collect and enrich data from any Git repository.
Microtask 6: Execute micro-mordred to obtain data from the study enrich_areas_of_code for any Git repository.
Microtask 7: Execute micro-mordred to collect and enrich data from any GitHub repository, making sure that no archives are created by Perceval.
Microtask 8: On your machine, run the tests for ELK within PyCharm. If you succeed, you can try to run the coverage package on the ELK tests and report the details of each file.
Microtask 9: Submit at least one PR to one of the GrimoireLab repositories to fix an issue, improve the documentation, etc.
Microtask 10: Submit a PR to ELK to increase the test coverage of one or more files located at https://github.com/chaoss/grimoirelab-elk/tree/master/grimoire_elk/enriched
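
As a starting point for Microtasks 2 and 3, here is a minimal sketch of calling Perceval from Python and inspecting the metadata of the items it returns. `fetch_git_commits` uses Perceval's Git backend and is not called here because it needs Perceval installed and network access. The `sample` item is hand-written with invented values, and `summarize` is a hypothetical helper, not part of Perceval.

```python
from collections import Counter


def fetch_git_commits(uri, gitpath):
    """Yield Perceval items for a Git repository (requires perceval)."""
    from perceval.backends.core.git import Git  # imported lazily: optional dependency

    repo = Git(uri=uri, gitpath=gitpath)
    yield from repo.fetch()


def summarize(items):
    """Count items per category and collect the metadata keys that appear."""
    categories = Counter()
    keys = set()
    for item in items:
        categories[item["category"]] += 1
        # Everything outside 'data' is metadata added around the original payload.
        keys.update(k for k in item if k != "data")
    return categories, sorted(keys)


# Hand-written stand-in for a Perceval item; all values are invented.
sample = [{
    "backend_name": "Git",
    "category": "commit",
    "origin": "https://example.org/repo.git",
    "timestamp": 1585000000.0,
    "updated_on": 1584000000.0,
    "uuid": "0000",
    "data": {"commit": "abc123", "message": "Initial commit"},
}]

categories, keys = summarize(sample)
print(categories, keys)
```

Against a real repository, the call would look like `summarize(fetch_git_commits(<repository URL>, <local clone path>))`, with both arguments chosen by you.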

heming6666 commented 4 years ago

Thank you @valeriocos for your reply. I have moved the discussion here.

imnitishng commented 4 years ago

Hi @valeriocos, can you tell me which library is meant to be used, as stated here?

Replacing low-level libraries (e.g., requests) with popular ones used to interact with ElasticSearch.

Is it the elasticsearch-py library or something like the httpx library?

valeriocos commented 4 years ago

The idea is to use elasticsearch-py, however if you find other good candidates don't hesitate to add them to your proposal and/or share them here. Thanks!

imnitishng commented 4 years ago

But I've seen elasticsearch-py and elasticsearch-dsl already being used in the project.

Replacing low-level libraries (e.g., requests) with popular ones used to interact with ElasticSearch.

And since it mentions requests, maybe we need a better, faster asynchronous HTTP library like aiohttp? Is that not the objective?

valeriocos commented 4 years ago

But I've seen elasticsearch-py and elasticsearch-dsl already being used in the project.

That's true, but there are still some parts of the code that rely on requests. One of the goals of this idea is to reduce the logic that interacts with ElasticSearch. Good candidates are elasticsearch-py and elasticsearch-dsl because they already provide some high-level methods (that remove boilerplate code).

And since it mentions requests, maybe we need a better, faster asynchronous HTTP library like aiohttp? Is that not the objective?

Any library (aiohttp or another one) can be a good candidate if it performs better than requests (in the specific case of GrimoireELK) and/or allows writing clean code to interact with ElasticSearch.
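
To illustrate the kind of boilerplate that a client library removes, here is a minimal sketch comparing a hand-built HTTP call with the equivalent elasticsearch-py call, plus a bulk upload through `elasticsearch.helpers.bulk`. The function names and the index name are hypothetical and this is not ELK's current code. The functions that talk to a cluster are only defined, not called, since they need a running ElasticSearch instance.

```python
def count_with_requests(es_url, index):
    """Low-level version: build the URL and decode the JSON by hand."""
    import requests

    res = requests.get(f"{es_url}/{index}/_count")
    res.raise_for_status()
    return res.json()["count"]


def count_with_client(es_url, index):
    """Same query through elasticsearch-py, which builds the request itself."""
    from elasticsearch import Elasticsearch

    client = Elasticsearch([es_url])
    return client.count(index=index)["count"]


def to_bulk_actions(items, index):
    """Turn Perceval-like items into actions understood by helpers.bulk."""
    for item in items:
        # Assumption: the item's uuid is used as the document id.
        yield {"_index": index, "_id": item["uuid"], "_source": item}


def upload_with_client(es_url, index, items):
    """Upload the items in bulk and return how many were indexed."""
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch([es_url])
    indexed, _errors = helpers.bulk(client, to_bulk_actions(items, index))
    return indexed


# The action builder is pure Python, so it can be checked without a cluster.
actions = list(to_bulk_actions([{"uuid": "abc", "origin": "x"}], "git_raw"))
print(actions)
```

Depending on the ElasticSearch version, the actions may also need a `_type` entry; that detail is left out of this sketch.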

imnitishng commented 4 years ago

Hi, I also wanted to know about the ODFE implementation. Everywhere throughout GrimoireLab I've seen Elasticsearch version 6.1 being used.

Enabling the correct working of ELK for different versions of ElasticSearch (>= 6.1) and Open Distro for ElasticSearch (>= 0.9.0).

The project requires ODFE to work, so I wanted to know what the current progress with Elasticsearch and ODFE is. This PR adds support for using ODFE 1.2.0 without Kibiter panels. So do we need to fix these panels and other issues so that Elasticsearch 7.2.0 and ODFE 1.2.0 work correctly, or do we start over with a new approach to support ODFE 0.10.0 with Elasticsearch 6.8.1?

One of the goals of this idea is to reduce the logic that interacts with ElasticSearch. Good candidates are elasticsearch-py and elasticsearch-dsl because they already provide some high-level methods (that remove boilerplate code).

Okay, I get it. Thank You.

As far as I have read, aiohttp should perform better, but I'm afraid the code won't be as clean as it is now, because the speed comes at the cost of more lines of code.

valeriocos commented 4 years ago

Hi, I also wanted to know about the ODFE implementation. Everywhere throughout GrimoireLab I've seen Elasticsearch version 6.1 being used.

Hi @imnitishng ! ODFE is supported by ELK (https://github.com/chaoss/grimoirelab-elk/blob/master/.travis.yml#L123), but not by panels.

The project requires ODFE to work, so I wanted to know what the current progress with Elasticsearch and ODFE is. This PR adds support for using ODFE 1.2.0 without Kibiter panels, so what exactly does this require? Do I need to fix these panels and other issues so that Elasticsearch 7.2.0 and ODFE 1.2.0 work correctly, or do we start over with a new approach to support ODFE 0.10.0 with Elasticsearch 6.8.1?

This is something that should be discussed/evaluated at the beginning of the internship. ATM, there are 2 possible approaches to complete the integration with ODFE.

The first one is to move all the panels management to a different component. This means that ELK and Mordred should be refactored to remove all the code dealing with aliases and panels upload. In this scenario, ELK would become a fast processing library on top of ElasticSearch DBs. The second approach (which is the one you pointed out) is to modify Kidash to make sure that the panels can be uploaded to ODFE (this also implies fixing other issues that may pop up when migrating the panels).

Okay, I get it. Thank You.

You're welcome!

As far as I have read, aiohttp should perform better, but I'm afraid the code won't be as clean as it is now, because the speed comes at the cost of more lines of code.

I see, how much faster does it perform wrt requests?

imnitishng commented 4 years ago

This is something that should be discussed/evaluated at the beginning of the internship.

Okay thank you very much.

Unlike requests, aiohttp can send a series of requests without waiting for the first reply to come back before sending the next one, and it also includes many decoding optimizations. The results below were obtained by sending requests to httpbin.org.

requests with session called: 11.22s
aiohttp called: 1.19s
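
The difference between waiting for each reply and overlapping the waits can be reproduced without any network access. The sketch below is not the benchmark that produced the numbers above: it replaces the HTTP call with `asyncio.sleep`, and the latency value and function names are assumptions chosen for illustration.

```python
import asyncio
import time

LATENCY = 0.05  # simulated round-trip time per request, in seconds


async def fake_request(i):
    """Stand-in for an HTTP call: wait for the 'network', return a payload."""
    await asyncio.sleep(LATENCY)
    return i


async def sequential(n):
    """Send each request only after the previous reply has arrived."""
    return [await fake_request(i) for i in range(n)]


async def concurrent(n):
    """Send all requests up front and wait for the replies together."""
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def timed(coro_fn, n):
    """Run a coroutine function and return its result and elapsed time."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return list(result), time.perf_counter() - start


seq_result, seq_time = timed(sequential, 10)
con_result, con_time = timed(concurrent, 10)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential run takes roughly ten times the simulated latency, while the concurrent run takes roughly one. Real gains depend on the server and on how many connections it accepts.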

kshitij3199 commented 4 years ago

Hi @valeriocos, for the GSoC proposal, can you please tell us what we have to mention apart from the microtasks?

Do we have to discuss the libraries that we will use to interact with ElasticSearch and other things, like how we will improve the processing of Perceval data?

valeriocos commented 4 years ago

Hi @kshitij3199 !

Do we have to discuss the libraries that we will use to interact with ElasticSearch and other things, like how we will improve the processing of Perceval data?

Yes, the proposal should include the libraries/technologies you would like to use and a plan (with a timeline of actions/tasks) to achieve the goals of the project. For instance:

understand ELK's architecture
identify where the library requests is used
replace the use of the library requests in the ELK module XYZ with
identify where the data is processed
replace the data processing call with
identify where aliases are set/used
...

Let me know if this answers your question, thanks!

kshitij3199 commented 4 years ago

Thank you @valeriocos, I will upload my GSoC proposal soon (so that we can discuss and update it if required).

valeriocos commented 4 years ago

You're welcome @kshitij3199

imnitishng commented 4 years ago

Hi @valeriocos. I'd like to know a bit more about these objectives. Can you explain these in more detail? I don't seem to get these now.

Reorganizing part of the ELK logic into coherent packages.
Improving the processing of Perceval data.

valeriocos commented 4 years ago

Hi @imnitishng ! Yes, sure

Reorganizing part of the ELK logic into coherent packages.

ELK does many things in the same module. Let's take as an example the gitlab enricher (https://github.com/chaoss/grimoirelab-elk/blob/master/grimoire_elk/enriched/gitlab.py). As you can see, we have methods to:

deal with identities (e.g., get_identities, get_item_sh)
derive additional info (e.g., __add_milestone_info)
add metadata fields (e.g., get_grimoire_fields)
deal with studies[*] (e.g., enrich_onion)

A possible idea is to evaluate how the logic above can be reorganized in different modules to ease the understanding and the evolution of ELK.

[*] a study is new information derived from existing indexes and (i) added to an existing index or (ii) stored in a new index.
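
As one possible reading of the reorganization idea, here is a minimal sketch in which identity handling and metadata fields live in their own classes and the enricher only composes them. All class names are hypothetical, the item layout is a simplified stand-in for a GitLab issue, and the real ELK methods mentioned above do considerably more.

```python
class IdentityExtractor:
    """Identity-related logic (the role of get_identities / get_item_sh)."""

    def identities(self, item):
        user = item["data"].get("author") or {}
        if user:
            yield {"username": user.get("username"), "name": user.get("name")}


class MetadataBuilder:
    """Common metadata fields (the role of get_grimoire_fields)."""

    def fields(self, item):
        return {"origin": item["origin"], "uuid": item["uuid"]}


class IssueEnricher:
    """Thin enricher that delegates to the helpers above."""

    def __init__(self, identities=None, metadata=None):
        self.identities = identities or IdentityExtractor()
        self.metadata = metadata or MetadataBuilder()

    def get_rich_item(self, item):
        rich = self.metadata.fields(item)
        rich["title"] = item["data"].get("title")
        rich["authors"] = [i["username"] for i in self.identities.identities(item)]
        return rich


# Invented item, shaped loosely like a Perceval GitLab issue.
item = {
    "origin": "https://gitlab.example.org/group/project",
    "uuid": "0000",
    "data": {"title": "Fix typo", "author": {"username": "jdoe", "name": "J. Doe"}},
}
rich = IssueEnricher().get_rich_item(item)
print(rich)
```

Each helper can then be tested and evolved on its own, and a study module could follow the same pattern.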

Improving the processing of Perceval data.

ELK creates enriched data by processing each Perceval document via the method get_rich_item (present in each enricher). At the same time, in some cases ELK relies on cereslib to create study data. Cereslib manipulates the data with pandas, a popular data processing library.

A possible idea is to evaluate (i) if/how the approach implemented in cereslib can be extended to the creation of the enriched data and (ii) the use of pandas (or other similar libraries) to create enriched data.
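
To give an idea of point (ii), here is a minimal sketch that builds enriched fields for a batch of items with pandas instead of one item at a time. The raw items are hand-written and only loosely follow the shape of Perceval Git items. The output column names and the `enrich` function are assumptions, not the fields ELK actually produces.

```python
import pandas as pd

# Invented items, shaped loosely like Perceval Git commits.
raw_items = [
    {"uuid": "1", "origin": "https://example.org/repo.git", "updated_on": 1584000000.0,
     "data": {"commit": "abc123", "Author": "Jane Doe <jane@example.org>",
              "files": [{"file": "a.py"}, {"file": "b.py"}]}},
    {"uuid": "2", "origin": "https://example.org/repo.git", "updated_on": 1585000000.0,
     "data": {"commit": "def456", "Author": "John Roe <john@example.org>",
              "files": [{"file": "c.py"}]}},
]


def enrich(items):
    """Derive a flat table of enriched fields from nested raw items."""
    df = pd.json_normalize(items)  # nested dicts become columns such as data.Author
    return pd.DataFrame({
        "uuid": df["uuid"],
        "origin": df["origin"],
        "hash": df["data.commit"],
        # Drop the trailing '<email>' part to keep only the name.
        "author_name": df["data.Author"].str.replace(r"\s*<.*>$", "", regex=True),
        # Epoch seconds to timezone-aware timestamps.
        "updated_on": pd.to_datetime(df["updated_on"], unit="s", utc=True),
        # Number of files touched by each commit.
        "files": df["data.files"].str.len(),
    })


rich = enrich(raw_items)
print(rich)
```

Whether this is faster or clearer than the per-item get_rich_item approach would need to be measured on real indexes.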

Let me know if this solves your doubts, thanks!

imnitishng commented 4 years ago

Thank you so much @valeriocos, that helped.

kshitij3199 commented 4 years ago

Hi @valeriocos For the following aim, what I think we can do is

Replacing low-level libraries (e.g., requests) with popular ones used to interact with ElasticSearch.

Replace the requests library with elasticsearch-dsl in grimoirelab-elk (elasticsearch-dsl and elasticsearch-py are both Python API clients for Elasticsearch, but I think it is more convenient to write queries in elasticsearch-dsl than in elasticsearch-py).

Reorganizing part of the ELK logic into coherent packages.

Identity-related and study-related methods present in the Enrich class should be moved to different modules, because they need some methods that the Enrich class itself does not need and they also increase the number of lines of code in the class. So it would be better to have separate modules for them.

A question

Enabling the correct working of ELK for different versions of ElasticSearch (>= 6.1) and Open Distro for ElasticSearch (>= 0.9.0).

According to the Open Distro docs, it provides features like Elasticsearch, Kibana, security, alerting, SQL, etc. So are we using Open Distro as a plugin?

valeriocos commented 4 years ago

According to the Open Distro docs, it provides features like Elasticsearch, Kibana, security, alerting, SQL, etc. So are we using Open Distro as a plugin?

OpenDistro builds on ElasticSearch and adds some additional features.

The initial goal is to make sure that ELK and possibly GrimoireLab can work with OpenDistro (in particular with its elasticsearch and kibana). Alerting and other features available in OpenDistro (but not in ElasticSearch) can be evaluated during the internship.

kshitij3199 commented 4 years ago

Hi @valeriocos, I am getting a bit confused with the Open Distro part.

Basically, GrimoireLab works as follows: 1) data is obtained from a data source (like Git or GitHub) using Perceval; 2) GrimoireELK stores this data as raw indexes, then processes it and produces enriched indexes (with the help of SortingHat, etc.); 3) these enriched indexes are passed to Kibiter for visualization.

So do we want Open Distro to work with GrimoireELK in order to produce enriched indexes?

valeriocos commented 4 years ago

Basically, GrimoireLab works as follows ...

Yes

So do we want Open Distro to work with GrimoireELK in order to produce enriched indexes?

Yes

@kshitij3199 we can talk on IRC tomorrow about the doubts you have about ODFE, WDYT?

kshitij3199 commented 4 years ago

Yes, sure. @valeriocos, can you please tell me the time?

kshitij3199 commented 4 years ago

one more thing @valeriocos

The initial goal is to make sure that ELK and possibly GrimoireLab can work with OpenDistro (in particular with its elasticsearch and kibana)

Why are we saying that OpenDistro should work with Kibana? I think Open Distro should work with GrimoireELK, and the task is to produce enriched indexes which can later be fed into Kibana.

I mean, is there no connection between Open Distro and Kibana?

valeriocos commented 4 years ago

https://github.com/chaoss/grimoirelab/issues/285#issuecomment-603336536

I'm available tomorrow from 10h30 until 17h30 (Madrid, Spain). Please pick the timeframe that best suits you.

Why are we saying that OpenDistro should work with Kibana? I think Open Distro should work with GrimoireELK, and the task is to produce enriched indexes which can later be fed into Kibana.

I agree with you that GrimoireELK should produce the enriched indexes. In the current implementation, these indexes are consumed by some dashboards, which are automatically uploaded by GrimoireLab to Kibiter (a downstream of Kibana). It is important to make sure that these dashboards are uploaded even when using the Kibana version of ODFE.

Some info is available at https://github.com/chaoss/grimoirelab/issues/285#issuecomment-602255449

Let me know if this answers your doubts, thanks

kshitij3199 commented 4 years ago

Thank you @valeriocos, it answers my doubts.

valeriocos commented 4 years ago

you're welcome @kshitij3199 ! Tomorrow I'll be on IRC, in case you want to discuss something.

kshitij3199 commented 4 years ago

Thank you @valeriocos, if something comes up that needs to be discussed, I will message you on IRC.

kshitij3199 commented 4 years ago

Hi @valeriocos, (I tried looking for you on IRC but couldn't find you, maybe it was the wrong time)

Improving the processing of Perceval data.

For this aim, are we expected to rewrite/modify studies using cereslib, just like the areas of code and onion studies, or is something different expected?

valeriocos commented 4 years ago

Sorry @kshitij3199 , I didn't see this notification!

For this aim, are we expected to rewrite/modify studies using cereslib, just like the areas of code and onion studies, or is something different expected?

The first option is the preferred one.

kshitij3199 commented 4 years ago

Hi @valeriocos, for example, in studies like enrich_demography, enrich_forest_activity, enrich_feelings, etc., we have to:

1) Make sure that all uses of the requests library are replaced by elasticsearch-py or elasticsearch-dsl.

Improving the processing of Perceval data.

2) (This is the one that is confusing me.) For the studies mentioned above, we have to get the data from Elasticsearch and store it in a dataframe, enrich the data with pandas using the same logic as before (using functions from cereslib where possible), and store the enriched data back in Elasticsearch. Is this the approach we should follow? If I am wrong, can you please describe what the approach/procedure should be? Thank you.

valeriocos commented 4 years ago

Hi @kshitij3199, thank you for the time you have spent understanding and exploring the problems of this idea!

1) Make sure that all uses of the requests library are replaced by elasticsearch-py or elasticsearch-dsl.

Yes. If you notice, the code that implements the studies consists of a common part to read and process the items. This part could be generalized (moved maybe to cereslib or a new ELK module) and rewritten with elasticsearch-py or elasticsearch-dsl.

(This is the one that is confusing me.)

Yes, the approach you described is the one that should be followed.
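
The read, enrich and write-back flow confirmed above can be sketched as follows. This is not the existing study code: the field names (`author_uuid`, `grimoire_creation_date`, `demography_min_date`, `demography_max_date`) are assumed to match what a demography-style study works with, and `read_items` / `write_items` are hypothetical callables that stand in for the ElasticSearch I/O.

```python
import pandas as pd


def add_demography(items):
    """Add each author's first and last activity date to every item."""
    df = pd.DataFrame(items)
    dates = pd.to_datetime(df["grimoire_creation_date"], utc=True)
    by_author = dates.groupby(df["author_uuid"])
    # transform() keeps one value per row, so the result aligns with df.
    df["demography_min_date"] = by_author.transform("min").map(lambda ts: ts.isoformat())
    df["demography_max_date"] = by_author.transform("max").map(lambda ts: ts.isoformat())
    return df


def run_study(read_items, write_items):
    """Read items, enrich them in a dataframe, and hand them to the writer."""
    enriched = add_demography(list(read_items()))
    write_items(enriched.to_dict(orient="records"))
    return len(enriched)


# In-memory stand-ins for the index; all values are invented.
sample = [
    {"uuid": "1", "author_uuid": "a1", "grimoire_creation_date": "2020-01-01T00:00:00+00:00"},
    {"uuid": "2", "author_uuid": "a1", "grimoire_creation_date": "2020-03-01T00:00:00+00:00"},
    {"uuid": "3", "author_uuid": "a2", "grimoire_creation_date": "2020-02-01T00:00:00+00:00"},
]
stored = []
processed = run_study(lambda: sample, stored.extend)
print(processed, stored[0])
```

In a real setting, `read_items` could wrap `elasticsearch.helpers.scan` and `write_items` could wrap `elasticsearch.helpers.bulk`, which keeps the pandas logic testable without a cluster.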

Thanks!

kshitij3199 commented 4 years ago

Thank you @valeriocos,

I was thinking that

Identity-related and study-related methods present in the Enrich class should be moved to different modules

so when we create a new module for the study-related methods, we can take care of the following:

This part could be generalized (moved maybe to cereslib or a new ELK module) and rewritten with elasticsearch-py or dsl.

So does it mean that all 3 tasks have to be done at the same time (when we are creating the new module for the study-related methods)?

1) Replacing low-level libraries (e.g., requests) with popular ones used to interact with ElasticSearch.
2) Reorganizing part of the ELK logic into coherent packages.
3) Improving the processing of Perceval data.

valeriocos commented 4 years ago

So does it mean that all 3 tasks have to be done at the same time (when we are creating the new module for the study-related methods)?

They can be split into sub-tasks and the integration into ELK can be done incrementally. Does it answer your question, @kshitij3199?

kshitij3199 commented 4 years ago

Thank you @valeriocos, it answers my question.

I agree with you that GrimoireELK should produce the enriched indexes. In the current implementation, these indexes are consumed by some dashboards, which are automatically uploaded by GrimoireLab to Kibiter (a downstream of Kibana). It is important to make sure that these dashboards are uploaded even when using the Kibana version of ODFE.

Can you please tell me what the dashboards are in "these indexes are consumed by some dashboards"? Are they visualizations like different graphs and charts? And how does GrimoireLab upload them to Kibiter?
So when we are using ODFE, is it necessary to use its Kibana version? Can't we use Kibiter? If we use Kibiter, can you tell what things we will have to fix?

valeriocos commented 4 years ago

You're welcome!

Can you please tell me what the dashboards are in "these indexes are consumed by some dashboards"?

The dashboards are available at https://github.com/chaoss/grimoirelab-sigils

Are they visualizations like different graphs and charts?

Yes, for instance pie charts, bar charts, tables, and so on.

And how does GrimoireLab upload them to Kibiter?

This is done by task_panels (https://github.com/chaoss/grimoirelab-sirmordred/blob/master/sirmordred/task_panels.py). Under the hood, it calls kidash, which is in charge of taking the dashboards and index patterns (see the sigils repo) and saving them in the index .kibana.

So when we are using ODFE, is it necessary to use its Kibana version? Can't we use Kibiter?

ATM we don't have a Kibiter version for ODFE. However, consider that Kibiter is a downstream of Kibana that adds some additional plugins to it. Thus, there is basically no difference between Kibiter and Kibana in terms of common functionalities (how to create a dashboard, a visualization, and so on).

If we use kibiter, can you tell what things we will have to fix?

The same thing that we will have to fix if we use Kibana >= 7.x (not sure if this also applies to Kibana >= 6.8), which is the way kidash uploads the dashboards/index patterns to the .kibana index. However, please note that this fix should be evaluated as commented at https://github.com/chaoss/grimoirelab/issues/285#issuecomment-602255449

kshitij3199 commented 4 years ago

Thank you @valeriocos for the detailed answer.

valeriocos commented 4 years ago

you're welcome @kshitij3199 !

imnitishng commented 4 years ago

Hi @valeriocos, can we connect on IRC for some time? Please tell me when you're free.

valeriocos commented 4 years ago

Hi @imnitishng, sure! Would 15h00 Madrid time (around 1h30 from now) be OK for you?

imnitishng commented 4 years ago

Yea sure.

kshitij3199 commented 4 years ago

Hi @valeriocos , This is my first draft of GSoC proposal.

Please have a look at it and let me know what needs to be corrected. Thank you.

imnitishng commented 4 years ago

@valeriocos I have shared the draft proposal with the organization, please have a look.

valeriocos commented 4 years ago

Hi @imnitishng , can you share the link to the draft?

kshitij3199 commented 4 years ago

Hi @valeriocos, can you please tell me what needs to be changed in my first draft of the GSoC proposal?

Hi @valeriocos , This is my first draft of GSoC proposal.

Please have a look at it and let me know what needs to be corrected. Thank you.

imnitishng commented 4 years ago

Oh okay sure @valeriocos https://docs.google.com/document/d/1_9WaTWfe_qKmKcdbusWpbkJ4Wk7xIxmXNReedKqSvZg/edit?usp=drivesdk

kshitij3199 commented 4 years ago

Hi @valeriocos, can you please tell me when you will be on IRC? I want to discuss a few things.

valeriocos commented 4 years ago

Hi @kshitij3199, would in 20 minutes (12h30 Madrid time) be OK?

kshitij3199 commented 4 years ago

Yes, it's fine.

inishchith commented 4 years ago

@kshitij3199 @imnitishng

I tried opening your draft proposal links in order to review them, but I guess it's restricted. Please do let me know once you have granted public access.

/cc @valeriocos

imnitishng commented 4 years ago

I'm really sorry, might have been a mistake. I've made it public again. Please have a look! Thank You @inishchith !

inishchith commented 4 years ago

@imnitishng No worries.

I'll have a look at it

GeorgLink commented 4 years ago

Should I submit a proposal via the GSoC portal?

Yes.

Proposals must be submitted in two places for CHAOSS: The GSoC portal and our interest page.
