chaoss / grimoirelab

GrimoireLab: platform for software development analytics and insights
https://chaoss.github.io/grimoirelab/
GNU General Public License v3.0

GSoC Idea: Boosting data processing in GrimoireLab #285

Closed valeriocos closed 4 years ago

valeriocos commented 4 years ago

GrimoireLab produces analytics from data extracted from more than 30 tools used for contributing to Open Source development, such as version control systems, issue trackers and forums. A common execution of GrimoireLab consists of collecting data from a given repository, processing and enriching the data obtained, and finally visualizing it on dynamic Web dashboards. At the core of this process is a component called ELK, which is in charge of integrating the data finally shown on the dashboards.

The evolution of GrimoireLab now requires reshaping some of the functionality provided by ELK to improve its maintainability. This project idea is about refactoring and redesigning the core of ELK using popular libraries for data management and processing such as elasticsearch-py and pandas.

The aims of the project are as follows:

The aims will require working with Python, ELK and the ElasticSearch database.

Microtasks

To become familiar with GrimoireLab, you can start by reading some documentation. You can find useful information at:

Once you're familiar with GrimoireLab, you can have a look at the following microtasks.

valeriocos commented 4 years ago

https://github.com/chaoss/grimoirelab/issues/285#issuecomment-596135277

@snack0verflow, sorry, it seems there is a problem with the meaning of the enriched field time_to_merge_request_response. The description at https://github.com/chaoss/grimoirelab-elk/blob/master/schema/github_pull_requests.csv says that it is the time to merge a pull request in days. However, looking at the code (as you pointed out), the meaning is basically the time to first attention on the pull request, which is the difference between the creation date and the first comment by someone who didn't submit the PR. If the explanation makes sense, could you please submit a PR to fix the description in the schema?

The time to merge you refer to should be the code_merge_duration (https://github.com/chaoss/grimoirelab-elk/blob/13982d4024ffcaf2d0e42a26194ec4725619fed8/grimoire_elk/enriched/github.py#L507) or time_to_close_days (https://github.com/chaoss/grimoirelab-elk/blob/13982d4024ffcaf2d0e42a26194ec4725619fed8/grimoire_elk/enriched/github.py#L439).

Please have a look at the schema for the GitHub pull request data (https://github.com/chaoss/grimoirelab-elk/blob/master/schema/github_pull_requests.csv). There, you will find a description for each enriched field. If you find any inaccuracies, please report them and we can evaluate them together. Thanks!
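
The "time to first attention" meaning described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in grimoire_elk: the function name, the `author` key and the sample data are hypothetical, and the timestamps are assumed to use the ISO format returned by the GitHub API.

```python
from datetime import datetime

# Assumption: timestamps look like "2020-03-01T10:00:00Z"
GITHUB_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def days_to_first_attention(created_at, submitter, comments):
    """Days between PR creation and the first comment by someone other than the submitter.

    Returns None when nobody else has commented yet.
    """
    created = datetime.strptime(created_at, GITHUB_TIME_FORMAT)
    other_dates = [
        datetime.strptime(comment["created_at"], GITHUB_TIME_FORMAT)
        for comment in comments
        if comment["author"] != submitter
    ]
    if not other_dates:
        return None
    return (min(other_dates) - created).total_seconds() / 86400


# Hypothetical pull request data, only for illustration
comments = [
    {"author": "alice", "created_at": "2020-03-01T12:00:00Z"},  # the submitter's own comment is ignored
    {"author": "bob", "created_at": "2020-03-02T22:00:00Z"},
    {"author": "carol", "created_at": "2020-03-05T10:00:00Z"},
]
first_attention = days_to_first_attention("2020-03-01T10:00:00Z", "alice", comments)
print(first_attention)
```

Note that this value says nothing about when the pull request was merged or closed, which is why the schema description and the field behaviour disagree.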

valeriocos commented 4 years ago

https://github.com/chaoss/grimoirelab/issues/285#issuecomment-596135277

@snack0verflow thank you for pointing this out!

The tests in Travis (e.g., https://github.com/chaoss/grimoirelab-elk/blob/master/.travis.yml#L31) are executed only on https://github.com/chaoss/grimoirelab-elk/tree/master/grimoire_elk (e.g., https://coveralls.io/builds/29180665). The documentation at https://github.com/chaoss/grimoirelab-elk#running-tests should do the same, but neither the param --source=grimoire_elk nor --include=grimoire_elk limits the info in the file .coverage. I tried with --source=*grimoire_elk* and --include=*grimoire_elk* and that doesn't work either (feel free to investigate more). Nevertheless, running report -m --include=*grimoire_elk* produces only the info for the grimoire_elk package.

It would be great if you could submit a PR to update the doc at https://github.com/chaoss/grimoirelab-elk#running-tests. The changes in the PR could be the ones below, but feel free to propose other ones.

WDYT?

Thanks

snack0verflow commented 4 years ago

@valeriocos Sorry for the late reply, but what do you mean by

(the image should be updated accordingly).

Thanks

EDIT: Oh you mean the screenshot

valeriocos commented 4 years ago

No worries,

EDIT: Oh you mean the screenshot

Sorry for not being precise, yes :)

kshitij3199 commented 4 years ago

Respected Mentors @valeriocos, @Polaris000, @inishchith, @sduenas, @zhquan, I am currently doing Microtask 2

Microtask 2: Create a Python script to execute Perceval via its Python interface using the Git and GitHub backends. Feel free to select any target repository.

I have executed Perceval via its Python interface using the Git backend, but I am facing an issue when using the GitHub backend. I have selected my own repo as the target repo.

Issue: When I call the fetch method (category='issue') to calculate the total number of issues in a repo, it does not give the correct number of issues. I am thinking that it is giving me sum of issues and pull requests, instead of only issues. I say so because when I added one more pull request to my repo and ran the code again, the number of issues increased by one, which should not happen. The fetch method with category="pull_request" is working fine.

This is my code segment

# Call the fetch method to get information from the GitHub repo and count the issues
from datetime import datetime

from perceval.backends.core.github import GitHub

import config  # assumed local module holding the API token (not shown in this comment)

REPOSITORY_NAME = "DSA_LAB"
github_backend = GitHub(owner="kshitij3199", api_token=[config.info["API_Token"]], repository=REPOSITORY_NAME)
from_date = datetime(2020, 1, 1)
to_date = datetime(2020, 3, 10)
range_issues = github_backend.fetch(category='issue', from_date=from_date, to_date=to_date)
range_issues_list = list(range_issues)
n_issues = len(range_issues_list)
print("NUMBER OF ISSUES: ", n_issues)

Screenshot from 2020-03-10 09-22-50

The total number of issues shown is 4, but there are only 1 issue and 3 pull requests. When I add one more pull request, the total number of issues increases by one.

Screenshot from 2020-03-10 10-02-47

The total is shown as 5, but there are only 1 issue and 4 pull requests.

vchrombie commented 4 years ago

https://github.com/chaoss/grimoirelab/issues/285#issuecomment-596825820

Hi @kshitij3199

I am thinking that it is giving me sum of issues and pull requests, instead of only issues

You are absolutely correct. In GitHub, every pull request is also considered an issue. You can read more about it here: https://developer.github.com/v3/pulls/#labels-assignees-and-milestones.
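
To count only plain issues, one option is to check for the 'pull_request' key in each item's data, the same check used in the Perceval tutorial snippet quoted later in this thread. A minimal sketch, where the helper name and the sample items are hypothetical stand-ins for what fetch(category='issue') yields:

```python
def split_issues_and_pulls(items):
    """Separate Perceval GitHub items into plain issues and pull requests."""
    issues, pulls = [], []
    for item in items:
        # Items that are pull requests carry a 'pull_request' entry in their data
        if 'pull_request' in item['data']:
            pulls.append(item)
        else:
            issues.append(item)
    return issues, pulls


# Hypothetical items mimicking a repo with 1 issue and 3 pull requests
sample_items = [
    {'data': {'number': 1}},
    {'data': {'number': 2, 'pull_request': {}}},
    {'data': {'number': 3, 'pull_request': {}}},
    {'data': {'number': 4, 'pull_request': {}}},
]

issues, pulls = split_issues_and_pulls(sample_items)
print("NUMBER OF ISSUES:", len(issues))
print("NUMBER OF PULL REQUESTS:", len(pulls))
```

With a real backend, the list passed in would be `github_backend.fetch(category='issue', ...)` instead of `sample_items`.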

I hope this helps you. :slightly_smiling_face:

valeriocos commented 4 years ago

Thank you @vchrombie for the clarification

kshitij3199 commented 4 years ago

Thank you very much @vchrombie for your answer. Now I can start with Microtask 3 :slightly_smiling_face:

kshitij3199 commented 4 years ago

Respected Mentors,

In the GitHub backend for Perceval, the api_token parameter accepts a list of tokens, as mentioned in chaoss/grimoirelab-perceval#546, but I think this is not mentioned in the Perceval docs. Should I send a PR to add this information to the Perceval docs?

valeriocos commented 4 years ago

Hi @kshitij3199 , thank you for raising this question. The Perceval docs are automatically updated, and there is already an issue open about it (https://github.com/chaoss/grimoirelab-perceval/issues/625). If you want, you could send a PR to improve the doc at https://chaoss.github.io/grimoirelab-tutorial/perceval/github.html#retrieving-from-a-python-script, WDYT?

kshitij3199 commented 4 years ago

Hi @valeriocos, in the doc at https://chaoss.github.io/grimoirelab-tutorial/perceval/github.html#retrieving-from-a-python-script the details regarding api_token are already mentioned:

Include the token in a list, api_token=[“XXXXXX”, “XXXXXX”, …..] as it is possiblity to pass a list of tokens to get over rate limits. To run this script, just run (of course, substituting “XXXXX” for your token):

If I find other things to improve in the docs, I will definitely send a PR. Thank you for your time.

valeriocos commented 4 years ago

Sorry for not being precise, I was referring to improving the snippet of code there:

#! /usr/bin/env python3

import argparse

from perceval.backends.core.github import GitHub

# Parse command line arguments
parser = argparse.ArgumentParser(
    description = "Simple parser for GitHub issues and pull requests"
    )
parser.add_argument("-t", "--token",
                    '--nargs', nargs='+',
                    help = "GitHub token")  # <------- "GitHub tokens"
parser.add_argument("-r", "--repo",
                    help = "GitHub repository, as 'owner/repo'")
args = parser.parse_args()

# Owner and repository names
(owner, repo) = args.repo.split('/')

# create a Git object, pointing to repo_url, using repo_dir for cloning <----- # create a GitHub object, passing the owner and repository, plus a list of tokens. Note that not passing a list will throw an error
repo = GitHub(owner=owner, repository=repo, api_token=args.token)
# fetch all issues/pull requests as an iterator, and iterate it printing
# their number, and whether they are issues or pull requests
for item in repo.fetch():
    if 'pull_request' in item['data']:
        kind = 'Pull request'
    else:
        kind = 'Issue'
    print(item['data']['number'], ':', kind)
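
The reason a list reaches api_token in the snippet above is that argparse collects every value given to an option declared with nargs='+' into a list. A small offline sketch of just that part, with placeholder token values (TOKEN_A and TOKEN_B are not real tokens) and without the extra '--nargs' option string:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Simple parser for GitHub issues and pull requests"
)
# nargs='+' gathers one or more values into a list
parser.add_argument("-t", "--token", nargs="+", help="GitHub tokens")
parser.add_argument("-r", "--repo", help="GitHub repository, as 'owner/repo'")

# Simulates: script.py -t TOKEN_A TOKEN_B -r chaoss/grimoirelab-perceval
args = parser.parse_args(["-t", "TOKEN_A", "TOKEN_B", "-r", "chaoss/grimoirelab-perceval"])
owner, repo = args.repo.split("/")

print(args.token)
print(owner, repo)
```

Here `args.token` is already a list, so it can be passed as `api_token=args.token` without wrapping it again.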

kshitij3199 commented 4 years ago

Thank you @valeriocos for your clarification, I will improve the snippet of the code there and will send you a PR.

kshitij3199 commented 4 years ago

Hello @valeriocos,

I have set up a dev environment to work on GrimoireLab and executed micro-mordred, after which I got the following screen in Kibana:

Screenshot from 2020-03-11 01-57-25

But for some fields like jenkins, git, github_issues etc., no data is available (I have tried changing the time range). So I just want to ask whether I made a mistake in setting up GrimoireLab, or whether this is fine.

Screenshot from 2020-03-11 02-14-05

vchrombie commented 4 years ago

https://github.com/chaoss/grimoirelab/issues/285#issuecomment-597309079

Hi @kshitij3199

I want to ask which backend you ran. If you are running only the git backend (--backend git), then you cannot retrieve the other data. Also, make sure the required configuration is in place (I mean projects.json and setup.cfg). You can find them here.

I hope I have answered your question. :slightly_smiling_face:

kshitij3199 commented 4 years ago

Thank you @vchrombie for such a quick response. I will look into my setup.cfg and projects.json files for any discrepancy and reply back to you.

imnitishng commented 4 years ago

Hi @kshitij3199, I've been having problems setting up docker-compose for micro-mordred. You seem to have it working; could you share your docker-compose.yml file here? I can't figure out what the problem with my system is.

valeriocos commented 4 years ago

Hi @kshitij3199,

Based on the data sources declared in setup.cfg, the Mordred panels task automatically imports the corresponding dashboards and adds them to the top menu (ref: https://github.com/chaoss/grimoirelab-sirmordred/blob/master/sirmordred/task_panels.py#L239, https://github.com/chaoss/grimoirelab-sirmordred/blob/master/sirmordred/task_panels.py#L495). Thus, some dashboards may be empty if you execute the raw/enrich phases with micro-mordred on only some data sources and also run the --panels phase.
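
To see which data sources a setup.cfg declares, one rough approach is to list its sections and drop the ones that configure the platform itself. This is only a sketch with the standard library, not how sirmordred does it: the set of non-data-source section names and the [github:issue] section are assumptions made for illustration.

```python
import configparser

# Assumption: these sections configure the platform rather than a data source
NON_DATA_SOURCE_SECTIONS = {
    "general", "projects", "es_collection", "es_enrichment",
    "sortinghat", "panels", "phases",
}

# Trimmed-down example configuration; [github:issue] is an assumed section name
SETUP_CFG = """
[general]
short_name = Grimoire
update = false

[git]
raw_index = git_chaoss
enriched_index = git_chaoss_enriched

[github:issue]
raw_index = github_issues_chaoss
enriched_index = github_issues_chaoss_enriched
"""


def declared_data_sources(cfg_text):
    """Return the section names that look like data sources, in file order."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return [name for name in parser.sections() if name not in NON_DATA_SOURCE_SECTIONS]


data_sources = declared_data_sources(SETUP_CFG)
print(data_sources)
```

A dashboard whose data source appears in this list but whose raw/enrich phases were never run would be imported and still show no data.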

Hope this helps

kshitij3199 commented 4 years ago

Hi @imnitishng , you can find my docker-compose.yml file here. I hope it helps you :slightly_smiling_face:.

kshitij3199 commented 4 years ago

Thank you @valeriocos @vchrombie for your help. I was using only the git backend (--backend git), which is why I was unable to retrieve other data.

But today, when executing micro-mordred, I am facing the following issues:

Something went wrong when adding an alias on http://localhost:9200/git_chaoss. Alias not set. 400 Client Error: Bad Request for url: https://admin:admin@localhost:9200/_aliases

2020-03-11 18:48:47,751 [git] Problem executing study enrich_areas_of_code:git, RequestError(400, 'search_phase_execution_exception', 'No mapping found for [metadata__timestamp] in order to sort on')
2020-03-11 18:48:47,751 RequestError(400, 'search_phase_execution_exception', 'No mapping found for [metadata__timestamp] in order to sort on')
Process finished with exit code 255

valeriocos commented 4 years ago

can you copy here the output of these commands? thanks

curl -XGET https://admin:admin@localhost/_aliases?pretty -k
curl -XGET https://admin:admin@localhost/_cat/indices -k

kshitij3199 commented 4 years ago

@valeriocos , this is the output I am getting for both commands:

curl: (7) Failed to connect to localhost port 443: Connection refused

valeriocos commented 4 years ago

can you try with -k ? (I have just edited the comment above)

kshitij3199 commented 4 years ago

For the command curl -XGET https://admin:admin@localhost:9200/_aliases?pretty -k the output is

{
  "github_issues_chaoss" : {
    "aliases" : { }
  },
  "cocom_chaoss" : {
    "aliases" : {
      "cocom-raw" : { }
    }
  },
  "git_chaoss" : {
    "aliases" : { }
  },
  "git-aoc_chaoss_enriched" : {
    "aliases" : { }
  },
  "git_chaoss_enriched" : {
    "aliases" : {
      "demographics" : { }
    }
  },
  "searchguard" : {
    "aliases" : { }
  },
  "github_issues_chaoss_enriched" : {
    "aliases" : { }
  },
  ".kibana" : {
    "aliases" : { }
  }
}

For the command curl -XGET https://admin:admin@localhost:9200/_cat/indices -k the output is

yellow open github_issues_chaoss wE_5RzYlSxWreyR4KKtkCw 5 1 0 0 1.1kb 1.1kb
yellow open cocom_chaoss cRQdra5eQFiEO6BxX7am7w 5 1 57 1 146.7kb 146.7kb
yellow open git_chaoss 2RwIDj3xQHKNGYPJT7M5NQ 5 1 0 0 1.2kb 1.2kb
yellow open git-aoc_chaoss_enriched 5K-P9oJSQSmr2ehfTo1yoA 5 1 0 0 1.2kb 1.2kb
yellow open git_chaoss_enriched 9JwfibY7RYiOxwP3bYtNsA 5 1 0 0 1.2kb 1.2kb
green open searchguard t2SwaGUmQtKiA9V2PiA-TA 1 0 0 5 33.6kb 33.6kb
yellow open github_issues_chaoss_enriched V1jXyE9rT6-i2M9smmK2fw 5 1 0 0 1.1kb 1.1kb
yellow open .kibana XgDSRZXnT_mExyBlwVLt4Q 1 1 319 0 347.2kb 347.2kb

valeriocos commented 4 years ago

Something went wrong when adding an alias on http://localhost:9200/git_chaoss. Alias not set. 400 Client Error: Bad Request for url: https://admin:admin@localhost:9200/_aliases

Did you modify this file https://github.com/chaoss/grimoirelab-sirmordred/blob/master/aliases.json by chance?

2020-03-11 18:48:47,751 [git] Problem executing study enrich_areas_of_code:git, RequestError(400, 'search_phase_execution_exception', 'No mapping found for [metadata__timestamp] in order to sort on')
2020-03-11 18:48:47,751 RequestError(400, 'search_phase_execution_exception', 'No mapping found for [metadata__timestamp] in order to sort on')
Process finished with exit code 255

This error is due to the fact that the index git-aoc_chaoss_enriched is empty (see the 0 0 document counts and the sizes in the _cat/indices output). Please check the comment https://github.com/chaoss/grimoirelab/issues/285#issuecomment-590061923 and consider submitting a PR to sirmordred to improve the troubleshooting section (https://github.com/chaoss/grimoirelab-sirmordred#troubleshooting)
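
Empty indices can be spotted from the _cat/indices output by looking at the docs.count column, which is the seventh column in the default layout (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size). A minimal sketch that parses the text pasted earlier in this thread; the helper name is hypothetical:

```python
CAT_INDICES_OUTPUT = """\
yellow open github_issues_chaoss wE_5RzYlSxWreyR4KKtkCw 5 1 0 0 1.1kb 1.1kb
yellow open cocom_chaoss cRQdra5eQFiEO6BxX7am7w 5 1 57 1 146.7kb 146.7kb
yellow open git_chaoss 2RwIDj3xQHKNGYPJT7M5NQ 5 1 0 0 1.2kb 1.2kb
yellow open git-aoc_chaoss_enriched 5K-P9oJSQSmr2ehfTo1yoA 5 1 0 0 1.2kb 1.2kb
yellow open git_chaoss_enriched 9JwfibY7RYiOxwP3bYtNsA 5 1 0 0 1.2kb 1.2kb
green open searchguard t2SwaGUmQtKiA9V2PiA-TA 1 0 0 5 33.6kb 33.6kb
yellow open github_issues_chaoss_enriched V1jXyE9rT6-i2M9smmK2fw 5 1 0 0 1.1kb 1.1kb
yellow open .kibana XgDSRZXnT_mExyBlwVLt4Q 1 1 319 0 347.2kb 347.2kb
"""


def empty_indices(cat_output):
    """Return the names of indices whose docs.count column is 0."""
    empty = []
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        index_name, docs_count = fields[2], int(fields[6])
        if docs_count == 0:
            empty.append(index_name)
    return empty


empty = empty_indices(CAT_INDICES_OUTPUT)
print(empty)
```

In this output both git_chaoss (the raw index) and git-aoc_chaoss_enriched hold no documents, so a study that sorts on metadata__timestamp has no mapping to work with.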

kshitij3199 commented 4 years ago

Thank you very much @valeriocos, it finally worked :smile: and I was able to see graphs, data etc. in Kibana. I will also send a PR to improve the troubleshooting section.

But I am still getting this error, even after deleting aliases.json and pasting in a new copy:

Something went wrong when adding an alias on http://localhost:9200/git_chaoss. Alias not set. 400 Client Error: Bad Request for url: https://admin:admin@localhost:9200/_aliases

Also there is one more issue, I forgot to mention

Error enriching ocean from git (https://github.com/chaoss/grimoirelab-perceval): unhashable type: 'dict'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/grimoire_elk/elk.py", line 597, in enrich_backend
    elastic_enrich = get_elastic(url_enrich, enrich_index, clean, enrich_backend, es_enrich_aliases)
  File "/usr/local/lib/python3.6/dist-packages/grimoire_elk/utils.py", line 260, in get_elastic
    analyzers=analyzers, aliases=es_aliases)
  File "/usr/local/lib/python3.6/dist-packages/grimoire_elk/elastic.py", line 110, in __init__
    self.add_alias(alias)
  File "/usr/local/lib/python3.6/dist-packages/grimoire_elk/elastic.py", line 239, in add_alias
    if aliases and alias in aliases:
TypeError: unhashable type: 'dict'
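
The last frame of the traceback can be reproduced in plain Python: a membership test against a dict (or set) hashes the value being looked up, so it fails when that value is itself a dict. The sketch below assumes, without confirmation from the thread, that the alias being checked was a dict-style entry while the code expected a string; the variable names and the dict layout are hypothetical.

```python
# Alias names already set on an index, keyed by name
existing_aliases = {"demographics": {}}

string_alias = "demographics"        # a plain string is hashable
dict_alias = {"alias": "git"}        # hypothetical dict-style alias entry

# Works: the string is hashed and looked up among the keys
found = string_alias in existing_aliases
print(found)

# Fails: a dict cannot be hashed, so the lookup raises TypeError
try:
    dict_alias in existing_aliases
except TypeError as error:
    message = str(error)
    print(message)
```

If that assumption holds, the error points to a mismatch between the format of the aliases being passed in and the version of the code reading them.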

valeriocos commented 4 years ago

Also there is one more issue, I forgot to mention

I'll try to replicate the problem on my machine

Can you share your setup.cfg, projects.json (just the part about git and general will be good) and the version of elasticsearch you are using? Thanks

kshitij3199 commented 4 years ago

Hi @valeriocos ,

I have used these versions:

elasticsearch==6.3.1
elasticsearch-dsl==6.3.1
file-read-backwards==2.0.0

setup.cfg file

[general]
short_name = Grimoire
update = false
min_update_delay = 10
debug = true
logs_dir = logs
bulk_size = 100
scroll_size = 100
menu_file = ../menu.yaml
aliases_file = ../aliases.json

[git]
raw_index = git_chaoss
enriched_index = git_chaoss_enriched
latest-items = false
category = commit
studies = [enrich_demography:git, enrich_areas_of_code:git, enrich_onion:git]

projects.json file

{
    "grimoire": {
        "git": [
            "https://github.com/chaoss/grimoirelab-perceval"
        ],
        "cocom": [
            "https://github.com/chaoss/grimoirelab-toolkit"
        ],
        "colic": [
            "https://github.com/chaoss/grimoirelab-toolkit"
        ],
        "*github": [
            "https://github.com/chaoss/grimoirelab-perceval"
        ],
        "*github:pull": [
            "https://github.com/chaoss/grimoirelab-perceval"
        ],
        "github:repo": [
            "https://github.com/chaoss/grimoirelab-perceval"
        ],
        "jenkins": [
           "https://build.opnfv.org/ci"
        ],
        "gitlab:issue": [
            "https://gitlab.com/gitlab-org/gitlab-ce"
        ],
        "gitlab:merge": [
            "https://gitlab.com/gitlab-org/gitlab-ce"
        ]
    }
}

valeriocos commented 4 years ago

ok, thanks! the docker compose is this one: https://github.com/chaoss/grimoirelab-sirmordred#source-code-and-docker?

kshitij3199 commented 4 years ago

@valeriocos I have changed the Elasticsearch user and password for kibiter:

ELASTICSEARCH_USER=admin
ELASTICSEARCH_PASSWORD=admin

Please find the complete file here: docker-compose.yml

kshitij3199 commented 4 years ago

Hi @valeriocos, I have sent PRs based on what you said. Please have a look at https://github.com/chaoss/grimoirelab-sirmordred/pull/418 and https://github.com/chaoss/grimoirelab-sirmordred/pull/419

Please check the comment #285 (comment) and consider submitting a PR to sirmordred to improve the troubleshooting section (https://github.com/chaoss/grimoirelab-sirmordred#troubleshooting)

valeriocos commented 4 years ago

Thanks @kshitij3199 for the PRs

valeriocos commented 4 years ago

https://github.com/chaoss/grimoirelab/issues/285#issuecomment-597698114

I'm not able to replicate your issue, I believe you are using an old version of ELK and not the one in master. From the log you posted at https://github.com/chaoss/grimoirelab/issues/285#issuecomment-597661939, I see that there are some inconsistencies between the calls in your code and the ones in the master branch. For instance:

kshitij3199 commented 4 years ago

@valeriocos , I am using grimoire-elk version 0.63.0

valeriocos commented 4 years ago

It's a bit old :) The latest one is 0.70.0: https://github.com/chaoss/grimoirelab-elk/commit/6f449a69d8730d88712bb6332227808a6bdadd4a. Can you check whether the error is gone with this version? Thanks

kshitij3199 commented 4 years ago

@valeriocos , I have tried downloading the latest version with pycharm package installer and with terminal but they are not showing any package above, 0.63.0. Any other way to download it.

pip3 install grimoire-elk==0.70.0
Could not find a version that satisfies the requirement grimoire-elk==0.70.0 (from versions: 0.20rc1, 0.22, 0.22.1, 0.26.5, 0.30.4, 0.30.7, 0.30.8, 0.30.9, 0.30.11, 0.30.13, 0.30.18, 0.30.22, 0.30.23, 0.30.24, 0.30.27, 0.30.30, 0.30.33, 0.30.37, 0.30.39, 0.30.48, 0.30.51, 0.30.53, 0.31.0, 0.31.4, 0.32.0, 0.36.0, 0.47.0, 0.55.0, 0.58.0, 0.62.0, 0.63.0)
No matching distribution found for grimoire-elk==0.70.0

Screenshot from 2020-03-12 16-10-07

valeriocos commented 4 years ago

The latest version isn't available on pip. Please follow the instructions at https://github.com/chaoss/grimoirelab-sirmordred#setting-up-a-pycharm-dev-environment, so you can use the code in the master branch, thanks

vchrombie commented 4 years ago

Hi @kshitij3199

@valeriocos , I have tried downloading the latest version with pycharm package installer and with terminal but they are not showing any package above, 0.63.0. Any other way to download it.

You can use the Project Structure to add the repository.

kshitij3199 commented 4 years ago

Thank you @valeriocos @vchrombie for your help. Now the env is correctly set up, as the code is running without any error or warning :smile:

Soniyanayak51 commented 4 years ago

Hi, I am Soniya Nayak and I am an Outreachy applicant 2020. This project looks very interesting and I'm looking forward to contributing here!

valeriocos commented 4 years ago

Hi @Soniyanayak51, welcome on board! Please have a look at the microtasks and don't hesitate to write if you need help.

kshitij3199 commented 4 years ago

Hi @valeriocos , I am currently working on the test_git.py file to increase the coverage of the git.py file. In https://github.com/chaoss/grimoirelab-elk/blob/master/tests/test_git.py, some tests like test_refresh_identities and test_refresh_project are not written completely. Is there any specific reason for that?

valeriocos commented 4 years ago

Thank you @kshitij3199 for working on this, there is no specific reason. Please note that you can complete the tests by querying the enriched index and checking that the data is correctly stored (you can mimic the code at https://github.com/chaoss/grimoirelab-elk/pull/801/files#diff-c12d8b17feda020355ff7084da770c2bR105)

kshitij3199 commented 4 years ago

Hi @valeriocos, in the test_git.py file we don't have tests that check the areas_of_code and git_branches methods. Can we add tests for these methods to increase the coverage of the git.py file?

valeriocos commented 4 years ago

Hi @kshitij3199 , good idea! thanks! Maybe you can start with areas of code, the test should be similar to https://github.com/chaoss/grimoirelab-elk/blob/master/tests/test_git.py#L236

kshitij3199 commented 4 years ago

Hi @valeriocos , I am writing a test for areas_of_code, and I am facing an issue.

Traceback (most recent call last):
  File "test_git.py", line 291, in test_enrich_areas_of_code
    study(ocean_backend, enrich_backend)
  File "../grimoire_elk/enriched/git.py", line 543, in enrich_areas_of_code
    for source in self.json_projects.values():
AttributeError: 'NoneType' object has no attribute 'values'

The issue is that the value of self.json_projects is None.

This is the relevant part of the code:

def test_enrich_areas_of_code(self):
    """ Test that areas of code works correctly"""

    study, ocean_backend, enrich_backend = self._test_study('enrich_areas_of_code')
    with self.assertLogs(logger, level='INFO') as cm:

        if study.__name__ == "enrich_areas_of_code":
            study(ocean_backend, enrich_backend)

heming6666 commented 4 years ago

Hi, I am Haiming Lin, a student at Tongji University. I am very interested in working on this idea.

I have a question when going through the code of sirmordred.py. As is shown below, it calls the execute_batch_tasks function with the same params twice. Is there any specific reason for that ?

if not self.conf['general']['update']:
    sleep_for = self.conf['sortinghat']['sleep_for'] if self.conf.get('sortinghat', None) else 1
    self.execute_batch_tasks(all_tasks_cls,
                                sleep_for,
                                self.conf['general']['min_update_delay'])
    self.execute_batch_tasks(all_tasks_cls,
                                sleep_for,
                                self.conf['general']['min_update_delay'])
    break

Thanks!

valeriocos commented 4 years ago

https://github.com/chaoss/grimoirelab/issues/285#issuecomment-600974355

Thank you @heming6666 for your interest. That part of the code isn't really used, since the attribute update in setup.cfg is generally set to True. Please open an issue on sirmordred, and we can move the discussion there, thanks!

valeriocos commented 4 years ago

https://github.com/chaoss/grimoirelab/issues/285#issuecomment-600789030

@kshitij3199 can you share your code by opening a pull request on ELK (I'll try to reproduce the error)? Thanks

kshitij3199 commented 4 years ago

Hi @valeriocos , I have sent a pull request: chaoss/grimoirelab-elk#811. Please check.