chaoss / grimoirelab-elk

GNU General Public License v3.0

Question: which backend file should I change to stop loading GitHub usernames and e-mail? #236

Closed ghost closed 12 months ago

ghost commented 6 years ago

Ls,

I would like to try some modifications to, for example, the loading of GitHub data. I would like to stop loading usernames and e-mail addresses. Which file would I need to change, and would you have an example? Or is this not possible?

acs commented 6 years ago

@robinmuilwijk e-mail addresses are not included in the enriched indexes, for privacy reasons. Do you want to stop loading them completely?

ghost commented 6 years ago

I should have mentioned Git commits, this includes name and e-mail.

My setup is based on p2o, grimoire-elk and perceval, nothing more. I would like to be able to remove names, e-mail etc, all privacy data basically (not load them at all). Is that possible somehow?

acs commented 6 years ago

@robinmuilwijk so by "not load them at all" you mean that the privacy data should not appear anywhere in GrimoireLab (including raw indexes, sortinghat and enriched indexes). Right?

ghost commented 6 years ago

Correct, the reason for this, is the new GDPR by the EU. Ideally, I would keep the email domain, so I can see if the person is company or community, but nothing else of privacy data.

acs commented 6 years ago

> Correct, the reason for this, is the new GDPR by the EU. Ideally, I would keep the email domain, so I can see if the person is company or community, but nothing else of privacy data.

I supposed it was related to GDPR. We are also analyzing the implications of the new regulation.

Answering your question, the complete solution will be hard to implement. The first step is to avoid collecting all this information at the perceval level, so all the perceval backends must be reviewed to drop all the privacy-related fields. If this is done at the perceval level, all the data in GrimoireLab will be free from privacy concerns. But it is not an easy task.
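As a rough illustration of what "dropping privacy-related fields" could look like for a single fetched item, here is a minimal Python sketch. It is not part of perceval: `drop_privacy_fields` and `PRIVACY_FIELDS` are hypothetical names, and the listed field names are only assumptions about which keys of a git commit item carry names and e-mail addresses. Free-text fields such as the commit message can also contain personal data and are not handled here.

```python
# Hypothetical helper, not a perceval API.
# Assumed keys that hold "Name <email>" strings in a git commit item.
PRIVACY_FIELDS = frozenset({"Author", "Commit", "Signed-off-by"})


def drop_privacy_fields(item, fields=PRIVACY_FIELDS):
    """Return a copy of `item` whose 'data' dict lacks the given fields."""
    cleaned = dict(item)  # shallow copy, the input item is left untouched
    cleaned["data"] = {
        key: value
        for key, value in item.get("data", {}).items()
        if key not in fields
    }
    return cleaned


if __name__ == "__main__":
    example = {
        "uuid": "abc",
        "data": {
            "Author": "Jane Doe <jane@example.com>",
            "Commit": "Jane Doe <jane@example.com>",
            "commit": "deadbeef",
            "message": "Fix bug",
        },
    }
    print(drop_privacy_fields(example))
```

A filter like this would have to run before the item is written anywhere (archive, raw index), and each backend would need its own reviewed list of fields.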

What are your plans related to this issue? Do you have resources for playing with some prototypes?

ghost commented 6 years ago

I was afraid it would not be that easy to fix, I dug around most of the perceval and grimoire-elk scripts already.

I somehow need to get the entire dashboard to comply with GDPR. Asking for consent is probably even harder, and brings other challenges such as removing data on request. Hence my 'path', i.e. the choice to not load privacy data at all.

I only have a production dashboard, but I could set up a test environment again. I'd be happy to help in any way with regard to the GDPR. I think it would be great if GrimoireLab could be made GDPR compliant by Bitergia, not just for me, but for other users/customers also.

acs commented 6 years ago

Yes, we are quite interested in making GrimoireLab GDPR ready, but we are not sure when we can start working on it. This is why I asked you about your plans, to try to join efforts. Are you clear on what the GDPR requirements are for GrimoireLab? Probably the first step is to define a plan for adopting GDPR.

ghost commented 6 years ago

I am clear on the GDPR basics, but an 'architect' on your end would also need to check the entire GrimoireLab stack. And you would probably need a lawyer at some point for the GDPR details. I think you would first need to check on this subject internally, then let me and others know if we can help.

acs commented 6 years ago

@robinmuilwijk we are working now with our legal team to implement the GDPR. Stay tuned!

ghost commented 6 years ago

@acs sounds great! Do let me know if I can help reviewing, testing or anything else. I am very interested in having a GDPR compliant version of GrimoireLab.

acs commented 6 years ago

> @acs sounds great! Do let me know if I can help reviewing, testing or anything else. I am very interested in having a GDPR compliant version of GrimoireLab.

Great! We will report on the progress, probably in the blog.

jgbarah commented 6 years ago

While this information is provided, some comments, @robinmuilwijk:

The tricky part here could be to find out which fields are "safe" for those "privacy-data free" indexes. We should, for example, avoid email addresses, user names and names. But could we use pseudo-anonymized fields? And again here, the fact that we're working with public data makes things funny. If you select any id, anonymous or not, for the author of a commit, and include it in an index, it is trivial to know all the data for that author: it is enough to look for the commit in the original git directory. That's why I would just use a hash of the original id, which would be good enough to be sure that the id cannot be deanonymized with only the data in the index, but at the same time easy to compute, to keep invariants such as being able to count contributions by author.
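The hashing idea can be sketched in a few lines of Python. This is only an illustration, not GrimoireELK code: `hash_id` and `pseudonymize_author` are hypothetical names, SHA-256 is an arbitrary choice of hash, and the input is assumed to be a git-style "Name <user@domain>" string. Keeping the e-mail domain follows the wish expressed earlier in the thread to tell company from community contributors.

```python
import hashlib


def hash_id(value):
    """Return a stable hex digest for an identity string."""
    normalized = value.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pseudonymize_author(author):
    """Map 'Name <user@domain>' to a hashed id plus the e-mail domain."""
    author = author.strip()
    # Everything after '<' is treated as the e-mail part (assumed format).
    _, _, rest = author.partition("<")
    email = rest.rstrip(">").strip().lower()
    domain = email.rpartition("@")[2] if "@" in email else ""
    return {"author_id": hash_id(author), "author_domain": domain}


if __name__ == "__main__":
    print(pseudonymize_author("Jane Doe <jane@example.com>"))
```

The same author always maps to the same `author_id`, so counting contributions by author still works. As noted above, the id is only protected against someone who has nothing but the index: anyone with the public repository can recompute the hash.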

If we agree on all of this, I would start by adding a '--no-privacy-data' to p2o (and related options to GrimoireELK classes and methods), implementing the above. Once that is done, we can either produce a new version of the panels for privacy-concerned dashboards, or add a new version in Kidash that, when loading a panel, checks for the existence of the needed fields, and doesn't upload the visualization if not all of them are present.
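The field check described for panels could look roughly like the following Python sketch. It does not use any Kidash API: `missing_fields` and `uploadable_visualizations` are hypothetical helpers, and both the mapping from visualization name to required fields and the list of fields present in the index are assumed to be available already.

```python
def missing_fields(required_fields, index_fields):
    """Return the required fields that the index does not provide."""
    return sorted(set(required_fields) - set(index_fields))


def uploadable_visualizations(visualizations, index_fields):
    """Keep only visualizations whose required fields all exist.

    `visualizations` maps a visualization name to the fields it needs.
    """
    return [
        name
        for name, required in visualizations.items()
        if not missing_fields(required, index_fields)
    ]


if __name__ == "__main__":
    index_fields = ["author_id", "author_domain", "grimoire_creation_date"]
    panel = {
        "commits_by_domain": ["author_domain", "grimoire_creation_date"],
        "top_authors_by_name": ["author_name"],
    }
    print(uploadable_visualizations(panel, index_fields))
```

With a privacy-free index that has no name field, the second visualization would simply be skipped instead of being uploaded broken.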

Of course, there are other aspects of GDPR to be taken into account. Some are operational, and are more in the realm of the policies of the operator (data controller) than of GrimoireLab itself. Some others will require help from the tooling, such as removing data related to an identity, and maybe we should include some support for that in the form of scripts or options to current commands.

jgbarah commented 6 years ago

BTW, I'm not a lawyer, you know, but there are many people out there with similar concerns, related either to open data or to data "made public by the data subject". For example:

ghost commented 6 years ago

@jgbarah thanks for the information, the --no-privacy-data option would be very interesting to have. As you say, some of it lies with the data controller. And I'll have a look at that art. 9. That looks interesting.

I'll keep an eye on the blog for any further updates.

valeriocos commented 4 years ago

A PR (https://github.com/chaoss/grimoirelab-perceval/pull/580) is under evaluation to remove data retrieved from the /users github endpoint.

jjmerchante commented 12 months ago

We created a new option some years ago to avoid fetching user information from the GitHub API. You can check it out by running the following command:

perceval github chaoss grimoirelab --api-token=xxxx --filter-classified --no-archive

If you want to use GrimoireLab to fetch data from GitHub with anonymized user information, in sirmordred you need to include anonymize = true alongside filter-classified = true in the github section. For example, for issues:

[github:issue]
api-token = xxxx
raw_index = github_raw_index
enriched_index = github_enrich_index
category = issue
no-archive = true
filter-classified = true
anonymize = true
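To double-check that a setup file has both settings enabled before running a collection, a small standalone Python check along these lines may help. It only uses the standard library `configparser`, not sirmordred's own configuration handling, and `privacy_flags_enabled` is a hypothetical helper name.

```python
import configparser

# Example content mirroring the section shown above.
EXAMPLE_CFG = """
[github:issue]
api-token = xxxx
raw_index = github_raw_index
enriched_index = github_enrich_index
category = issue
no-archive = true
filter-classified = true
anonymize = true
"""


def privacy_flags_enabled(cfg_text, section):
    """Return True if both privacy-related options are set to true."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return all(
        parser.getboolean(section, option, fallback=False)
        for option in ("filter-classified", "anonymize")
    )


if __name__ == "__main__":
    print(privacy_flags_enabled(EXAMPLE_CFG, "github:issue"))
```

A missing section or a missing option is treated as "not enabled", so the check fails closed.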

Closing the issue, reopen it if you have any other questions.