Question: which backend file should I change to stop loading GitHub usernames and e-mail?

ghost commented 6 years ago

Ls,

I would like to try some modifications, for example to how GitHub data is loaded. I would like to stop loading usernames and e-mail addresses. Which file would I need to change, and would you have an example? Or is this not possible somehow?

acs commented 6 years ago

@robinmuilwijk e-mail addresses are not included in the enriched indexes for privacy reasons. Do you want to stop loading them completely?

ghost commented 6 years ago

I should have mentioned Git commits, this includes name and e-mail.

My setup is based on p2o, grimoire-elk and perceval, nothing more. I would like to be able to remove names, e-mail etc, all privacy data basically (not load them at all). Is that possible somehow?

acs commented 6 years ago

@robinmuilwijk so by "not load them at all" you mean that the privacy data should not appear anywhere in GrimoireLab (including raw indexes, sortinghat and enriched indexes). Right?

ghost commented 6 years ago

Correct, the reason for this is the new GDPR by the EU. Ideally, I would keep the email domain, so I can see whether the person is company or community, but no other privacy data.

acs commented 6 years ago

> Correct, the reason for this is the new GDPR by the EU. Ideally, I would keep the email domain, so I can see whether the person is company or community, but no other privacy data.

I supposed it was related to GDPR. We are also analyzing the implications of the new regulation.

Answering your question, the complete solution will be hard to implement. The first step is to avoid collecting all this information at the Perceval level, so all the Perceval backends must be reviewed to drop all the privacy-related fields. If this is done at the Perceval level, all the data in GrimoireLab will be free from privacy concerns. But it is not an easy task.

What are your plans related to this issue? Do you have resources for playing with some prototypes?

ghost commented 6 years ago

I was afraid it would not be that easy to fix. I have already dug around most of the perceval and grimoire-elk scripts.

I somehow need to get the entire dashboard to comply with GDPR. Asking for consent is probably even harder, and faces other challenges like removing data on request. Hence my 'path', i.e. the choice to not load privacy data at all.

I only have a production dashboard, but I could set up a test environment again. I'd be happy to help in any way with regards to the GDPR. I think it would be great if GrimoireLab could be made GDPR compliant by Bitergia, not just for me, but for other users/customers also.

acs commented 6 years ago

Yes, we are pretty interested in making GrimoireLab GDPR ready, but we are not sure when we can start working on it. This is why I asked you about your plans, to try to join efforts. Are you clear on what the requirements of GDPR are for GrimoireLab? Probably the first step is to define a plan for adopting GDPR.

ghost commented 6 years ago

I am clear on the GDPR basics, but an 'architect' on your end would also need to check the entire GrimoireLab stack. And you would probably need a lawyer at some point for the GDPR details. I think you would first need to look into this subject internally, then let me and others know if we can help.

acs commented 6 years ago

@robinmuilwijk we are working now with our legal team to implement the GDPR. Stay tuned!

ghost commented 6 years ago

@acs sounds great! Do let me know if I can help with reviewing, testing or anything else. I am very interested in having a GDPR compliant version of GrimoireLab.

acs commented 6 years ago

> @acs sounds great! Do let me know if I can help with reviewing, testing or anything else. I am very interested in having a GDPR compliant version of GrimoireLab.

Great! We will post updates on the progress, probably in the blog.

jgbarah commented 6 years ago

While waiting for that information, some comments, @robinmuilwijk:

The status of information made completely public by its "owners" is not clearly described (in my opinion) in the GDPR. At least to some extent, it could be understood that when you include an email address in a git commit, you are consenting to everybody downloading that commit to their git repos, and to seeing it every time they run git log, or git blame, or git log | grep "email address" | wc. Otherwise, the only possible conclusion is that GitHub is in big trouble, and that you cannot git clone and then run the previous command. To what extent doing more general analytics on this data can be considered the same is something I don't know, and more knowledgeable people whom I've asked cannot say for sure, either. It seems the GDPR just didn't have this kind of data in mind. That said, all of this amounts (from my point of view) to the following: the most sensible approach is to try to ensure that data is used as if no explicit consent was given, if possible, or as if consent was given when the data was provided but other provisions of the GDPR still apply (such as data removal). So, for the rest, I move to the (for me) safer technical ground, with this latter approach in mind.
The first step to consider is Perceval. It mines data from the original API as it is, and therefore if that data includes privacy-related fields, those are retrieved as well. From my understanding of the GDPR, this is not a problem in itself. Storing that information longer than needed, and using it in a way that can lead to identifying people, could be. So, I think that an approach with a clear data retention policy, short enough to make it clear that the data is only used for extracting the needed (non-GDPR-affected) data, would be GDPR compliant. At the extreme, it would mean just mining, using the information needed for enrichment (ignoring all privacy-related data), keeping what may be needed for incremental retrieval, and deleting the rest. Alternatively, all data produced by Perceval could be stored, but with a clear policy for completely deleting it x time later (x being the time needed to ensure that the data was correctly processed).
Sorting Hat would be completely avoided in this scenario.
GrimoireELK could work with no trouble, since for enrichment it would only use data not related to privacy, producing "privacy-data free" indexes.
Specific panels, using only data in those "privacy-data free" indexes, should be produced, likely based on the standard ones, removing the tables and visualizations that use privacy-related data.
The rest of the tool chain would need no changes.
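
The retention idea above can be sketched in a few lines of Python. This is only an illustration, not existing GrimoireLab code: `purge_expired` is a hypothetical helper, and it assumes each raw item carries a `timestamp` field with the retrieval time in epoch seconds (my understanding is that Perceval items have one, but check against your own data).

```python
import time


def purge_expired(raw_items, max_age_seconds, now=None):
    """Return only the raw items that are still inside the retention window.

    Hypothetical helper: assumes every item is a dict with a 'timestamp'
    key holding the retrieval time in epoch seconds.
    """
    now = time.time() if now is None else now
    return [
        item for item in raw_items
        if now - item["timestamp"] <= max_age_seconds
    ]


# Two fake raw items, one retrieved long ago and one recently.
items = [
    {"uuid": "a", "timestamp": 100.0},
    {"uuid": "b", "timestamp": 900.0},
]
kept = purge_expired(items, max_age_seconds=500, now=1000.0)
```

In a real deployment the same rule would have to be applied to wherever the raw data lives (for example, the raw indexes), with x chosen long enough for enrichment to finish.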

The tricky part here could be to find out which fields are "safe" for those "privacy-data free" indexes. We should for example avoid email addresses, user names and names. But could we use pseudo-anonymized fields? And again here, the fact that we're working with public data makes things funny. If you select any id, anonymous or not, for the author of a commit, and include it in an index, it is trivial to find all the data for that author: it is enough to look for the commit in the original git directory. That's why I would just use a hash of the original id, which would be good enough to be sure that the id cannot be deanonymized with only the data in the index, but at the same time easy to compute, keeping invariants such as being able to count contributions by author.
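
A minimal sketch of the hashing idea, in Python. The function names and the choice of SHA-256 are my own assumptions for illustration, not something GrimoireLab defines. The second helper keeps only the email domain, as requested earlier in the thread for telling company from community contributors.

```python
import hashlib


def pseudonymize(identity: str, salt: str = "") -> str:
    """Hash an identity (e.g. an email address) into a stable pseudonym.

    Hypothetical helper. The same input always gives the same output,
    so contributions can still be counted per author. The optional salt
    is an assumption of this sketch, not part of the proposal above.
    """
    normalized = identity.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def email_domain(email: str) -> str:
    """Return the part after the last '@', or '' if there is no '@'."""
    _, sep, domain = email.rpartition("@")
    return domain.lower() if sep else ""


author_id = pseudonymize("Jane@Example.com")
domain = email_domain("Jane@Example.com")
```

Note that a plain hash only protects against deanonymization from the index alone, as said above: anyone with the original repository can hash the addresses found there and match them.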

If we agree on all of this, I would start by adding a '--no-privacy-data' to p2o (and related options to GrimoireELK classes and methods), implementing the above. Once that is done, we can either produce a new version of the panels for privacy-concerned dashboards, or add a new version in Kidash that, when loading a panel, checks for the existence of the needed fields, and doesn't upload the visualization if not all of them are present.
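
To make the proposal a bit more concrete, here is a rough Python sketch of the two pieces: dropping privacy fields from an enriched item, and the check Kidash could run before uploading a visualization. Everything here is hypothetical: the field names in `PRIVACY_FIELDS` are placeholders, and neither function exists in GrimoireELK or Kidash.

```python
# Placeholder field names, chosen only for this example.
PRIVACY_FIELDS = frozenset({"author_name", "author_email", "author_user_name"})


def drop_privacy_fields(item, privacy_fields=PRIVACY_FIELDS):
    """Return a copy of an enriched item without the privacy fields."""
    return {key: value for key, value in item.items()
            if key not in privacy_fields}


def can_upload(required_fields, index_fields):
    """True if every field a visualization needs exists in the index."""
    return set(required_fields) <= set(index_fields)


item = {"hash": "abc", "author_name": "Jane", "lines_added": 3}
clean = drop_privacy_fields(item)

# A visualization that needs 'author_name' would be skipped for this index.
ok = can_upload(["hash", "lines_added"], clean)
skipped = not can_upload(["author_name"], clean)
```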

Of course, there are other aspects of the GDPR to be taken into account. Some are operational, and are more in the realm of the policies of the operator (data controller) than of GrimoireLab itself. Some others will require help from the tooling, such as removing data related to an identity, and maybe we should include some support for that in the form of scripts or options to current commands.

jgbarah commented 6 years ago

BTW, I'm not a lawyer, you know, but there are many people out there with similar concerns, related either to open data or to data "made public by the data subject". For example:

Art 9 of the GDPR. Note exception (e), "processing relates to personal data which are manifestly made public by the data subject;", and (only maybe, assuming FOSS projects and/or FOSS foundations enter in this category) (d) "processing is carried out in the course of its legitimate activities with appropriate safeguards by a foundation, association or any other not-for-profit body with a political, philosophical, religious or trade union aim and on condition that the processing relates solely to the members or to former members of the body or to persons who have regular contact with it in connection with its purposes and that the personal data are not disclosed outside that body without the consent of the data subjects".
Personal data made public by the ‘data subject’ and use of information published on social networks: early observations of GDPR art. 9, para. 2, letter e)
GDPR vs. (part of) open data

ghost commented 6 years ago

@jgbarah thanks for the information, the --no-privacy-data option would be very interesting to have. As you say, some of it lies with the data controller. And I'll have a look at that art. 9. That looks interesting.

I'll keep an eye on the blog for any further updates.

valeriocos commented 4 years ago

A PR (https://github.com/chaoss/grimoirelab-perceval/pull/580) is under evaluation to remove data retrieved from the /users github endpoint.

jjmerchante commented 12 months ago

We created a new option some years ago to avoid fetching user information from GitHub API. You can check it out by running the following command:

perceval github chaoss grimoirelab --api-token=xxxx --filter-classified --no-archive

If you want to use GrimoireLab to fetch data from GitHub with anonymized user information, you need to include anonymize = true alongside filter-classified = true in the github section of the sirmordred configuration. For example, for issues:

[github:issue]
api-token = xxxx
raw_index = github_raw_index
enriched_index = github_enrich_index
category = issue
no-archive = true
filter-classified = true
anonymize = true
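
If you want to double-check a configuration before running it, a small Python sketch like the following can help. It is not part of sirmordred and does not reproduce how sirmordred itself parses its configuration; `is_anonymized` is a hypothetical helper that just reads the INI-style text with the standard `configparser` module and checks that both options are set.

```python
import configparser

SAMPLE = """
[github:issue]
api-token = xxxx
raw_index = github_raw_index
enriched_index = github_enrich_index
category = issue
no-archive = true
filter-classified = true
anonymize = true
"""


def is_anonymized(config_text, section):
    """True only if both anonymize and filter-classified are enabled.

    Hypothetical helper: a missing section or option counts as disabled.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return (parser.getboolean(section, "anonymize", fallback=False)
            and parser.getboolean(section, "filter-classified", fallback=False))


result = is_anonymized(SAMPLE, "github:issue")
```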

Closing the issue, reopen it if you have any other questions.

chaoss / grimoirelab-elk

Question: which backend file should I change to stop loading GitHub usernames and e-mail? #236