donders-research-data-management / rdm-wiki

Technical documentation for RDM
http://donders-research-data-management.github.io/rdm-wiki

Feature request: DR web-hook for automatic updates from GitLab to an RDC? #61

Open JoKeyser opened 6 years ago

JoKeyser commented 6 years ago

I'm using Git and gitlab.socsci.ru.nl as the primary way to manage my research projects. Compared with an RDC, this has the benefit of retaining the full history and of simpler authentication with SSH keys rather than one-time passwords (plus nice things like an issue tracker).

To comply with the DR requirements, it would be great if I could set up my project to push to my RDC. This seems feasible, for example through GitLab's continuous deployment via webhooks. Would this be possible and desirable?

robertoostenveld commented 6 years ago

This is an interesting idea that would also be of value to others.

However, I don't think it would work without changes to the low-level authentication system. Writing files to a collection in the DR requires that you authenticate, and SSH authentication with shared keys is not possible. The authentication is built on SURFconext, which is HTTP-based and requires user interaction. That limitation is also the reason we have the one-time passwords for WebDAV (copying files).
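To make the WebDAV route concrete, here is a minimal Python sketch of what a single file upload looks like at the HTTP level. It only builds the request and does not send it. The base URL, the remote path, and the use of HTTP Basic authentication with the one-time password are all assumptions made for illustration; the actual DR endpoint and its authentication details are not stated in this thread.

```python
import base64
import urllib.request


def build_webdav_put(base_url, remote_path, data, username, one_time_password):
    """Build (but do not send) an HTTP PUT request that uploads `data`.

    Assumption: the server accepts HTTP Basic authentication in which the
    password is the one-time password. All argument values are hypothetical.
    """
    url = base_url.rstrip("/") + "/" + remote_path.lstrip("/")
    credentials = f"{username}:{one_time_password}".encode("utf-8")
    request = urllib.request.Request(url, data=data, method="PUT")
    request.add_header(
        "Authorization",
        "Basic " + base64.b64encode(credentials).decode("ascii"),
    )
    return request
```

Passing the returned object to `urllib.request.urlopen` would perform the upload. The point of the sketch is that someone still has to obtain the one-time password first, which is the interactive step that gets in the way of automation.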

Right now I cannot think of a system to get files into the DR except for the stager that @hurngchunlee implemented for the data flow of the DCCN scanners. That uses the System Administrator account from a secured server that has special access to the backend server. But as a regular user you cannot affect the rules/scripts that run on that server.

But to switch gears: why would you want to do daily/frequent synchronisation of your git code? Policy-wise it would be fine if you uploaded the git repository only once, just prior to collection closure. After all, the DR is an archive for persistent long-term storage, and collections are to be closed before they become persistent. As long as the collection is open, you can do whatever you want with your git repo, like reverting commits (and nobody would be able to tell). The DR itself is not a version control system for the files that are stored in it.
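One possible way to prepare such a one-time upload (a sketch, not a DR requirement) is to pack the repository with its full history into a single file using `git bundle`, and then upload that file manually. The function names and paths below are hypothetical; only the `git bundle create ... --all` command itself is standard Git.

```python
import subprocess
from pathlib import Path


def bundle_command(repo_dir, bundle_path):
    """Return the git command that packs all refs of repo_dir into one file."""
    return ["git", "-C", str(repo_dir), "bundle", "create", str(bundle_path), "--all"]


def make_bundle(repo_dir, bundle_path):
    """Create the bundle; requires git to be installed and repo_dir to be a repository."""
    # Resolve to an absolute path, because `git -C` changes the working directory.
    bundle_path = Path(bundle_path).resolve()
    subprocess.run(bundle_command(repo_dir, bundle_path), check=True)
    return bundle_path
```

The resulting file can later be restored with `git clone <file>`, so the history stays available even though the DR stores it as an ordinary file.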

robertoostenveld commented 6 years ago

PS I suggest that this message (and the associated thread) is sent to the datasupport team, so that it gets forwarded to the right people. @hurngchunlee and @paulspeijers: how would this type of discussion with users go? JIRA would not work, right? Would Jo have to file a topdesk ticket? Or would he have to go through multiple layers (Miriam as data steward -> Hong -> JIRA)?

JoKeyser commented 6 years ago

@robertoostenveld thanks a lot for your assessment and clarifications. Indeed, I misunderstood the policy and thought that continuous updating was required. So for me personally, the issue got less 'severe' - I can live with manually uploading to the DR a handful of times.

An automated file update via GitLab would be useful in case the RDC is used to collaborate with other people. That would combine the 'speed of Git' control with sharing the data of a project.

Technically, if I understand correctly, GitLab's hook could trigger another privileged machine with access to the DR. Maybe it's better though to adapt (to) the Surfconext authentication.

Let me know if I should post this elsewhere.

hurngchunlee commented 6 years ago

@JoKeyser As far as I understand, the webhook requires a web service (i.e. a service that speaks HTTP). The "privileged" machine you mentioned should be able to accept an HTTP request from GitLab. When the request is received, it's up to the privileged server to handle it. In your use case, the privileged machine will have to:

  1. authenticate the user to the DR
  2. check out the code from GitLab
  3. upload the code to an RDC collection.

The point I want to make is that you will need to develop a service (feature) for that specific purpose, even if you have the privilege to retrieve the one-time password for any user in the DR. Furthermore, if the service should operate for multiple users, it also needs to maintain the mapping between GitLab accounts and DR accounts, so that the service knows on whose behalf it should upload data into an RDC collection.
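To illustrate the moving parts, here is a minimal Python sketch of the request-handling side of such a service. It checks the secret token that GitLab sends in the `X-Gitlab-Token` header, reads a few fields from the push event, and looks up the DR account in a mapping table. The account names, the collection identifier, `ACCOUNT_MAP`, `HOOK_SECRET`, and `plan_sync` are hypothetical, and the three steps listed above (authenticating to the DR, checking out, uploading) are deliberately left out, since no DR interface for them exists today.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping from GitLab usernames to DR accounts and collections.
ACCOUNT_MAP = {
    "jokeyser": {"dr_user": "dr-user-placeholder", "collection": "rdc-placeholder"},
}
HOOK_SECRET = "change-me"  # hypothetical; must match the token set in GitLab


def plan_sync(headers, body, secret, account_map=ACCOUNT_MAP):
    """Validate a GitLab push webhook; return (http_status, result_dict)."""
    token = headers.get("X-Gitlab-Token") or ""
    if not hmac.compare_digest(token.encode("utf-8"), secret.encode("utf-8")):
        return 403, {"error": "invalid token"}
    try:
        event = json.loads(body)
    except ValueError:
        return 400, {"error": "body is not valid JSON"}
    if not isinstance(event, dict) or event.get("object_kind") != "push":
        return 400, {"error": "not a push event"}
    account = account_map.get(event.get("user_username"))
    if account is None:
        return 404, {"error": "no DR account mapped to this GitLab user"}
    # What the (not yet existing) upload step would need to know.
    return 200, {
        "dr_user": account["dr_user"],
        "collection": account["collection"],
        "repo_url": (event.get("project") or {}).get("git_http_url"),
        "commit": event.get("checkout_sha"),
    }


class HookHandler(BaseHTTPRequestHandler):
    """HTTP wrapper: answers GitLab's POST with the outcome of plan_sync."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, result = plan_sync(self.headers, self.rfile.read(length), HOOK_SECRET)
        payload = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

The handler can be served with `http.server.HTTPServer` for experimenting. Keeping the validation in `plan_sync`, separate from the HTTP wrapper, means the account mapping could later be swapped for a directory lookup without touching the rest.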

Currently there is no such feature in the DR system, but I think it is an interesting request to follow up on. So I think it is fine to send the request to the data support helpdesk: datasupport@donders.ru.nl

robertoostenveld commented 6 years ago

@JoKeyser how did you come up with the idea that continuous updating would be required? That is an interesting remark which relates to a recent discussion with @hadrianswall (Eric Maris).

If you want to share/publish code, I suggest you look at Zenodo, which has very good GitHub integration. You can link your RDC to the persistent identifier of the Zenodo record (in the DR: "edit" -> "associated publications").

JoKeyser commented 6 years ago

@hurngchunlee yes that's what I have in mind. I think/hope it would be relatively straight-forward to glue it together like that. I was hoping the mapping from GitLab-to-DR accounts could be offloaded, maybe through ldap... but I have no real idea about the interfaces involved. Should I send a fresh summary to the support helpdesk, including a link to this discussion here?

@robertoostenveld I guess I assumed that the continuous updating was required because a) I'm primed by git, b) I think one aspect of the DR is to increase accountability/'checkability' in case of suspected fraud, and c) because of the emphasis on process in the DCC's SOP, e.g. "[an RDC] documents the process via which data are converted into published results." Thanks for the hint about zenodo, I'll check it out.

paulspeijers commented 6 years ago

Hi @robertoostenveld, since this would be a new feature we should assess whether to support it, and if so, how that should work. I think it should start by the product owners giving it the right priority on the backlog, and then we can refine it as usual. Then it would probably be good to invite @JoKeyser to the session.

robertoostenveld commented 6 years ago

I guess that means that the responsible data steward would put it on the backlog. In this case it would be the DCC data steward, i.e. Miriam. I cannot include her on this thread, since she seems not to be on github (at least not in this organization).

@JoKeyser please send this thread (e.g. as a pdf or ascii dump) to Miriam and ask her to put it on the "backlog". If she does not know how to do that, she should contact @paulspeijers for instructions and setup.

JoKeyser commented 6 years ago

Okay great that this is gonna be assessed. @robertoostenveld Okay I've forwarded the request to Miriam, thanks for the clarification who to ask. @paulspeijers I'm glad to help if I can.