UtrechtUniversity / yoda

A system for reliable, long-term storing and archiving large amounts of research data during all stages of a study.
https://utrechtuniversity.github.io/yoda/
GNU General Public License v3.0

[FEATURE] Feeding Yoda through an 'external website' #352

Closed Jos-London closed 2 months ago

Jos-London commented 7 months ago

Is your feature request related to a problem? Please describe.

Use case: Researchers often use an external website to collect data, for instance through questionnaires. This data should be uploaded automatically from the website to Yoda on a regular basis. The data is then deleted from the website environment, so that research data is kept within the secure Yoda environment as much as possible.

Describe the solution you'd like

The common solution is to write data into Yoda (automatically) from the website using a so-called 'service account'. However, this also introduces a security risk: the service account created for this purpose can be compromised (through the website) or misused (by the site developer). Moreover, the service account effectively has all rights: read, write, delete, and so on.

A possible solution to mitigate the risk is to give the service account write permissions only to a specific folder.

Solutions may also be possible with a pull mechanism from Yoda, but this has not been further investigated.

Describe alternatives you've considered

...

Additional context

This type of functionality is widely requested, and the situation arises in many research settings. The use case can in fact apply to any external environment that wants to push data to Yoda.

Danny-dK commented 7 months ago

I know far too little about this, but doesn't this depend on the software / platform chosen to collect the data? For questionnaires (for example MS Forms, LimeSurvey, Qualtrics), the chosen platform should adhere to privacy and security requirements that limit what other platforms and software can do and extract. The person responsible for the data can access the platform, make an export of the data, and upload it wherever needed. If the chosen platform can be accessed from the command line (through an API), users should be able to write a script that securely accesses the data on the questionnaire platform with their own credentials (without hardcoding them), exports the data, and then connects to Yoda using GoCommands or iCommands to upload it (both steps could be done from Python or R). This was done regularly in the YOUth project using LimeSurvey: extract the data through its API, quality-check the data against set scripts and criteria, then upload the data to Yoda, all with proper and safe authentication of users.
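As a rough illustration of the check-then-upload part of this workflow, here is a minimal Python sketch. It assumes the platform export is a CSV file, that the iCommands are installed, and that the user has already authenticated with `iinit`. The column names and the collection path are hypothetical; the actual YOUth scripts are not shown in this thread and may look quite different.

```python
import csv
import subprocess
from pathlib import Path

# Hypothetical required columns; replace with the criteria of your own study.
REQUIRED_COLUMNS = ("participant_id", "submitted_at")


def check_export(csv_path, required=REQUIRED_COLUMNS):
    """Return a list of problems found in an exported CSV; empty means OK."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = sorted(set(required) - set(reader.fieldnames or []))
        if missing:
            return [f"missing columns: {missing}"]
        for row_no, row in enumerate(reader, start=1):
            for column in required:
                if not (row[column] or "").strip():
                    problems.append(f"row {row_no}: empty {column}")
    return problems


def build_iput_command(local_path, collection):
    """Build an iput call; -K asks iRODS to verify the checksum after upload."""
    target = f"{collection.rstrip('/')}/{Path(local_path).name}"
    return ["iput", "-K", str(local_path), target]


def upload_if_clean(csv_path, collection):
    """Upload the export only when the quality check finds no problems."""
    problems = check_export(csv_path)
    if problems:
        return problems
    # Runs under the credentials of the user who ran iinit, so the upload
    # stays traceable to an individual.
    subprocess.run(build_iput_command(csv_path, collection), check=True)
    return []
```

Calling `upload_if_clean("export.csv", "/exampleZone/home/research-example/surveys")` would either return the list of problems or run `iput` and return an empty list. The zone and group names here are placeholders.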

My concerns would be which types of data this should be made possible for (what data classifications / sensitivity) and who would check such things (so that it does not get misused just because it might be an easier solution). Also, I'm not particularly in favour of access to Yoda that pushes data without authentication (couldn't that pushing be hijacked?). The intent of Yoda was that actions can always be traced back to individuals, which is not feasible with service accounts.

erikvdbergh commented 7 months ago

We would also be interested in similar functionality, specifically for eLabJournal. Right now a lot of research data ends up sticking around in eLabJournal, even though it is not a FAIR data repository at all. This data is therefore at risk of getting lost. I see the desired functionality as similar to what Jos describes: an eLabJournal experiment can be added to Yoda to be 'watched', and if data is deposited to or changed in the repository, the corresponding Yoda folder is updated. Essentially the workflow Danny describes, but instead of doing it yourself (which most users can't / won't do) it is integrated into Yoda.

I agree that classification can be an issue, but I would say it is the responsibility of the user to maintain this across the various platforms. Ideally it would be read from the originating application, but this is likely to be technically difficult due to differing security models between applications.

This functionality would be a great benefit to researchers, as there will always be large platforms outside of Yoda that hold research data, which are often not FAIR enabling at all but still crucial in the research process.

Danny-dK commented 7 months ago

I would still be hesitant about automated updating of data in Yoda, primarily because you are then using two systems to store the data, which is not energy efficient. I would rather see a prompt / button that lets the user send data to a specific research folder at key moments they choose themselves, and only after providing proper credentials. I'm also still not sure about prolonged 'watching' and updating of data without the user providing credentials.

Sending something to Yoda from a different application will still not really make it FAIR. Even in Yoda, FAIRness depends on whether the user takes the time to add metadata, whether they submit to the vault, and whether they publish data or metadata. Given that you say "but instead of doing it yourself (which most users can't / won't do)", I don't see why users would be motivated to do the extra tasks in Yoda when they are not motivated to make a simple export of eLabJournal data, upload it to Yoda, and do the extra Yoda tasks.

One option I could imagine is the single-stop deposit module that exists (but is not active in general Yoda), in which users can deposit data directly into a specified Yoda vault. Perhaps the metadata form needs to be filled in only once, and then each time a button is pressed the user can submit that data directly to the vault without having to enter new metadata every time. That will rack up vault storage space though, as vault entries cannot be overwritten by users.

peer35 commented 6 months ago

You can already write data to Yoda automatically using iRODS: just create a custom script for your application, using the iRODS Python client, that puts the data in Yoda. To authenticate, create an iRODS user that only has write access to a single subfolder in your project folder. The script should run as a separate cron job on a server and not be called directly from a web service, which further limits the potential for hacking.
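A minimal sketch of such a script, using python-irodsclient, could look like the following. It assumes the restricted account has been set up on the server with `iinit`, so that its connection settings and credentials live in the iRODS environment files rather than in the script. The export directory and the collection path are hypothetical, and a real Yoda instance may need additional SSL or authentication settings passed to the session.

```python
import os
from pathlib import Path

# Hypothetical locations; adjust to your own server and project folder.
EXPORT_DIR = Path("/srv/survey-exports")
TARGET_COLLECTION = "/exampleZone/home/research-example/incoming"


def pending_files(export_dir):
    """Return regular files waiting in the export directory, sorted by name."""
    return sorted(p for p in Path(export_dir).iterdir() if p.is_file())


def logical_path(collection, local_file):
    """Map a local file to its destination path inside the iRODS collection."""
    return f"{collection.rstrip('/')}/{Path(local_file).name}"


def push_exports(export_dir=EXPORT_DIR, collection=TARGET_COLLECTION):
    """Upload every pending file and remove the local copy afterwards."""
    # Imported here so the helpers above work without python-irodsclient.
    from irods.session import iRODSSession

    env_file = os.path.expanduser("~/.irods/irods_environment.json")
    uploaded = []
    # The session reads host, zone, user and stored password from the
    # restricted account's iRODS environment, not from this script.
    with iRODSSession(irods_env_file=env_file) as session:
        for local_file in pending_files(export_dir):
            destination = logical_path(collection, local_file)
            session.data_objects.put(str(local_file), destination)
            # Only reached if put() did not raise; consider verifying a
            # checksum before deleting anything in a real deployment.
            local_file.unlink()
            uploaded.append(destination)
    return uploaded
```

A cron job on the server would then call `push_exports()` on a schedule. Because the web application only writes files into the export directory, it never holds the iRODS credentials itself.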

Your IT research support department should offer support to researchers that want to implement such a workflow.

I struggle to see what functionality would need to be added to Yoda to facilitate this further, since the workflow will always have to be custom-made for each application or measurement instrument. Maybe some sort of "ingest space" accessible via WebDAV, so a researcher would not have to use iRODS scripting? For the moment I'm personally fine with helping researchers at the VU who need automatic uploading to create a script.
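To give an idea of what pushing to such a WebDAV "ingest space" might involve, here is a sketch using only the Python standard library. A WebDAV upload is an HTTP PUT, so no iRODS client is needed. The base URL, folder name and use of basic authentication are assumptions for illustration; no such ingest space is described in this thread as existing.

```python
import base64
import urllib.request
from pathlib import Path
from urllib.parse import quote


def build_webdav_put(base_url, remote_folder, local_file, user, password):
    """Prepare an HTTP PUT request that uploads one file to a WebDAV folder."""
    local_file = Path(local_file)
    folder = quote(remote_folder.strip("/"))  # keeps '/' between path parts
    url = f"{base_url.rstrip('/')}/{folder}/{quote(local_file.name)}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    # read_bytes() loads the whole file into memory; fine for small exports.
    request = urllib.request.Request(url, data=local_file.read_bytes(), method="PUT")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/octet-stream")
    return request


def send(request):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

The password would ideally come from an environment variable or a secrets store rather than from the script, and the request should only ever be sent over HTTPS, since basic authentication is otherwise readable in transit.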

RobvanSchip commented 2 months ago

As @peer35 mentioned, you can already automatically write data to Yoda using iRODS.