Add DataTypes - Githubissues

madprime commented 5 years ago

Description

Currently management of data authorizations is performed according to "data source" - that is, the project that produced the data. This is limiting! We believe some representation of "type" of data, and enabling members and projects to manage permissions according to this, would improve usability of the platform.

Some examples of why we need "data types"

Granular permission: Project A generates survey + sensitive data (e.g. genetic) and Project B just wants access to one "type", not both.
Shared permission: Project A generates data of type X. Project B also generates data of type X. Project C would like to get data of type X -- but the only way to do this currently is to ask for each project (Project A and Project B).
Future-friendly authorization: Project D adds data of type X. In the current model, Project C now needs to add "Project D" to its list of authorizations. Furthermore, members need to re-authorize Project C to grant that access.

Specs

We imagine that a "DataType" could exist that has the following features:

unique label/identifier
description
parent DataType (can be null)

We won't support multiple inheritance.
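A minimal sketch of the record described above, assuming a plain Python dataclass rather than whatever model the platform ends up using. The class name, the `ancestors` and `path` helpers, and the example descriptions are hypothetical; only the three fields and the single-parent rule come from the spec.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class DataType:
    """Hypothetical stand-in for the proposed DataType record."""
    label: str                           # unique label/identifier
    description: str
    parent: Optional["DataType"] = None  # exactly one parent, or None for a root

    def ancestors(self) -> Iterator["DataType"]:
        """Yield each ancestor, from the direct parent up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def path(self) -> str:
        """Return a colon-separated path from the root down to this type."""
        labels = [self.label] + [a.label for a in self.ancestors()]
        return ":".join(reversed(labels))


# Illustrative types only; the descriptions are made up for the example.
genomic = DataType("genomic data", "Any genome-derived data")
twenty_three = DataType("23andMe data", "23andMe export", parent=genomic)
```

Because `parent` holds a single reference, multiple inheritance cannot be expressed, which matches the constraint above.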

Furthermore, the project itself is a final characterization of a data type. We might think of these as terminal nodes in some tree ontology, e.g. "genomic data:23andMe data:data from direct-sharing-128". But it might not be necessary to do more than consider an optional specification of "project source" alongside "data type" in authorizations.

Data source behavior

Registration of data types: Projects should register DataTypes during configuration; these will be reviewed as part of the approval process.
API enforcement: The API should require file uploads to have at least one DataType. They may have more than one. All DataTypes must match ones that were registered by the project.
Re-approval: Changing registration of DataTypes requires re-review.

The rationale for requiring re-approval is to reduce the chances of a project mistakenly using DataTypes in a manner which causes undesired data sharing (e.g. mislabeling GPS data as survey data, rendering it available to a project that requested survey data).
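The API enforcement rule could be checked along these lines. This is a sketch under assumptions: the function name and parameters are hypothetical, and DataTypes are represented by their labels only.

```python
def validate_upload_datatypes(requested, registered):
    """Return the upload's DataType labels as a set, or raise ValueError.

    requested:  labels supplied with the upload (hypothetical parameter)
    registered: labels the project registered and had approved
    """
    requested = set(requested)
    # Rule 1: every upload needs at least one DataType.
    if not requested:
        raise ValueError("An upload must specify at least one DataType.")
    # Rule 2: every DataType must be one the project registered.
    unregistered = requested - set(registered)
    if unregistered:
        raise ValueError(
            "DataTypes not registered by this project: "
            + ", ".join(sorted(unregistered))
        )
    return requested
```

For example, `validate_upload_datatypes(["survey data"], ["survey data", "GPS data"])` passes, while an empty list or an unregistered label raises `ValueError`.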

Authorization behavior

Projects may request one or more DataTypes. Each DataType request may optionally be specific to a certain project source. For example:

genetic data (any source)
survey data -- from "Harvard PGP" source

Project source is optional. Unlike the current behavior, authorization will never be according to project source alone.

Data with multiple types assigned is authorized if any of those types is requested.

As is currently done, Member authorization is for what was requested at the time of joining a project. Re-authorization is needed for new requests.
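The matching rule in this section (a requested type, optionally restricted to one project source) can be sketched as follows. The `Request` tuple and `is_authorized` function are hypothetical names, and the sketch ignores parent DataTypes. It implements the "any of those types" rule as written in this description; a later comment in this thread revisits that rule for files with multiple types.

```python
from typing import NamedTuple, Optional


class Request(NamedTuple):
    """One requested DataType, optionally tied to a project source."""
    datatype: str
    source: Optional[str] = None  # None means "any source"


def is_authorized(file_types, file_source, requests):
    """Return True if any request matches one of the file's types."""
    for req in requests:
        type_matches = req.datatype in file_types
        source_matches = req.source is None or req.source == file_source
        if type_matches and source_matches:
            return True
    return False


# The two example requests from the text above.
requests = [
    Request("genetic data"),                       # any source
    Request("survey data", source="Harvard PGP"),  # one source only
]
```

Note that `Request` always carries a DataType, so a request by project source alone cannot be expressed.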

mldulaney commented 5 years ago

So, I'm thinking we have a tree
it collapses down to some number of base types
this lowest level is editable only by OH staff
you click one, it expands
and, so on, through to the leaves
Each node after these base types gets a checkbox
if you click the checkbox, then two things happen:
First, a new leaf is created under that node just for the project (this is automatic, as we discussed)
secondly, that type gets assigned to the files
There will also be a link somewhere outside the tree
something like "Don't see your type? Create one!"
you get presented with the tree again, except without the checkboxes
you choose which node is the parent of your new type
You then fill out a form (contents of which tbd) to specify the type
Returns you to the initial tree if you want to have files of disparate types
once you leave the tree, you get asked if you do, if yes, then you get a new tree
if no, project is created.
so, one question
if a project does not intend to actually upload data
should that not be indicated by choosing 'explore and share'
and, if only explore and share is chosen, then adding data for the project is disabled?
if explore and share /and/ add data are both chosen, then, of course, data can be added
so, like, Kevin's project would be the latter (at the moment)
also, btw, back to the leafs; a leaf inherits all nodes upstream of it
madprime [12:17 PM]

First, a new leaf is created under that node just for the project

as I wrote up in the issue, I don't see that this needs to be captured in the DataType ontology. Every file is a combination of DataType and project source, and when requesting authorization, it can also optionally specify a project source for a given DataType. This produces the equivalent logic without mucking up the design of DataTypes.

secondly, that type gets assigned to the files

Does this view involve a file or files, somehow?
Mairi [12:19 PM]
Once a project sets their datatypes, then it gets applied to all (future only? apply retroactively?) the data_files that project 'owns'
or, if multiple separate data types are specified then, we should provide an api with the upload api that allows the project to choose which ones apply to which files, from the list that they selected on project creation
madprime [12:20 PM]

if only explore and share is chosen, then adding data for the project is disabled?

Projects should be allowed to add data without representing themselves as data sources in the "add data" list. But they should have DataTypes registered to be able to upload. That registration would require project re-review/approval if it wasn't initially present, but not member re-authorization.
Mairi [12:20 PM]
roger that
madprime [12:21 PM]
I expect the project to specify a DataType for each file when uploading the file, but we don't want to break existing projects... So maybe we also need a concept of default DataType.
Mairi [12:21 PM]
Well hrm
madprime [12:21 PM]
(But maybe that should only be used for existing projects, grandfathering them in this way.)
Mairi [12:22 PM]
Maybe when we go to turn this feature on, we should ping existing projects?
madprime [12:22 PM]

provide an api with the upload api

you mean update the upload API to expect/accept datatype for each file, right?
Mairi [12:23 PM]
correct
also, if we do a grandfather type, we should be the only ones allowed to set that
madprime [12:24 PM]
it seems simple to me.
I don't know that we should allow editing of datatypes in general much, I don't favor some sort of "nodes at this tier are staff only" but maybe a policy that adds a lot of friction for... maybe, updates after a datatype is being used by an approved project.
(is there some concrete story that gives rationale for having datatypes on one tier staff only and the others easily edited?)
Mairi [12:25 PM]
no
I just worry :slightly_smiling_face:
madprime [12:26 PM]
I think once something is "in use" it's dangerous to allow easy updates
Mairi [12:26 PM]
I seem to recall us discussing that the project approval process should include examining that datatypes were correctly setup
Yeah, this would be set at project creation, and then project coordinators would be unable to directly edit again; they'd have to go through staff and, probably get community consensus
madprime [12:27 PM]
I don't think I'm contradicting that... what would be set at project creation?
Mairi [12:27 PM]
the datatypes
madprime [12:27 PM]
no, I would allow editing until a project is approved?
Mairi [12:27 PM]
ah okay
madprime [12:29 PM]
most of the mess might be handling grandfathering stuff in ¯\_(ツ)_/¯
Mairi [12:29 PM]
yeah

madprime commented 5 years ago

Approach for grandfathering in past projects: Add a BooleanField, auto_add_datatypes, that only admin/staff can set. If this is true, the API accepts an upload without the data types specified and assigns that file all data types registered for the project.
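A rough sketch of how the upload path might apply that flag. `auto_add_datatypes` is the field named above; the function name and its other parameters are hypothetical, and the check that requested types are registered is left out to keep the example short.

```python
def resolve_upload_datatypes(requested, registered, auto_add_datatypes=False):
    """Decide which DataType labels an uploaded file receives.

    requested:          labels sent with the upload (may be empty or None)
    registered:         labels registered for the project
    auto_add_datatypes: staff-only flag for grandfathered projects
    """
    requested = set(requested or ())
    if requested:
        # The project named its types explicitly; use them as given.
        return requested
    if auto_add_datatypes:
        # Grandfathered project: assign every registered DataType.
        return set(registered)
    raise ValueError("An upload must specify at least one DataType.")
```

A grandfathered project that uploads with no types would get all of its registered types, while any other project uploading with no types would be rejected.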

madprime commented 5 years ago

Note regarding permissions and multiple datatypes: Some files may have more than one type of data within them. Our decision is that a datafile with more than one "datatype" is authorized for sharing if someone authorized all relevant datatypes (either directly or via a parent datatype).

Project creators may mistakenly expect access to files they won't receive -- because they failed to specify all relevant datatypes. An alternative approach would be to allow one type request to be sufficient, but we're concerned this leads to inadvertent authorization (e.g. authorizing "any demographic surveys" could authorize a file which has GPS data + demographic survey data bundled together). So instead, we would like to mitigate this risk in the design of the form a project uses to specify permissions, to help the project lead select these correctly.
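The decision in this comment (every datatype on a file must be authorized, directly or via a parent) might look like the sketch below. The function name and the `parents` mapping are hypothetical, and the parent relationship between "demographic survey" and "survey data" is invented for illustration.

```python
def is_authorized(file_types, authorized, parents):
    """Return True only if every type on the file is covered.

    file_types: DataType labels assigned to the file
    authorized: labels the member authorized for the project
    parents:    maps each label to its parent label, or None for a root
    """
    def covered(label):
        # Walk up the tree: a type counts if it or any ancestor is authorized.
        while label is not None:
            if label in authorized:
                return True
            label = parents.get(label)
        return False

    # A file with no types is never shared.
    return bool(file_types) and all(covered(t) for t in file_types)


# Hypothetical ontology for the example.
parents = {
    "survey data": None,
    "demographic survey": "survey data",
    "GPS data": None,
}
```

With this rule, authorizing only "survey data" does not release a file tagged with both "demographic survey" and "GPS data", which is the inadvertent sharing the comment aims to avoid.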

OpenHumans / open-humans

Add DataTypes #981
