MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

Integrate local file access and research environments with the existing `DataAccess` model #1927

Open kshalot opened 1 year ago

kshalot commented 1 year ago

(I used orgmode to generate this so now it looks more serious than it actually is)

Table of Contents

  1. Overview
    1. Use the existing DataAccess abstraction to get rid of implicit project access
      1. File downloads
      2. Research environments
      3. Events
      4. Defaults
      5. Cleanup
    2. Refactor

Overview

Use the existing DataAccess abstraction to get rid of implicit project access

Before refactoring the DataAccess functionality itself, the mechanism could be made generic (i.e. local file access and research environment access should be controlled by the DataAccess model). This would make access via any source fall under the same mechanism, so it's easier to reason about a potential refactor in the future.

File downloads

Research environments

Events

Defaults

Cleanup

Refactor

bemoody commented 1 year ago

Currently DataAccess does not incorporate Google Cloud Storage (project.models.GCP).

Is your thought that there is one DataAccess representing "ability to view/download source files", or one DataAccess per storage backend?

kshalot commented 1 year ago

Currently? One DataAccess per backend, per project. Regarding Google Cloud Storage, from what I saw there is a separate model for it (as you pointed out project.models.GCP) but it’s also controlled via the DataAccess model with platform 3 (gcp-bucket): https://github.com/MIT-LCP/physionet-build/blob/cfbfdf32ac1572f67550c0641cd8d7e312b3be13/physionet-django/project/templates/project/published_project_data_access.html#L7-L10 (project.gcp is the relation to project.models.GCP).

One architecture would be to have a similar relation exist for other data accesses as well. I could imagine a polymorphic relation linking DataAccess to models specific to the data source backends.

Also, naming the platform local is not really accurate now that I think about it, because we also have the GCS ProjectFiles backend that serves the files from a bucket. So direct or something along those lines is probably more accurate.

And stemming from that last point, there seems to be some redundancy here that ideally should be refactored - the GCS ProjectFiles backend and the project.models.GCP in theory both model accessing files via GCS. Their use cases are different, but at least the low-level mechanisms are very similar.

I was thinking that before considering this and other major refactors, we could at least make all data access go through the DataAccess model, to make it easier to reason about changes in the way access is managed in the platform. For example, materializing that access was granted to a user for a specific backend would make it explicit who has access to what and how, while also integrating with events much more easily (there are a lot of questions to answer in this approach though).

bemoody commented 1 year ago

Currently? One DataAccess per backend, per project. Regarding Google Cloud Storage, from what I saw there is a separate model for it (as you pointed out project.models.GCP) but it’s also controlled via the DataAccess model with platform 3 (gcp-bucket): https://github.com/MIT-LCP/physionet-build/blob/cfbfdf32ac1572f67550c0641cd8d7e312b3be13/physionet-django/project/templates/project/published_project_data_access.html#L7-L10

Ah, okay.

When we create a GCP bucket with "send files to GCP", this creates a GCP object. It does not create a DataAccess object.

There are no DataAccess objects (with platform=3) for open access projects. There are DataAccess objects for some (but not all) restricted projects. I guess that somebody (Tom? Felipe?) must be creating these by hand.

bemoody commented 1 year ago

Looking at this more, DataAccess is even more of a mess than I thought. "location" is sometimes a location and sometimes an access control group.

The names of the access groups are apparently chosen ad hoc, and there can be multiple access groups for the same bucket.

I agree with the principle of trying to unify access methods through a single class, but I think it might be better to define a new class rather than trying to retrofit DataAccess.

kshalot commented 1 year ago

Yes, and adding the new accesses will introduce a third case, where location is blank. I'm all for creating a new class - at first I thought it would be simpler to reason about the changes if we aligned everything and then refactored it, but that's backwards. After looking at it closer and reading your comments @bemoody, I'd say we should:

  1. Migrate the existing DataAccess usage to a new mechanism that we are happy with.
  2. Migrate local and research environments access to the new model.

There's no point in introducing more technical debt now by thinking of ways to make it all fall together. I see a couple of options:

Using a JSON column

PostgreSQL has great support for JSON, so the location column could simply be replaced by a metadata column of the json/jsonb type. The obvious disadvantage is that the overhead of managing a column like this is higher since there are no guarantees in regard to its schema. In our case, there is currently no need to use this column for search (although it's possible to index such a column in case we need to search for particular key/value pairs).
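
A minimal sketch of what that could look like, assuming Django's JSONField (the model and field names here are just for illustration, not an existing implementation):

```python
from django.db import models


class DataAccess(models.Model):
    # illustrative only: the overloaded `location` column replaced by a
    # free-form metadata blob (stored as jsonb on PostgreSQL)
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    platform = models.PositiveSmallIntegerField()
    metadata = models.JSONField(default=dict, blank=True)
    # e.g. {"bucket": "gs://some-bucket", "access_group": "group@example.org"}
```

If key/value lookups ever became necessary, a GIN index on the column could be added later.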

Making the relations explicit

One simple solution is to have explicit relations for each of the existing/future DataSources. For example:

[Diagram: Denormalized drawio (1)]

This can be a 1-to-many as well. Retrieving a project will be heavier since N tables will have to be joined instead of 1 (the original DataAccess). Doesn't seem like a huge issue though, in my opinion.
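
A rough sketch of the explicit-relation option under the same Django assumptions (all model names hypothetical):

```python
from django.db import models


class GCSBucketAccess(models.Model):
    # one row per project exposed through a GCS bucket
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    bucket_name = models.CharField(max_length=255)


class ResearchEnvironmentAccess(models.Model):
    # one row per project that can be opened in a research environment
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    environment_name = models.CharField(max_length=255)


# listing a project's sources then means querying each relation
# (or joining the N tables), e.g.:
# GCSBucketAccess.objects.filter(project=project)
# ResearchEnvironmentAccess.objects.filter(project=project)
```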

Polymorphic Data Source

This approach would be useful if this relation were many-to-many, which might not be the case any time soon, if ever.

[Diagram: Data Source Proposal drawio]

The idea is to create a DataSource entity that would:

  1. Have a composite primary key consisting of:
    • The primary key of the PublishedProject entity.
    • The primary key of the concrete DataSource entity (GCSBucket, ResearchEnvironment etc.), which could be autogenerated as coalesce(gcs_bucket_id, research_environment_id, ...)
  2. Have a column for each of the possible concrete data sources:
    • Enforce that one and only one of these columns is present, for example via a constraint: CHECK (num_nonnulls(gcs_bucket_id, research_environment_id, ...) = 1) (sketched below)
    • Adding a new type of supported data source = adding a new column to this table

The concrete DataSources would have separate tables and contain whatever information they need.
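
To make the "exactly one concrete source" rule concrete, here is a hedged sketch of how the constraint from point 2 could be expressed (hypothetical column names, assuming Django on PostgreSQL):

```python
from django.db import models


class DataSource(models.Model):
    # polymorphic join table: each row points at exactly one concrete source
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    gcs_bucket = models.ForeignKey('GCSBucket', null=True, blank=True, on_delete=models.CASCADE)
    research_environment = models.ForeignKey('ResearchEnvironment', null=True, blank=True, on_delete=models.CASCADE)

    class Meta:
        constraints = [
            # equivalent to CHECK (num_nonnulls(gcs_bucket_id, research_environment_id) = 1)
            models.CheckConstraint(
                name='datasource_exactly_one_source',
                check=(
                    models.Q(gcs_bucket__isnull=False, research_environment__isnull=True)
                    | models.Q(gcs_bucket__isnull=True, research_environment__isnull=False)
                ),
            ),
        ]
```

Adding a new backend would then mean adding a new nullable column and extending the constraint.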

alistairewj commented 1 year ago

In a separate discussion, it seems we settled on a fourth approach, where we have a DataSource entity with columns for each type of access.

Current access approaches

We currently have the following mechanisms for accessing file content. All of these assume that the user has the appropriate authorization for the data (which conceptually we would like to handle separately):

  1. Direct / "Local"
    • Data preparation: N/A - the files are organized during the publication of the project
    • Configuration: N/A
    • Request: N/A - user does not make an explicit request for this access mechanism.
    • Provision: (PhysioNet) files are displayed in the files pane, (HDN) this functionality is disabled by the allow_file_downloads flag
  2. GCP - BigQuery
    • Data preparation: A BigQuery dataset with the data must be created by an admin separately.
    • Configuration: A Google Admin group e-mail must be created with read permission to the aforementioned BigQuery dataset. On the management page of a project, this Google Admin Group e-mail is added.
    • Request: User adds their cloud identity separately in the profile page. User then triggers an API call against Google Admin to add their cloud account to the associated Google Admin Group.
    • Provision: Access is then implicit - users can authenticate with their google account and read data using the BigQuery service
  3. GCP - GCS Buckets
    • Data preparation: A Google Cloud Storage (GCS) bucket must be created with the data. There is an integration within project management to send files to GCP which in principle does this. I don't know the extent to which this works since we recently had a few issues.
    • Configuration: A Google Admin group e-mail must be created with read permission to the aforementioned GCS bucket. On the management page of a project, this Google Admin Group e-mail is added.
    • Request: User adds their GCP cloud identity separately in the profile page. User then triggers an API call against Google Admin to add their cloud account to the associated Google Admin Group.
    • Provision: Access is then implicit - users can authenticate with their google account and navigate to the data via GCS or use gsutil with the bucket name provided
  4. AWS - S3
    • Data preparation: Data must be uploaded to an s3 bucket on AWS separately.
    • Configuration: On the management page of the project, the s3 bucket is added. The permissions are added directly to the bucket (there is an upper limit on the number of users we can add this way, something between 1000 - 10,000 if I recall).
    • Request: User adds their AWS cloud identity separately to the profile page. User then triggers an API call against AWS to add their cloud account to the s3 bucket.
    • Provision: Access is then implicit - users can use the s3 CLI to download the data. The (only?) real example of this being used is to support an AWS Athena workflow outlined in this mimic-code tutorial and this associated AWS blog post.
  5. AWS - Open Data
    • Data preparation: Metadata is submitted to AWS describing the dataset.
    • Configuration: The s3 bucket is added.
    • Request: N/A (?)
    • Provision: N/A (?).
    • It's unclear whether the Open Data program needs to be separate from s3.
  6. Research Environment
    • Data preparation: N/A - the files are organized during the publication of the project
    • Configuration: N/A - the environment is configured during publication of the project
    • Request: N/A - user does not make an explicit request for this access mechanism.
    • Provision: User navigates to the Research Environments app

User stories

Tried to order these by how frequently they're used.

  1. Direct file access.
    • After getting appropriate approval, users can navigate the files on the website. Mostly though they will wget down the files using the supplied command. This has a few edge cases, particularly with hierarchical file structures (e.g. wget pulls a whole bunch of index.html files before getting any data; in the case of MIMIC-CXR it pulls ~300,000 index files, more detail in my comment in issue #796).
    • Alternatively users download a zip file generated from the management page.
  2. BigQuery
    • Almost entirely used by the MIMIC community to get access to the data on BigQuery. Around ~400 users signed up for this since the latest version of MIMIC-IV was released in January (~700 had already signed up before then). For MIMIC-III, which has been up for a few years, there's ~5000 users.
  3. GCS
    • Not sure of the extent to which this is used. We used to push downloads through GCS but stopped more recently. For MIMIC-CXR, we had around ~1000 users who signed up for the buckets.
  4. AWS
    • I think very few (if any) people use the AWS integration, mostly because it's awkward. Those who do use AWS probably download and copy the data to their own instance.

Suggested model

| Column | Type / Fields | Description |
|---|---|---|
| project_id | integer | Published project primary key |
| data_source_id | integer | Unique to the type of data access. |
| files_available | boolean | Whether the access mechanism can support listing / downloading of files. |
| access_mechanism | enum / string | Groups together separate services which use similar access mechanisms. Allowed values: 'google-group-email', 's3', 'research-environment' |
| email | string | For GCP access (potentially others if they do e-mail list based access?), this would store the email of the group. |
| uri | string | The URI for the data on the external service. For s3 this would be of the form s3://<bucket_name>, for gsutil this would be of the form gs://<bucket_name>. |

I think this works? Seems as though we can cover all the use cases with just an email and a uri attribute. Needs some refinement in the access_mechanism var.
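
Not part of the proposal itself, just a rough Django rendering of the table above to make it easier to poke at (data_source_id omitted, everything beyond the listed fields hypothetical):

```python
from django.db import models


class DataSource(models.Model):
    # sketch of the suggested model above
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    files_available = models.BooleanField(default=False)
    access_mechanism = models.CharField(
        max_length=32,
        choices=[
            ('google-group-email', 'Google Group Email'),
            ('s3', 'S3'),
            ('research-environment', 'Research Environment'),
        ],
    )
    email = models.EmailField(blank=True)  # e.g. the Google Admin group address
    uri = models.CharField(max_length=255, blank=True)  # e.g. s3://<bucket_name> or gs://<bucket_name>
```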

Other miscellaneous related issues

Might be worth raising issues for these separately after the initial refactor effort.

kshalot commented 1 year ago

@alistairewj Two quick questions:

What's the purpose of data_source_id? Is it just the primary key or am I missing something? If it's to distinguish between the sources, couldn't it be an enum?

Maybe that's more of a comment than a question, but I don't see the need for the files_available boolean ("Whether the access mechanism can support listing / downloading of files"), since this information is already embedded within the data source type (we know that research environments disallow this, that local files allow it, etc.). What's interesting to me is how GCS ties into this whole thing. Right now we have two implementations:

a) The GCP model that's used to simply allow access via a bucket. b) The GCSProjectFiles integration that mimics local file access.

So option b) simply looks like a), where the files are directly uploaded to the bucket when creating a project. So maybe there is no reason to draw a distinction between the two. In the case of Health Data Nexus, we could imagine a GCS DataSource that's simply disabled (since this is where the data lives, but is not available to the user). The bucket is then only used when using a research environment - they are very tightly coupled together. For example, what would it mean for a project to have research_environment DataSource, but not the GCS one? Unless we make research environments source-agnostic (so they can, for example, wget the data), they remain a GCP-specific feature, so maybe the model should reflect that for now.

This is very brainstormy - I'll gather more thoughts and write up something more thought-through as well, but I figured I'd post this for the sake of discussion.

kshalot commented 1 year ago

I guess I'm thinking about how to revamp data access i.e. granting specific users access to specific datasets. We briefly touched on somehow materializing that.

Because it seems that we have data sources (GCS, S3, BigQuery, local, etc.) and data access mechanisms (direct download, research environment). So in an ideal world, we could define data sources for a dataset and manage access independently. For example, HDN would create an "access via research environment" grant to a source for the participants of an event.

alistairewj commented 1 year ago

@alistairewj Two quick questions:

What's the purpose of data_source_id? Is it just the primary key or am I missing something? If it's to distinguish between the sources, couldn't it be an enum?

Yeah I think it can be an enum, I thought about putting it as an enum initially.

Maybe that's more of a comment than a question, but I don't see the need for the files_available boolean ("Whether the access mechanism can support listing / downloading of files"), since this information is already embedded within the data source type (we know that research environments disallow this, that local files allow it, etc.).

I tried to recapitulate the discussion with this field (enable checking if a particular data source supports a type of service like serving files), but likely I didn't follow the exact nuance @tompollard / you (@kshalot) had in mind.

What's interesting to me is how GCS ties into this whole thing. Right now we have two implementations:

a) The GCP model that's used to simply allow access via a bucket. b) The GCSProjectFiles integration that mimics local file access.

So option b) simply looks like a), where the files are directly uploaded to the bucket when creating a project. So maybe there is no reason to draw a distinction between the two. In the case of Health Data Nexus, we could imagine a GCS DataSource that's simply disabled (since this is where the data lives, but is not available to the user). The bucket is then only used when using a research environment - they are very tightly coupled together. For example, what would it mean for a project to have research_environment DataSource, but not the GCS one? Unless we make research environments source-agnostic (so they can, for example, wget the data), they remain a GCP-specific feature, so maybe the model should reflect that for now.

So then to be very explicit about this:

DataLocation model

| Column | Type / Fields | Description |
|---|---|---|
| data_location_id | integer (PK) | Uniquely identifies the project / data location pair |
| project_id | integer | Published project primary key |
| data_location | enum / string | The location of the data. Allowed values: 'direct', 'gcs', 'bigquery', 's3' |

DataProvision model

| Column | Type / Fields | Description |
|---|---|---|
| data_location_id | integer | Uniquely identifies the project / data location pair, links to DataLocation model |
| access_control_group | string | The access control group for the data. For GCS/BigQuery access, this would store the email of the Google Admin group. |
| uri | string | The URI for the data. For s3 this would be of the form s3://<bucket_name>, for gsutil this would be of the form gs://<bucket_name>. |

Putting this up for discussion today/tomorrow. Not 100% sure about it!

  1. It's nice to separate how the data will be accessed (wget, gsutil, s3, bigquery, research-env) from where the data are stored (direct, gcs)
  2. I suppose the project would check data provisioning only, to determine how to list out the files
  3. There would be some odd constraints to enforce, e.g. only allowed to create a DataProvision with research-env if the DataSource gcs is used.
  4. It seems more complicated than it needs to be, though I can't put my finger on why right now.
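
A minimal sketch of the two-model split described in the tables above, assuming Django (field names taken from the tables, everything else hypothetical):

```python
from django.db import models


class DataLocation(models.Model):
    # where the data for a published project are stored
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    data_location = models.CharField(
        max_length=16,
        choices=[('direct', 'Direct'), ('gcs', 'GCS'),
                 ('bigquery', 'BigQuery'), ('s3', 'S3')],
    )


class DataProvision(models.Model):
    # how data at a given location are provisioned to users
    data_location = models.ForeignKey(DataLocation, on_delete=models.CASCADE)
    access_control_group = models.CharField(max_length=255, blank=True)  # e.g. Google Admin group e-mail
    uri = models.CharField(max_length=255, blank=True)  # e.g. s3://<bucket_name> or gs://<bucket_name>
```
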
kshalot commented 1 year ago

There would be some odd constraints to enforce, e.g. only allowed to create a DataProvision with research-env if the DataSource gcs is used.

Maybe splitting this immediately is not the way to go. This would somewhat shift the way we need to think about access and (I think, just guessing) would require a significant refactor to make it usable in the codebase (e.g. aligning the GCP model with GCSProjectFiles). This could be done in two steps:

  1. Make everything fall under the umbrella of the new DataSource.
  2. Split DataSource into DataLocation and DataProvision.

Separating it is a neat idea, but it seems that source and means of access are not the only things that need to be decoupled, because Events will grant access to a specific DataSource/DataProvision (for example to a research environment or bucket), so they introduce a new mechanism of authorization outside the regular has_access/accessible_by defined on PublishedProject. We also noted some other issues with this model, like the fact that we had to add a separate manager to make it work etc. In general, the authorization logic is very tightly coupled with the code that uses it. Since this discussion is somewhat related, I'll dump a couple of thoughts below.

Authorization

Just for reference, the current authorization mechanisms are:

Credentialing

Contributor Review Applications

Training

Data Use Agreement

Events

Since they can grant access independently (e.g. events), the logic either has to be fully aware of how authorization is done (so we end up with something like if is_authorized or is_participant_of_relevant_event everywhere), or we'll need a separate function for each data source, like:

```python
def is_authorized_to_files(self): ...

def is_authorized_to_research_environment(self): ...

def is_authorized_to_s3_bucket(self): ...

...
```

It's tightly coupled. Changing the authorization implies changes all over the place.

A second issue, one that was mentioned some weeks ago, is the fact that giving users permission to use an alternate data source (e.g. adding them to a GCP group so they can use a bucket) is not persisted anywhere.

Alternate Solution

I was toying with the idea of creating something like an AccessGrant model which is the authority on whether a specific user has access to the dataset (via a data source/location). This adds a lot of overhead on managing this table, but the entire application can just refer to it as the source of truth about access. A quick visualization:

[Diagram: Access Grant drawio]

Or to put it simply - users have multiple grants, each grant gives access to a specific DataSource that relates to a specific PublishedProject. So instead of checking credentialing etc. each time the user accesses the project, the platform would simply check whether a grant exists.

The bad part is managing this - those grants would have to be created/destroyed whenever the conditions change (e.g. the user is not credentialed anymore, training expires etc.) which would be a headache. So this would shift the pain-point from checking authorization to managing who's authorized.

On the other hand, we could probably make do with a code-only refactor here as well - maybe extracting this logic to a separate module/app to encapsulate it so it's not in pieces all over the codebase.
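
Purely illustrative, but an AccessGrant along those lines might look something like this (assuming Django; the DataSource reference and all field names are hypothetical):

```python
from django.conf import settings
from django.db import models


class AccessGrant(models.Model):
    # single source of truth: "this user may use this data source of this project"
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    data_source = models.ForeignKey('DataSource', on_delete=models.CASCADE)
    granted_at = models.DateTimeField(auto_now_add=True)
    revoked_at = models.DateTimeField(null=True, blank=True)


def has_grant(user, data_source):
    # the rest of the platform checks for an active grant instead of
    # re-evaluating credentialing/training/DUA/events on every request
    return AccessGrant.objects.filter(
        user=user, data_source=data_source, revoked_at__isnull=True,
    ).exists()
```

The hard part, as noted above, is keeping this table in sync when credentialing or training status changes.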

alistairewj commented 1 year ago

For @tompollard's benefit, we decided:

  1. We'll refactor the authorization into a separate app first (no logic changes)
  2. Most of us like the idea of the AccessGrant if we can minimize the footguns, so we'll discuss that further after the refactor

@amitupreti is taking the lead on the refactor of authorization into a separate app (probably named authorization)

tompollard commented 1 year ago

Django 4.2 adds new functionality that "allows configuring multiple custom file storage backends. It also controls storage engines for managing files (the "default" key) and static files (the "staticfiles" key).".

For the release notes, see: https://docs.djangoproject.com/en/4.2/releases/4.2/#custom-file-storages. Seems relevant to our discussions around file storage, cloud, etc, so posting this here.
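
For reference, a minimal example of the new setting (these are just the Django defaults, not a proposal for how PhysioNet should configure it):

```python
# settings.py, Django >= 4.2
STORAGES = {
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}
```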

alistairewj commented 1 year ago

The scope of this issue can obviously expand greatly, so I've tried to simplify into some initial steps.

  1. Refactor event access into the authorization module (#2016)
  2. Create a new model to capture granting usage of data on various services
  3. Decide how that latter model should also incorporate local/direct file access

The main discussion comes around the second point. Requirements are:

So an initial proposal could be something like this:

```python
from django.conf import settings
from django.db import models

from project.models import PublishedProject


# Django's TextChoices gives us .choices for the CharFields below
class ProjectAccessStatus(models.TextChoices):
    GRANTED = 'Granted'
    EXPIRED = 'Expired'
    # removed is used if in a daily clean-up/check with
    # the external services, the user is no longer listed in
    # the external access list, i.e. maybe someone manually
    # removed them from the google group.
    REMOVED = 'Removed'


class ProjectAccessMechanism(models.TextChoices):
    DIRECT = 'Direct'
    GOOGLE_GROUP = 'Google Group Email'
    S3_POLICY = 'S3'
    RESEARCH_ENVIRONMENT = 'Research Environment'


class ProjectDataServices(models.Model):
    # implicitly, the lack of an entry in the ProjectDataServices
    # table indicates no access for any external services.
    user = models.ManyToManyField(settings.AUTH_USER_MODEL)
    project = models.ForeignKey(PublishedProject, on_delete=models.CASCADE)
    # general access related fields
    access_modified_date = models.DateTimeField(auto_now_add=True)
    access_status = models.CharField(
        max_length=32, choices=ProjectAccessStatus.choices, editable=False)
    # which type of access does this object indicate was granted
    access_mechanism = models.CharField(
        max_length=32, choices=ProjectAccessMechanism.choices, editable=False)
    # access specific attributes like URI, e-mail etc ...
```

For the access specific attributes, we could include a generic foreign key to external models like GCSDataService which have the service specific attributes. I think this approach is nice as we have to have a custom ResearchDataService for HDN, and it cleanly separates it from the core model. But... it is a generic relation... The alternative is nullable foreign keys, but adding new services implies a change in this original model. This is in effect the polymorphic suggestion Karol had earlier with a bit of a twist.
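
A hedged sketch of the generic-relation variant (the GCSDataService / ResearchDataService models are hypothetical; this only shows the contenttypes wiring):

```python
from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models


class ProjectDataServices(models.Model):
    # ... general fields from the sketch above ...

    # generic link to a service-specific model (e.g. a GCSDataService or a
    # ResearchDataService for HDN) holding the URI, group e-mail, etc.
    service_type = models.ForeignKey(ContentType, null=True, blank=True, on_delete=models.SET_NULL)
    service_id = models.PositiveIntegerField(null=True, blank=True)
    service = GenericForeignKey('service_type', 'service_id')
```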

We'd want to think about how to incorporate event access into this services model, if at all possible, or whether we should simply maintain a separate EventAccess model.

Finally, the third point is how to handle direct data access. Currently, we use the allow_file_downloads workaround to disable direct file access on HDN. This works, but it is manually configured upon project submission by the author of the project, which is confusing and prone to error. I'm hopeful we can deal with that after finalizing the above model, but I'm mentioning it here in case it raises some relevant questions.