MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

Integrate local file access and research environments with the existing `DataAccess` model #1927

Open kshalot opened 1 year ago

kshalot commented 1 year ago

(I used orgmode to generate this so now it looks more serious than it actually is)

Table of Contents

  1. Overview
    1. Use the existing DataAccess abstraction to get rid of implicit project access
      1. File downloads
      2. Research environments
      3. Events
      4. Defaults
      5. Cleanup
    2. Refactor

Overview

Use the existing DataAccess abstraction to get rid of implicit project access

Before refactoring the DataAccess functionality itself, the mechanism could be made generic (i.e. local file access and research environment access should be controlled by the DataAccess model). This would make access via any source fall under the same mechanism, so it's easier to reason about a potential refactor in the future.

File downloads

Research environments

Events

Defaults

Cleanup

Refactor

bemoody commented 1 year ago

Currently DataAccess does not incorporate Google Cloud Storage (project.models.GCP).

Is your thought that there is one DataAccess representing "ability to view/download source files", or one DataAccess per storage backend?

kshalot commented 1 year ago

Currently? One DataAccess per backend, per project. Regarding Google Cloud Storage, from what I saw there is a separate model for it (as you pointed out project.models.GCP) but it’s also controlled via the DataAccess model with platform 3 (gcp-bucket): https://github.com/MIT-LCP/physionet-build/blob/cfbfdf32ac1572f67550c0641cd8d7e312b3be13/physionet-django/project/templates/project/published_project_data_access.html#L7-L10 (project.gcp is the relation to project.models.GCP).

One architecture would be to have a similar relation exist for other data accesses as well. I could imagine a polymorphic relation linking DataAccess to models specific to the data source backends.

Also, naming the platform local is not really accurate now that I think about it, because we also have the GCS ProjectFiles backend that serves the files from a bucket. So direct or something along those lines is probably more accurate.

And stemming from that last point, there seems to be some redundancy here that ideally should be refactored - the GCS ProjectFiles backend and the project.models.GCP in theory both model accessing files via GCS. Their use cases are different, but at least the low-level mechanisms are very similar.

I was thinking that before considering this and other major refactors, we could at least make all data access go through the DataAccess model, to make it easier to reason about changes in the way access is managed in the platform. For example, materializing that access was granted to a user for a specific backend would make it explicit who has access to what and how, while also integrating with events much more easily (there are a lot of questions to answer in this approach though).

bemoody commented 1 year ago

Currently? One DataAccess per backend, per project. Regarding Google Cloud Storage, from what I saw there is a separate model for it (as you pointed out project.models.GCP) but it’s also controlled via the DataAccess model with platform 3 (gcp-bucket): https://github.com/MIT-LCP/physionet-build/blob/cfbfdf32ac1572f67550c0641cd8d7e312b3be13/physionet-django/project/templates/project/published_project_data_access.html#L7-L10

Ah, okay.

When we create a GCP bucket with "send files to GCP", this creates a GCP object. It does not create a DataAccess object.

There are no DataAccess objects (with platform=3) for open access projects. There are DataAccess objects for some (but not all) restricted projects. I guess that somebody (Tom? Felipe?) must be creating these by hand.

bemoody commented 1 year ago

Looking at this more, DataAccess is even more of a mess than I thought. "location" is sometimes a location and sometimes an access control group.

The names of the access groups are apparently chosen ad hoc, and there can be multiple access groups for the same bucket.

I agree with the principle of trying to unify access methods through a single class, but I think it might be better to define a new class rather than trying to retrofit DataAccess.

kshalot commented 1 year ago

Yes, and adding the new accesses will introduce a third case, where location is blank. I'm all for creating a new class - at first I thought it would be simpler to reason about the changes if we aligned everything and then refactored it, but that's backwards. After looking at it closer and reading your comments @bemoody, I'd say we should:

  1. Migrate the existing DataAccess usage to a new mechanism that we are happy with.
  2. Migrate local and research environments access to the new model.

There's no point in introducing more technical debt now by thinking of ways to make it all fall together. I see a couple of options:

Using a JSON column

PostgreSQL has great support for JSON, so the location column could simply be replaced by a metadata column of the json/jsonb type. The obvious disadvantage is that the overhead of managing a column like this is higher since there are no guarantees in regard to its schema. In our case, there is currently no need to use this column for search (although it's possible to index such a column in case we need to search for particular key/value pairs).
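
A minimal sketch of what that could look like, assuming Django's JSONField (the model and field names here are just for illustration, not an existing implementation):

```python
from django.db import models


class DataAccess(models.Model):
    # illustrative only: the overloaded `location` column replaced by a
    # free-form metadata blob (stored as jsonb on PostgreSQL)
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    platform = models.PositiveSmallIntegerField()
    metadata = models.JSONField(default=dict, blank=True)
    # e.g. {"bucket": "gs://some-bucket", "access_group": "group@example.org"}
```

If key/value lookups ever became necessary, a GIN index on the column could be added later.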

Making the relations explicit

One simple solution is to have explicit relations for each of the existing/future DataSources. For example:

[Diagram: Denormalized drawio (1)]

This can be a 1-to-many as well. Retrieving a project will be heavier since N tables will have to be joined instead of 1 (the original DataAccess). Doesn't seem like a huge issue though, in my opinion.
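
A rough sketch of the explicit-relation option under the same Django assumptions (all model names hypothetical):

```python
from django.db import models


class GCSBucketAccess(models.Model):
    # one row per project exposed through a GCS bucket
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    bucket_name = models.CharField(max_length=255)


class ResearchEnvironmentAccess(models.Model):
    # one row per project that can be opened in a research environment
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    environment_name = models.CharField(max_length=255)


# listing a project's sources then means querying each relation
# (or joining the N tables), e.g.:
# GCSBucketAccess.objects.filter(project=project)
# ResearchEnvironmentAccess.objects.filter(project=project)
```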

Polymorphic Data Source

This approach would be useful if this relation were many-to-many, which might not be the case any time soon, if ever.

[Diagram: Data Source Proposal drawio]

The idea is to create a DataSource entity that would:

  1. Have a composite primary key consisting of:
    • The primary key of the PublishedProject entity.
    • The primary key of the concrete DataSource entity (GCSBucket, ResearchEnvironment etc.), which could be autogenerated as coalesce(gcs_bucket_id, research_environment_id, ...)
  2. Have a column for each of the possible concrete data sources:
    • Enforce that one and only one of these columns is present, for example via a constraint: CHECK (num_nonnulls(gcs_bucket_id, research_environment_id, ...) = 1) (sketched below)
    • Adding a new type of supported data source = adding a new column to this table

The concrete DataSources would have separate tables and contain whatever information they need.
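
To make the "exactly one concrete source" rule concrete, here is a hedged sketch of how the constraint from point 2 could be expressed (hypothetical column names, assuming Django on PostgreSQL):

```python
from django.db import models


class DataSource(models.Model):
    # polymorphic join table: each row points at exactly one concrete source
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    gcs_bucket = models.ForeignKey('GCSBucket', null=True, blank=True, on_delete=models.CASCADE)
    research_environment = models.ForeignKey('ResearchEnvironment', null=True, blank=True, on_delete=models.CASCADE)

    class Meta:
        constraints = [
            # equivalent to CHECK (num_nonnulls(gcs_bucket_id, research_environment_id) = 1)
            models.CheckConstraint(
                name='datasource_exactly_one_source',
                check=(
                    models.Q(gcs_bucket__isnull=False, research_environment__isnull=True)
                    | models.Q(gcs_bucket__isnull=True, research_environment__isnull=False)
                ),
            ),
        ]
```

Adding a new backend would then mean adding a new nullable column and extending the constraint.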

alistairewj commented 1 year ago

In a separate discussion, it seems we settled on a fourth approach, where we have a DataSource entity with columns for each type of access.

Current access approaches

We currently have the following mechanisms for accessing file content. All of these assume that the user has the appropriate authorization for the data (which conceptually we would like to handle separately):

  1. Direct / "Local"
    • Data preparation: N/A - the files are organized during the publication of the project
    • Configuration: N/A
    • Request: N/A - user does not make an explicit request for this access mechanism.
    • Provision: (PhysioNet) files are displayed in the files pane, (HDN) this functionality is disabled by the allow_file_downloads flag
  2. GCP - BigQuery
    • Data preparation: A BigQuery dataset with the data must be created by an admin separately.
    • Configuration: A Google Admin group e-mail must be created with read permission to the aforementioned BigQuery dataset. On the management page of a project, this Google Admin Group e-mail is added.
    • Request: User adds their cloud identity separately in the profile page. User then triggers an API call against Google Admin to add their cloud account to the associated Google Admin Group.
    • Provision: Access is then implicit - users can authenticate with their google account and read data using the BigQuery service
  3. GCP - GCS Buckets
    • Data preparation: A Google Cloud Storage (GCS) bucket must be created with the data. There is an integration within project management to send files to GCP which in principle does this. I don't know the extent to which this works since we recently had a few issues.
    • Configuration: A Google Admin group e-mail must be created with read permission to the aforementioned GCS bucket. On the management page of a project, this Google Admin Group e-mail is added.
    • Request: User adds their GCP cloud identity separately in the profile page. User then triggers an API call against Google Admin to add their cloud account to the associated Google Admin Group.
    • Provision: Access is then implicit - users can authenticate with their google account and navigate to the data via GCS or use gsutil with the bucket name provided
  4. AWS - S3
    • Data preparation: Data must be uploaded to an s3 bucket on AWS separately.
    • Configuration: On the management page of the project, the s3 bucket is added. The permissions are added directly to the bucket (there is an upper limit on the number of users we can add this way, something between 1000 - 10,000 if I recall).
    • Request: User adds their AWS cloud identity separately to the profile page. User then triggers an API call against AWS to add their cloud account to the s3 bucket.
    • Provision: Access is then implicit - users can use the s3 CLI to download the data. The (only?) real example of this being used is to support an AWS Athena workflow outlined in this mimic-code tutorial and this associated AWS blog post.
  5. AWS - Open Data
    • Data preparation: Metadata is submitted to AWS describing the dataset.
    • Configuration: The s3 bucket is added.
    • Request: N/A (?)
    • Provision: N/A (?).
    • It's unclear whether the Open Data program needs to be separate from s3.
  6. Research Environment
    • Data preparation: N/A - the files are organized during the publication of the project
    • Configuration: N/A - the environment is configured during publication of the project
    • Request: N/A - user does not make an explicit request for this access mechanism.
    • Provision: User navigates to the Research Environments app

User stories

Tried to order these by how frequently they're used.

  1. Direct file access.
    • After getting appropriate approval, users can navigate the files on the website. Mostly though they will wget down the files using the supplied command. This has a few edge cases, particularly with hierarchical file structures (e.g. wget pulls a whole bunch of index.html files before getting any data; in the case of MIMIC-CXR it pulls ~300,000 index files, more detail in my comment in issue #796).
    • Alternatively users download a zip file generated from the management page.
  2. BigQuery
    • Almost entirely used by the MIMIC community to get access to the data on BigQuery. Around ~400 users signed up for this since the latest version of MIMIC-IV was released in January (~700 had already signed up before then). For MIMIC-III, which has been up for a few years, there's ~5000 users.
  3. GCS
    • Not sure of the extent to which this is used. We used to push downloads through GCS but stopped more recently. For MIMIC-CXR, we had around ~1000 users who signed up for the buckets.
  4. AWS
    • I think very few (if any) people use the AWS integration, mostly because it's awkward. Those who do use AWS probably download and copy the data to their own instance.

Suggested model

| Column | Type / Fields | Description |
|---|---|---|
| project_id | integer | Published project primary key |
| data_source_id | integer | Unique to the type of data access. |
| files_available | boolean | Whether the access mechanism can support listing / downloading of files. |
| access_mechanism | enum / string | Groups together separate services which use similar access mechanisms. Allowed values: 'google-group-email', 's3', 'research-environment' |
| email | string | For GCP access (potentially others if they do e-mail list based access?), this would store the email of the group. |
| uri | string | The URI for the data on the external service. For s3 this would be of the form s3://<bucket_name>, for gsutil this would be of the form gs://<bucket_name>. |

I think this works? Seems as though we can cover all the use cases with just an email and a uri attribute. Needs some refinement in the access_mechanism var.
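
Not part of the proposal itself, just a rough Django rendering of the table above to make it easier to poke at (data_source_id omitted, everything beyond the listed fields hypothetical):

```python
from django.db import models


class DataSource(models.Model):
    # sketch of the suggested model above
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    files_available = models.BooleanField(default=False)
    access_mechanism = models.CharField(
        max_length=32,
        choices=[
            ('google-group-email', 'Google Group Email'),
            ('s3', 'S3'),
            ('research-environment', 'Research Environment'),
        ],
    )
    email = models.EmailField(blank=True)  # e.g. the Google Admin group address
    uri = models.CharField(max_length=255, blank=True)  # e.g. s3://<bucket_name> or gs://<bucket_name>
```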

Other miscellaneous related issues

Might be worth raising issues for these separately after the initial refactor effort.

kshalot commented 1 year ago

@alistairewj Two quick questions:

What's the purpose of data_source_id? Is it just the primary key or am I missing something? If it's to distinguish between the sources, couldn't it be an enum?

Maybe that's more of a comment than a question, but I don't see the need for the files_available boolean ("Whether the access mechanism can support listing / downloading of files"), since this information is already embedded within the data source type (we know that research environments disallow this, that local files allow it, etc.). What's interesting to me is how GCS ties into this whole thing. Right now we have two implementations:

a) The GCP model that's used to simply allow access via a bucket. b) The GCSProjectFiles integration that mimics local file access.

So option b) simply looks like a), where the files are directly uploaded to the bucket when creating a project. So maybe there is no reason to draw a distinction between the two. In the case of Health Data Nexus, we could imagine a GCS DataSource that's simply disabled (since this is where the data lives, but is not available to the user). The bucket is then only used when using a research environment - they are very tightly coupled together. For example, what would it mean for a project to have research_environment DataSource, but not the GCS one? Unless we make research environments source-agnostic (so they can, for example, wget the data), they remain a GCP-specific feature, so maybe the model should reflect that for now.

This is very brainstormy - I'll gather more thoughts and write up something more thought-through as well, but I figured I'd post this for the sake of discussion.

kshalot commented 1 year ago

I guess I'm thinking about how to revamp data access i.e. granting specific users access to specific datasets. We briefly touched on somehow materializing that.

Because it seems that we have data sources (GCS, S3, BigQuery, local, etc.) and data access mechanisms (direct download, research environment). So in an ideal world, we could define data sources for a dataset and manage access independently. For example, HDN would create an "access via research environment" grant to a source for the participants of an event.

alistairewj commented 1 year ago

@alistairewj Two quick questions:

What's the purpose of data_source_id? Is it just the primary key or am I missing something? If it's to distinguish between the sources, couldn't it be an enum?

Yeah I think it can be an enum, I thought about putting it as an enum initially.

Maybe that's more of a comment than a question, but I don't see the need for the files_available boolean ("Whether the access mechanism can support listing / downloading of files"), since this information is already embedded within the data source type (we know that research environments disallow this, that local files allow it, etc.).

I tried to recapitulate the discussion with this field (enable checking if a particular data source supports a type of service like serving files), but likely I didn't follow the exact nuance @tompollard / you (@kshalot) had in mind.

What's interesting to me is how GCS ties into this whole thing. Right now we have two implementations:

a) The GCP model that's used to simply allow access via a bucket. b) The GCSProjectFiles integration that mimics local file access.

So option b) simply looks like a), where the files are directly uploaded to the bucket when creating a project. So maybe there is no reason to draw a distinction between the two. In the case of Health Data Nexus, we could imagine a GCS DataSource that's simply disabled (since this is where the data lives, but is not available to the user). The bucket is then only used when using a research environment - they are very tightly coupled together. For example, what would it mean for a project to have research_environment DataSource, but not the GCS one? Unless we make research environments source-agnostic (so they can, for example, wget the data), they remain a GCP-specific feature, so maybe the model should reflect that for now.

So then to be very explicit about this:

DataLocation model

| Column | Type / Fields | Description |
|---|---|---|
| data_location_id | integer (PK) | Uniquely identifies the project / data location pair |
| project_id | integer | Published project primary key |
| data_location | enum / string | The location of the data. Allowed values: 'direct', 'gcs', 'bigquery', 's3' |

DataProvision model

| Column | Type / Fields | Description |
|---|---|---|
| data_location_id | integer | Uniquely identifies the project / data location pair, links to DataLocation model |
| access_control_group | string | The access control group for the data. For GCS/BigQuery access, this would store the email of the Google Admin group. |
| uri | string | The URI for the data. For s3 this would be of the form s3://<bucket_name>, for gsutil this would be of the form gs://<bucket_name>. |

Putting this up for discussion today/tomorrow. Not 100% sure about it!

  1. It's nice to separate how the data will be accessed (wget, gsutil, s3, bigquery, research-env) from where the data are stored (direct, gcs)
  2. I suppose the project would check data provisioning only, to determine how to list out the files
  3. There would be some odd constraints to enforce, e.g. only allowed to create a DataProvision with research-env if the DataSource gcs is used.
  4. It seems more complicated than it needs to be, though I can't put my finger on why right now.
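
A minimal sketch of the two-model split described in the tables above, assuming Django (field names taken from the tables, everything else hypothetical):

```python
from django.db import models


class DataLocation(models.Model):
    # where the data for a published project are stored
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE)
    data_location = models.CharField(
        max_length=16,
        choices=[('direct', 'Direct'), ('gcs', 'GCS'),
                 ('bigquery', 'BigQuery'), ('s3', 'S3')],
    )


class DataProvision(models.Model):
    # how data at a given location are provisioned to users
    data_location = models.ForeignKey(DataLocation, on_delete=models.CASCADE)
    access_control_group = models.CharField(max_length=255, blank=True)  # e.g. Google Admin group e-mail
    uri = models.CharField(max_length=255, blank=True)  # e.g. s3://<bucket_name> or gs://<bucket_name>
```
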
kshalot commented 1 year ago

There would be some odd constraints to enforce, e.g. only allowed to create a DataProvision with research-env if the DataSource gcs is used.

Maybe splitting this immediately is not the way to go. This would somewhat shift the way we need to think about access and (I think, just guessing) would require a significant refactor to make it usable in the codebase (e.g. aligning the GCP model with GCSProjectFiles). This could be done in two steps:

  1. Make everything fall under the umbrella of the new DataSource.
  2. Split DataSource into DataLocation and DataProvision.

Separating it is a neat idea, but it seems that source and means of access are not the only things that need to be decoupled, because Events will grant access to a specific DataSource/DataProvision (for example to a research environment or bucket), so they introduce a new mechanism of authorization outside the regular has_access/accessible_by defined on PublishedProject. We also noted some other issues with this model, like the fact that we had to add a separate manager to make it work etc. In general, the authorization logic is very tightly coupled with the code that uses it. Since this discussion is somewhat related, I'll dump a couple of thoughts below.

Authorization

Just for reference, the current authorization mechanisms are:

Credentialing

Contributor Review Applications

Training

Data Use Agreement

Events

Since they can grant access independently (e.g. events), the logic either has to be fully aware of how authorization is done (so we end up with something like if is_authorized or is_participant_of_relevant_event everywhere), or we'll need a separate function for each data source, like:

```python
def is_authorized_to_files(self): ...

def is_authorized_to_research_environment(self): ...

def is_authorized_to_s3_bucket(self): ...

...
```

It's tightly coupled. Changing the authorization implies changes all over the place.

A second issue, one that was mentioned some weeks ago, is the fact that giving users permission to use an alternate data source (e.g. adding them to a GCP group so they can use a bucket) is not persisted anywhere.

Alternate Solution

I was toying with the idea of creating something like an AccessGrant model which is the authority on whether a specific user has access to the dataset (via a data source/location). This adds a lot of overhead on managing this table, but the entire application can just refer to it as the source of truth about access. A quick visualization:

[Diagram: Access Grant drawio]

Or to put it simply - users have multiple grants, each grant gives access to a specific DataSource that relates to a specific PublishedProject. So instead of checking credentialing etc. each time the user accesses the project, the platform would simply check whether a grant exists.

The bad part is managing this - those grants would have to be created/destroyed whenever the conditions change (e.g. the user is not credentialed anymore, training expires etc.) which would be a headache. So this would shift the pain-point from checking authorization to managing who's authorized.

On the other hand, we could probably make do with a code-only refactor here as well - maybe extracting this logic to a separate module/app to encapsulate it so it's not in pieces all over the codebase.
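
Purely illustrative, but an AccessGrant along those lines might look something like this (assuming Django; the DataSource reference and all field names are hypothetical):

```python
from django.conf import settings
from django.db import models


class AccessGrant(models.Model):
    # single source of truth: "this user may use this data source of this project"
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    data_source = models.ForeignKey('DataSource', on_delete=models.CASCADE)
    granted_at = models.DateTimeField(auto_now_add=True)
    revoked_at = models.DateTimeField(null=True, blank=True)


def has_grant(user, data_source):
    # the rest of the platform checks for an active grant instead of
    # re-evaluating credentialing/training/DUA/events on every request
    return AccessGrant.objects.filter(
        user=user, data_source=data_source, revoked_at__isnull=True,
    ).exists()
```

The hard part, as noted above, is keeping this table in sync when credentialing or training status changes.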

alistairewj commented 1 year ago

For @tompollard's benefit, we decided:

  1. We'll refactor the authorization into a separate app first (no logic changes)
  2. Most of us like the idea of the AccessGrant if we can minimize the footguns, so we'll discuss that further after the refactor

@amitupreti is taking the lead on the refactor of authorization into a separate app (probably named authorization)

tompollard commented 1 year ago

Django 4.2 adds new functionality that "allows configuring multiple custom file storage backends. It also controls storage engines for managing files (the "default" key) and static files (the "staticfiles" key).".

For the release notes, see: https://docs.djangoproject.com/en/4.2/releases/4.2/#custom-file-storages. Seems relevant to our discussions around file storage, cloud, etc, so posting this here.
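
For reference, a minimal example of the new setting (these are just the Django defaults, not a proposal for how PhysioNet should configure it):

```python
# settings.py, Django >= 4.2
STORAGES = {
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}
```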

alistairewj commented 1 year ago

The scope of this issue can obviously expand greatly, so I've tried to simplify into some initial steps.

  1. Refactor event access into the authorization module (#2016)
  2. Create a new model to capture granting usage of data on various services
  3. Decide how that latter model should also incorporate local/direct file access

The main discussion comes around the second point. Requirements are:

So an initial proposal could be something like this:

```python
from django.conf import settings
from django.db import models

from project.models import PublishedProject


# Django's TextChoices gives us .choices for the CharFields below
class ProjectAccessStatus(models.TextChoices):
    GRANTED = 'Granted'
    EXPIRED = 'Expired'
    # removed is used if in a daily clean-up/check with
    # the external services, the user is no longer listed in
    # the external access list, i.e. maybe someone manually
    # removed them from the google group.
    REMOVED = 'Removed'


class ProjectAccessMechanism(models.TextChoices):
    DIRECT = 'Direct'
    GOOGLE_GROUP = 'Google Group Email'
    S3_POLICY = 'S3'
    RESEARCH_ENVIRONMENT = 'Research Environment'


class ProjectDataServices(models.Model):
    # implicitly, the lack of an entry in the ProjectDataServices
    # table indicates no access for any external services.
    user = models.ManyToManyField(settings.AUTH_USER_MODEL)
    project = models.ForeignKey(PublishedProject, on_delete=models.CASCADE)
    # general access related fields
    access_modified_date = models.DateTimeField(auto_now_add=True)
    access_status = models.CharField(
        max_length=32, choices=ProjectAccessStatus.choices, editable=False)
    # which type of access does this object indicate was granted
    access_mechanism = models.CharField(
        max_length=32, choices=ProjectAccessMechanism.choices, editable=False)
    # access specific attributes like URI, e-mail etc ...
```

For the access specific attributes, we could include a generic foreign key to external models like GCSDataService which have the service specific attributes. I think this approach is nice as we have to have a custom ResearchDataService for HDN, and it cleanly separates it from the core model. But... it is a generic relation... The alternative is nullable foreign keys, but adding new services implies a change in this original model. This is in effect the polymorphic suggestion Karol had earlier with a bit of a twist.
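
A hedged sketch of the generic-relation variant (the GCSDataService / ResearchDataService models are hypothetical; this only shows the contenttypes wiring):

```python
from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models


class ProjectDataServices(models.Model):
    # ... general fields from the sketch above ...

    # generic link to a service-specific model (e.g. a GCSDataService or a
    # ResearchDataService for HDN) holding the URI, group e-mail, etc.
    service_type = models.ForeignKey(ContentType, null=True, blank=True, on_delete=models.SET_NULL)
    service_id = models.PositiveIntegerField(null=True, blank=True)
    service = GenericForeignKey('service_type', 'service_id')
```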

We'd want to think about how to incorporate event access into this services model, if at all possible, or whether we should simply maintain a separate EventAccess model.

Finally, the third point is how to handle direct data access. Currently, we use the allow_file_downloads workaround to disable direct file access on HDN. This works, but it is manually configured upon project submission by the author of the project, which is confusing and prone to error. I'm hopeful we can deal with that after finalizing the above model, but I'm mentioning it here in case it raises some relevant questions.