kshalot opened 1 year ago
Currently DataAccess does not incorporate Google Cloud Storage (project.models.GCP).
Is your thought that there is one DataAccess representing "ability to view/download source files", or one DataAccess per storage backend?
Currently? One DataAccess per backend, per project. Regarding Google Cloud Storage, from what I saw there is a separate model for it (as you pointed out, project.models.GCP) but it's also controlled via the DataAccess model with platform 3 (gcp-bucket):
https://github.com/MIT-LCP/physionet-build/blob/cfbfdf32ac1572f67550c0641cd8d7e312b3be13/physionet-django/project/templates/project/published_project_data_access.html#L7-L10
project.gcp is a relation with project.models.GCP.
One architecture would be to have a similar relation exist for other data accesses as well. I could imagine a polymorphic relation linking DataAccess to models specific to data source backends.
Also, naming the platform local is not really accurate now that I think about it, because we also have the GCS ProjectFiles backend that serves the files from a bucket. So direct or something along those lines is probably more accurate.
And stemming from that last point, there seems to be some redundancy here that ideally should be refactored - the GCS ProjectFiles backend and project.models.GCP in theory both model accessing files via GCS. Their use cases are different, but at least the low-level mechanisms are very similar.
I was thinking that before considering this and other major refactors, we could at least make any data access go through the DataAccess model, to make it easier to reason about things like changes in the way access is managed in the platform. For example, materializing that access was granted to a user for a specific backend will make it explicit who has access to what and how, while also integrating with events much more easily (there are a lot of questions to answer in this approach though).
> Currently? One DataAccess per backend, per project. Regarding Google Cloud Storage, from what I saw there is a separate model for it (as you pointed out, project.models.GCP) but it's also controlled via the DataAccess model with platform 3 (gcp-bucket): [...]
Ah, okay.
When we create a GCP bucket with "send files to GCP", this creates a GCP object. It does not create a DataAccess object.
There are no DataAccess objects (with platform=3) for open access projects. There are DataAccess objects for some (but not all) restricted projects. I guess that somebody (Tom? Felipe?) must be creating these by hand.
Looking at this more, DataAccess is even more of a mess than I thought. "location" is sometimes a location and sometimes an access control group.
The names of the access groups are apparently chosen ad hoc, and there can be multiple access groups for the same bucket.
I agree with the principle of trying to unify access methods through a single class, but I think it might be better to define a new class rather than trying to retrofit DataAccess.
Yes, and adding the new accesses will introduce a third case, where location is blank. I'm all for creating a new class - at first I thought it would be simpler to reason about the changes if we aligned everything and then refactored it, but that's backwards. After looking at it closer, and given your comments @bemoody, I'd say we should:

- Migrate the existing DataAccess usage to a new mechanism that we are happy with.

There's no point in introducing more technical debt now by thinking of ways to make it all fall together. I see a couple of options:
PostgreSQL has great support for JSON, so the location column could simply be replaced by a metadata column of the json/jsonb type. The obvious disadvantage is that the overhead of managing a column like this is higher, since there are no guarantees in regard to its schema. In our case, there is currently no need to use this column for search (although it's possible to index such a column in case we need to search for particular key/value pairs).
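For illustration, a minimal Django sketch of this option (the metadata field and the overall model shape here are assumptions, not the current code):

```python
from django.db import models


class DataAccess(models.Model):
    """Sketch: the old `location` column replaced by a free-form
    `metadata` column (maps to jsonb on PostgreSQL)."""
    project = models.ForeignKey('project.PublishedProject',
                                on_delete=models.CASCADE)
    platform = models.PositiveSmallIntegerField()
    # Backend-specific settings, e.g. {"uri": "gs://some-bucket"} or
    # {"access_group": "group@googlegroups.com"}; no schema is
    # enforced by the database.
    metadata = models.JSONField(default=dict, blank=True)
```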
One simple solution is to have explicit relations for each of the existing/future DataSources. For example:

This can be a 1-to-many as well. Retrieving a project will be heavier since N tables will have to be joined instead of 1 (the original DataAccess). Doesn't seem like a huge issue though, in my opinion.
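A sketch of what such explicit relations might look like (model and field names are hypothetical):

```python
from django.db import models


class GCSBucketAccess(models.Model):
    # One row per project that exposes its files via a GCS bucket.
    project = models.OneToOneField('project.PublishedProject',
                                   on_delete=models.CASCADE)
    bucket_uri = models.CharField(max_length=255)  # e.g. gs://<bucket_name>
    access_group = models.EmailField(blank=True)


class ResearchEnvironmentAccess(models.Model):
    # One row per project that can be used from a research environment.
    project = models.OneToOneField('project.PublishedProject',
                                   on_delete=models.CASCADE)
    access_group = models.EmailField()
```

Switching OneToOneField to ForeignKey would give the 1-to-many variant mentioned above.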
This approach would be useful in case this relation was Many-to-Many, which might not be the case any time soon/ever.
The idea is to create a DataSource entity that would reference:

a) The PublishedProject entity.
b) The primary key of the concrete DataSource entity (GCSBucket, ResearchEnvironment etc.) that could be autogenerated as coalesce(gcs_bucket_id, research_environment_id, ...), guarded by CHECK (num_nonnulls(gcs_bucket_id, research_environment_id, ...) = 1).

Adding a new type of supported data source = adding a new column to this table. The concrete DataSources would have separate tables and contain whatever information they need.
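A rough Django sketch of this exclusive-arc layout (model and field names are illustrative, and the CHECK is expressed as a CheckConstraint rather than raw SQL):

```python
from django.db import models
from django.db.models import Q


class DataSource(models.Model):
    project = models.ForeignKey('project.PublishedProject',
                                on_delete=models.CASCADE)
    # Exactly one of the concrete source references is non-null.
    gcs_bucket = models.ForeignKey('GCSBucket', null=True, blank=True,
                                   on_delete=models.CASCADE)
    research_environment = models.ForeignKey('ResearchEnvironment',
                                             null=True, blank=True,
                                             on_delete=models.CASCADE)

    class Meta:
        constraints = [
            # Equivalent of CHECK (num_nonnulls(gcs_bucket_id,
            # research_environment_id) = 1).
            models.CheckConstraint(
                check=(
                    Q(gcs_bucket__isnull=False,
                      research_environment__isnull=True)
                    | Q(gcs_bucket__isnull=True,
                        research_environment__isnull=False)
                ),
                name='datasource_exactly_one_backend',
            ),
        ]
```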
In a separate discussion, it seems we settled on a fourth approach, where we have a DataSource entity with columns for each type of access.
We currently have the following mechanisms of accessing file content. All of these assume that a user has the appropriate authorization for the data (which conceptually we would like to handle separately). Here are the current mechanisms of data access (tried to order these by how frequently each is used):

- Direct file downloads, controlled by the allow_file_downloads flag.
- wget down the files using the supplied command. This has a few edge cases, particularly with hierarchical file structures (e.g. wget pulls a whole bunch of index.html files before getting any data; in the case of MIMIC-CXR it pulls ~300,000 index files, more detail in my comment in issue #796).

A proposed model:

| Column | Type / Fields | Description |
|---|---|---|
| project_id | integer | Published project primary key |
| data_source_id | integer | Unique to the type of data access. |
| files_available | boolean | Whether the access mechanism can support listing / downloading of files. |
| access_mechanism | enum / string | Groups together separate services which use similar access mechanisms. Allowed values: 'google-group-email', 's3', 'research-environment' |
| email | string | For GCP access (potentially others if they do e-mail list based access?), this would store the email of the group. |
| uri | string | The URI for the data on the external service. For s3 this would be of the form `s3://<bucket_name>`, for gsutil this would be of the form `gs://<bucket_name>`. |
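Rendered as a Django model, the proposed table might look roughly like this (the model name and field types are assumptions):

```python
from django.db import models

ACCESS_MECHANISMS = [
    ('google-group-email', 'google-group-email'),
    ('s3', 's3'),
    ('research-environment', 'research-environment'),
]


class ProjectDataSource(models.Model):
    project = models.ForeignKey('project.PublishedProject',
                                on_delete=models.CASCADE)
    data_source_id = models.IntegerField()  # unique per type of data access
    # Whether the mechanism supports listing / downloading files.
    files_available = models.BooleanField(default=False)
    access_mechanism = models.CharField(max_length=32,
                                        choices=ACCESS_MECHANISMS)
    # For GCP access, the e-mail of the Google group.
    email = models.EmailField(blank=True)
    # e.g. s3://<bucket_name> or gs://<bucket_name>.
    uri = models.CharField(max_length=255, blank=True)
```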
I think this works? Seems as though we can cover all the use cases with just an email and a uri attribute. Needs some refinement in the access_mechanism var.
Might be worth raising issues for these separately after the initial refactor effort.
@alistairewj Two quick questions:

1. What's the purpose of data_source_id? Is it just the primary key or am I missing something? If it's to distinguish between the sources, couldn't it be an enum?
2. Maybe that's more of a comment than a question, but I don't see the files_available boolean as "Whether the access mechanism can support listing / downloading of files." since this information is already embedded within the data source type (we know that dealing with research environments disallows this, we know that local files allow for that etc.).

What's interesting to me is how GCS ties into this whole thing. Right now we have two implementations:

a) The GCP model that's used to simply allow access via a bucket.
b) The GCSProjectFiles integration that mimics local file access.

So option b) simply looks like a), where the files are directly uploaded to the bucket when creating a project. So maybe there is no reason to draw a distinction between the two. In the case of Health Data Nexus, we could imagine a GCS DataSource that's simply disabled (since this is where the data lives, but is not available to the user). The bucket is then only used when using a research environment - they are very tightly coupled together. For example, what would it mean for a project to have a research_environment DataSource, but not the GCS one? Unless we make research environments source-agnostic (so they can, for example, wget the data), they remain a GCP-specific feature, so maybe the model should reflect that for now.
This is very brainstormy - I'll gather more thoughts and write something up more thought through as well, but I figured I'll post this for the sake of discussion.
I guess I'm thinking about how to revamp data access, i.e. granting specific users access to specific datasets. We briefly touched on somehow materializing that.
Because it seems that we have data source (GCS, S3, BigQuery, local, etc.) and data access (direct download, research environment). So in an ideal world, we could define data sources for a dataset and manage access independently. For example, HDN would create an "access via research environment grant" to a source for the participants of an event.
> @alistairewj Two quick questions:
>
> What's the purpose of data_source_id? Is it just the primary key or am I missing something? If it's to distinguish between the sources, couldn't it be an enum?

Yeah I think it can be an enum, I thought about putting it as an enum initially.

> Maybe that's more of a comment than a question, but I don't see the files_available boolean as "Whether the access mechanism can support listing / downloading of files." since this information is already embedded within the data source type (we know that dealing with research environments disallows this, we know that local files allow for that etc.).

I tried to recapitulate the discussion with this field (enable checking if a particular data source supports a type of service like serving files), but likely I didn't follow the exact nuance @tompollard / you (@kshalot) had in mind.

> What's interesting to me is how GCS ties into this whole thing. [...] Unless we make research environments source-agnostic (so they can, for example, wget the data), they remain a GCP-specific feature, so maybe the model should reflect that for now.
So then to be very explicit about this:
DataLocation model

| Column | Type / Fields | Description |
|---|---|---|
| data_location_id | integer (PK) | Uniquely identifies the project / data location pair |
| project_id | integer | Published project primary key |
| data_location | enum / string | The location of the data. Allowed values: 'direct', 'gcs', 'bigquery', 's3' |

DataProvision model

| Column | Type / Fields | Description |
|---|---|---|
| data_location_id | integer | Uniquely identifies the project / data location pair, links to DataLocation model |
| access_control_group | string | The access control group for the data. For GCS/BigQuery access, this would store the email of the Google Admin group. |
| uri | string | The URI for the data. For s3 this would be of the form `s3://<bucket_name>`, for gsutil this would be of the form `gs://<bucket_name>`. |
Putting this up for discussion today/tomorrow. Not 100% sure about it!

- It separates how the data are accessed (wget, gsutil, s3, bigquery, research-env) from where the data are stored (direct, gcs).
- There would be some odd constraints to enforce, e.g. only allowed to create a DataProvision with research-env if the DataSource gcs is used.
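For concreteness, a rough Django rendering of the split (column names come from the tables above; field types and lengths are assumptions):

```python
from django.db import models

DATA_LOCATIONS = [('direct', 'direct'), ('gcs', 'gcs'),
                  ('bigquery', 'bigquery'), ('s3', 's3')]


class DataLocation(models.Model):
    # The implicit id column plays the role of data_location_id.
    project = models.ForeignKey('project.PublishedProject',
                                on_delete=models.CASCADE)
    data_location = models.CharField(max_length=16, choices=DATA_LOCATIONS)


class DataProvision(models.Model):
    data_location = models.ForeignKey(DataLocation, on_delete=models.CASCADE)
    # For GCS/BigQuery access, the email of the Google Admin group.
    access_control_group = models.CharField(max_length=255, blank=True)
    # e.g. s3://<bucket_name> or gs://<bucket_name>.
    uri = models.CharField(max_length=255, blank=True)
```

The "odd constraint" above (research-env provisions only making sense for gcs locations) would then live in a clean() method or a database constraint.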
Maybe splitting this immediately is not the way to go. This would somewhat shift the way we need to think about access and (I think, just guessing) would require a significant refactor to make it usable in the codebase (e.g. aligning the GCP model with GCSProjectFiles). This could be done in two steps:

1. Align everything under a single DataSource.
2. Split DataSource into DataLocation and DataProvision.
Separating it is a neat idea, but it seems that source and means of access are not the only things that need to be decoupled, because Events will grant access to a specific DataSource/DataProvision (for example to a research environment or bucket), so they introduce a new mechanism of authorization outside the regular has_access/accessible_by defined on PublishedProject. We also noted some other issues with this model, like the fact that we had to add a separate manager to make it work etc. In general, the authorization logic is very tightly coupled with the code that uses it. Since this discussion is somewhat related, I'll dump a couple of thoughts below.
Just for reference, the current authorization mechanisms are:

- Credentialing
  - Managed by the CredentialApplication model.
  - is_credentialed boolean on the User model.
  - Granted via the CredentialApplication#accept method. Sets is_credentialed to True.
  - Revoked via the CredentialApplication#revoke method. Sets is_credentialed to False.
- Data access requests
  - Managed by the DataAccessRequest model.
  - Tied to a User. The status of the DataAccessRequest controls access.
  - Handled via the DataAccessResponseForm.
  - Expiration is controlled by the duration field. If the field is set to NULL, then the DataAccessRequest will never expire.
- Training
  - Managed by the Training model.
  - Tied to a User. The status of the Training controls access.
  - Accepted via the Training#accept method. Sets status to TrainingStatus.ACCEPTED.
  - Rejected via the Training#reject method. Sets status to TrainingStatus.REJECTED.
  - Expiration is controlled by the valid_duration field. If the field is set to NULL, then the Training will never expire.
  - Trainings cannot be revoked after they are accepted (unless they are removed from the database).
- DUA signatures
  - Managed by the DUASignature model.
  - Tied to a User. The existence of the DUASignature controls access.
- Events
  - Managed by the Event model, together with the EventParticipant and EventDataset models.
  - Participation in an Event grants access to the related datasets regardless of whether the participant meets the regular criteria that manage access to the project.
  - The Event should also control the source that the participants can use to access the data, i.e. the Event can specify that the related EventDatasets are only available via a research environment and not in a different way.

Since they can grant access independently (e.g. events), the logic either has to be fully aware of how authorization is done (so we can have something like `if is_authorized or is_participant_of_relevant_event` everywhere), or we'll need a separate function for each data source, like:
```python
def is_authorized_to_files(self): ...
def is_authorized_to_research_environment(self): ...
def is_authorized_to_s3_bucket(self): ...
...
```
It's tightly coupled. Changing the authorization implies changes all over the place.
A second issue, one that was mentioned some weeks ago, is the fact that giving users permission to use an alternate data source (e.g. adding them to a GCP group so they can use a bucket) is not persisted anywhere.
I was toying with the idea of creating something like an AccessGrant model which is the authority on whether a specific user has access to the dataset (via a data source/location). This adds a lot of overhead on managing this table, but the entire application can just refer to it as the source of truth about access. A quick visualization:
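Something like the following sketch, standing in for the original diagram (all names here are hypothetical):

```python
from django.db import models


class AccessGrant(models.Model):
    """Sketch: one row means `user` may reach `data_source`
    (and therefore its PublishedProject)."""
    user = models.ForeignKey('user.User', on_delete=models.CASCADE)
    # Points at the hypothetical DataSource entity discussed above.
    data_source = models.ForeignKey('project.DataSource',
                                    on_delete=models.CASCADE)
    granted_at = models.DateTimeField(auto_now_add=True)
    # NULL = the grant never expires.
    expires_at = models.DateTimeField(null=True, blank=True)

    class Meta:
        unique_together = ('user', 'data_source')
```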
Or to put it simply - users have multiple grants, each grant gives access to a specific DataSource that relates to a specific PublishedProject. So instead of checking credentialing etc. each time the user accesses the project, the platform would simply check whether a grant exists.
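With that in place, an access check could collapse to a single query, e.g.:

```python
def has_access(user, data_source):
    # The grant table is the source of truth; no need to re-check
    # credentialing, training, DUAs etc. on every request.
    return AccessGrant.objects.filter(user=user,
                                      data_source=data_source).exists()
```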
The bad part is managing this - those grants would have to be created/destroyed whenever the conditions change (e.g. the user is not credentialed anymore, training expires etc.) which would be a headache. So this would shift the pain-point from checking authorization to managing who's authorized.
On the other hand, we could probably make do with a code-only refactor here as well - maybe extracting this logic to a separate module/app to encapsulate it so it's not in pieces all over the codebase.
For @tompollard's benefit, we decided:

- To pursue something like AccessGrant if we can minimize the footguns, so we'll discuss that further after the refactor.
- @amitupreti is taking the lead on the refactor of authorization into a separate app (probably named authorization).
Django 4.2 adds new functionality that "allows configuring multiple custom file storage backends. It also controls storage engines for managing files (the "default" key) and static files (the "staticfiles" key).".
For the release notes, see: https://docs.djangoproject.com/en/4.2/releases/4.2/#custom-file-storages. Seems relevant to our discussions around file storage, cloud, etc., so posting this here.
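For reference, the new setting looks like this (these are the framework defaults; a custom cloud backend would slot in the same way):

```python
# settings.py (Django >= 4.2)
STORAGES = {
    # Storage engine for uploaded/managed files.
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
    },
    # Storage engine for collected static files.
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}
```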
The scope of this issue can obviously expand greatly, so I've tried to simplify it into some initial steps.

The main discussion comes around the second point. Requirements are:

- grant() and revoke() methods

So an initial proposal could be something like this:
```python
from django.db import models

from project.models import PublishedProject
from user.models import User


# Using Django's TextChoices (rather than a plain Enum) so that
# .choices works directly with CharField below.
class ProjectAccessStatus(models.TextChoices):
    GRANTED = 'Granted'
    EXPIRED = 'Expired'
    # REMOVED is used if, in a daily clean-up/check with
    # the external services, the user is no longer listed in
    # the external access list, i.e. maybe someone manually
    # removed them from the google group.
    REMOVED = 'Removed'


class ProjectAccessMechanism(models.TextChoices):
    DIRECT = 'Direct'
    GOOGLE_GROUP = 'Google Group Email'
    S3_POLICY = 'S3'
    RESEARCH_ENVIRONMENT = 'Research Environment'


class ProjectDataServices(models.Model):
    # Implicitly, the lack of an entry in the ProjectDataServices
    # table indicates no access for any external services.
    user = models.ManyToManyField(User)
    project = models.ForeignKey(PublishedProject, on_delete=models.CASCADE)

    # General access related fields.
    access_modified_date = models.DateTimeField(auto_now_add=True)
    access_status = models.CharField(
        max_length=32, choices=ProjectAccessStatus.choices, editable=False)

    # Which type of access does this object indicate was granted.
    access_mechanism = models.CharField(
        max_length=32, choices=ProjectAccessMechanism.choices, editable=False)

    # Access specific attributes like URI, e-mail etc ...
```
For the access specific attributes, we could include a generic foreign key to external models like GCSDataService which have the service specific attributes. I think this approach is nice as we have to have a custom ResearchDataService for HDN, and it cleanly separates it from the core model. But... it is a generic relation... The alternative is nullable foreign keys, but adding new services implies a change in this original model. This is in effect the polymorphic suggestion Karol had earlier, with a bit of a twist (see the sketch below).
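A minimal sketch of the generic-foreign-key variant (GCSDataService and the field names are hypothetical):

```python
from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models


class GCSDataService(models.Model):
    # Service-specific attributes live on the concrete model.
    google_group_email = models.EmailField()
    bucket_uri = models.CharField(max_length=255)


class ProjectDataServices(models.Model):
    # ... core fields from the proposal above ...
    # Generic relation to whichever concrete service model applies.
    service_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    service_id = models.PositiveIntegerField()
    service = GenericForeignKey('service_type', 'service_id')
```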
We'd want to think about how to incorporate event access into this services model, if at all possible, or whether we should simply maintain a separate EventAccess model.
Finally, the third point is how to handle direct data access. Currently, we use the allow_file_downloads workaround to disable direct file access on HDN. This works, but it is manually configured upon project submission by the author of the project, which is confusing and prone to error. I'm hopeful we can deal with that after finalizing the above model, but mentioning it here in case it raises some relevant questions.
(I used orgmode to generate this so now it looks more serious than it actually is)
Table of Contents

- Use the existing DataAccess abstraction to get rid of implicit project access

Overview

Use the existing DataAccess abstraction to get rid of implicit project access. Before refactoring the DataAccess functionality itself, the mechanism could be made generic (i.e. local file access and research environment access should be controlled by the DataAccess model). This will make any access via any source fall under the same mechanism, so it's easier to reason about a potential refactor in the future.

File downloads

- DataAccess already contains a local platform which is unused.
- Downloads would be represented as DataAccess(platform=0, location=None).
- The code that serves files can check if item.platform == 0.
- local access is enabled before serving files/zip (technically, the Google Cloud Storage ProjectFiles backend should also support wget. This can be aligned later when needed).
- Migration: create a local DataAccess for each PublishedProject.

Research environments

- Represented as DataAccess(platform=5, location="GCP group controlling access to the dataset").
- Add a research_environment platform to DataAccess.
- Only applies when ENABLE_CLOUD_RESEARCH_ENVIRONMENTS is set to True.
- The code that lists environments can check if item.platform == 5.
- If the user has no CloudIdentity, direct them to the setup.
- In the hdn-research-environments package, only display projects with an assigned research_environment DataAccess as available.
- Migration: create a research_environment DataAccess for each PublishedProject.

Events

- Events would also grant access through DataAccess. This means that we have to start operating with higher granularity when it comes to access to data (i.e. not check whether someone is authorized, but specifically check PublishedProject#can_create_research_environment for example).

Defaults

- Each deployment can define default platforms (e.g. research_environment for Health Data Nexus). The corresponding DataAccess should be created automatically.
- Unlike the local setting, research_environment has to have a location set. The logic that generates those values is controlled by the external hdn_research_environment package (as it should).
- Alternatively, location can be set to None. Then the application can do a simple check.

Cleanup

- The local DataAccess replaces the allow_file_downloads setting, which can now be removed.
- DataAccess could be renamed to DataSource to distinguish it from things like DataAccessRequest, AnonymousAccess etc.

Refactor
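A sketch of what the backfill migration step could look like (assuming platform 0 is the local/direct platform as above, and that the location column accepts NULL; the dependency name is a placeholder):

```python
from django.db import migrations


def create_local_data_access(apps, schema_editor):
    # Backfill: give every published project an explicit
    # "local" (direct download) DataAccess entry.
    DataAccess = apps.get_model('project', 'DataAccess')
    PublishedProject = apps.get_model('project', 'PublishedProject')
    for project in PublishedProject.objects.all():
        DataAccess.objects.get_or_create(
            project=project, platform=0, defaults={'location': None})


class Migration(migrations.Migration):
    dependencies = [
        # ('project', '00XX_previous_migration'),
    ]
    operations = [
        migrations.RunPython(create_local_data_access,
                             migrations.RunPython.noop),
    ]
```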