Closed: shevron closed this issue 3 years ago
From my further analysis, (3) and (4) above are not trivial to implement, as they require substantial changes in `ckanext-authz-service` and the way scopes are coupled to entities. Because a sha256 is not something we can look up a resource by (currently it is an "extra" attribute and not indexed), and the authorization functions will need to somehow fetch the resource to check whether the user has access, we'll need to find a way to parse `obj:<sha256>` into something we can use to fetch a resource and check permissions against it.
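To make the problem concrete, here is a rough sketch of what such parsing and lookup could look like. The function names and the dict-based resource representation are assumptions for illustration; this is not the actual `ckanext-authz-service` API. The linear scan in particular illustrates why the lookup is awkward: the sha256 lives in an un-indexed "extra" field, so there is no efficient query for it.

```python
# Hypothetical sketch, not the real ckanext-authz-service code.

def parse_obj_scope(scope):
    """Split an "obj:<sha256>:<action>" scope into its parts.

    Returns (entity, sha256, action), or None if the string is not
    an object scope.
    """
    parts = scope.split(":")
    if len(parts) != 3 or parts[0] != "obj":
        return None
    entity, sha256, action = parts
    if len(sha256) != 64:  # a sha256 hex digest is 64 characters
        return None
    return entity, sha256, action


def find_resource_by_sha256(resources, sha256):
    """Find the resource whose sha256 "extra" matches.

    Because the field is not indexed, a real implementation would have
    to do something equivalent to this linear scan server-side.
    """
    for res in resources:
        if res.get("sha256") == sha256:
            return res
    return None
```

Once the resource is found this way, the existing per-resource permission checks could be applied to it as usual.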
There are `scope_normalizers` in Authzzie which can normalize granted scopes. More on this approach down below: a normalizer for `obj` scopes can mangle requests for `res:<org>/<dataset>/<sha256>:read` into something like `res:*/*/<sha256>`. We would still use `lfs_prefix` to send the batch request, but not to get the auth token when downloading.

The scope normalizer based quick fix seems to be solid, and I'm moving on with that. Some minor changes were required in Giftless (see https://github.com/datopian/giftless/pull/61), so Giftless versions before 0.3.0 will not work with this fix.
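The normalization described above can be sketched roughly as follows. This is a minimal standalone illustration of the scope rewrite; the function name and the exact Authzzie `scope_normalizers` hook signature are assumptions, not the real API.

```python
# Hypothetical sketch of the scope rewrite a normalizer would perform;
# not the actual Authzzie scope_normalizers interface.

def normalize_res_scope(scope):
    """res:<org>/<dataset>/<sha256>:read  ->  res:*/*/<sha256>:read

    Wildcard the org/dataset parts so only the sha256 pins the object,
    making the scope survive dataset/organization renames and moves.
    Scopes that don't match this shape are returned unchanged.
    """
    try:
        entity, subscope, action = scope.split(":")
    except ValueError:
        return scope
    parts = subscope.split("/")
    if entity != "res" or len(parts) != 3:
        return scope
    return "res:*/*/{}:{}".format(parts[2], action)
```

With something like this in place, a token granted against the old org/dataset names still authorizes the download, because only the sha256 is matched.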
Right now, when a dataset is moved to a different organization, or is renamed, or the organization is renamed, authorization will break down and the resource will no longer be available for download.
To reproduce:
Analysis
In `get_authz_token()` we obtain an authorization token to download resources of the organization / dataset whose name is saved in `lfs_prefix`. When the dataset or organization is renamed, or the dataset is moved, the current names no longer match the stored `lfs_prefix`, so authorization fails.

Potential fixes:
1. Switching to storing UUIDs instead of names: will only partially solve the problem, and won't help if a dataset is moved between organizations.
2. Writing a custom resource authorization handler in `ckanext-blob-storage` that authorizes based on the actual resource, then provides a token for `lfs_prefix`. This may work, but I'm not sure it is possible given the current `ckanext-authz-service` API, which may not let us set a custom scope for a resource auth request. It would require decoupling the context dataset / organization the scope is requested for from the scope string itself in some way — that is some work on `ckanext-authz-service`.
3. Add support for object-specific scopes in Giftless and generate / use these kinds of scopes in `ckanext-blob-storage`. For example, instead of the current `obj:my-org/my-dataset/*:read` tokens that we use to get download access (in which `my-org/my-dataset` come from `lfs_prefix`), we generate tokens that look like `obj:<sha256>:read`. This will need to be supported by Giftless first (requiring slight modifications to Giftless' authorization code); then we add generation of such tokens in `ckanext-blob-storage`. This may also require some modification in `ckanext-authz-service` to make scope formats more flexible, as with 2 above. I somewhat prefer this over 2, as we'll end up with slightly cleaner scopes. Downloading will still require us to keep `lfs_prefix` and use it for download batch requests (but not in the JWT token).
4. Do 3, but also do away with the hierarchical storage structure in Giftless entirely (or at least make it optional), so that all objects are accessible without `lfs_prefix`, requiring just sha256 + size. This would be the cleanest solution, but requires the most refactoring across all of `giftless`, `ckanext-blob-storage` and `ckanext-authz-service`. Benefits: no need to keep `lfs_prefix` around — as long as an object's sha256 doesn't change, you can read it (if it changes, it's not the same object...). Download scopes will need to be for a specific sha256. Upload tokens need more analysis, but probably you can always upload (assuming you have write access to anywhere in CKAN); overwriting objects is not possible with Giftless anyway ("should not happen" ™). This also adds the benefit of de-duplicating uploads across all objects, not just those that happen to share an organization / dataset. This requires deeper analysis, but is most likely the cleanest and most robust — though also the most expensive — solution.
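To illustrate what the object-specific scopes of options 3 and 4 might look like, here is a small sketch that derives an `obj:<sha256>:read` scope from an object's content and embeds it in JWT-style claims. The claim layout (`sub`, `scopes`) and helper names are assumptions for illustration only, not Giftless' actual token format.

```python
import hashlib

# Hypothetical sketch of option 3/4 scope generation; not Giftless code.

def object_read_scope(data):
    """Build an "obj:<sha256>:read" scope tied only to the object's
    content hash — no org/dataset hierarchy involved."""
    sha256 = hashlib.sha256(data).hexdigest()
    return "obj:{}:read".format(sha256)


def token_claims(data, user_id):
    """Claims for a token granting read access to exactly one object.

    A rename or move of the dataset/organization cannot invalidate
    this, since nothing here references their names.
    """
    return {
        "sub": user_id,
        "scopes": [object_read_scope(data)],
    }
```

A real implementation would sign these claims into a JWT and hand it to the LFS client; the point of the sketch is just that the scope survives any rename, because the sha256 is the only identifier it carries.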