Follow-up to #45: the current blob storage approach has a problem when a dataset is moved from one organization to another (or is renamed). This is because we store data in blob storage at `{org}/{dataset-name}/` and use that information when performing scope validation in Giftless.
To avoid this, it is proposed to use `<static-prefix>/<dataset-UUID>` as the LFS prefix when storing new resources.
The `<static-prefix>` part is set in config for the whole CKAN site. It is not technically essential, but having it allows us to avoid modifying Giftless, which expects a two-part (org name / repo name) prefix.
Using `dataset-UUID` instead of `dataset-name` means the resource's container prefix never needs to be rewritten or mangled if the dataset is renamed or moved to a different organization, or if the organization itself is renamed.
This also lets us drop the Scope Normalizer solution altogether, as scopes will always be `<static-prefix>/<dataset-UUID>/<sha256>`.
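A minimal sketch of how the new-style prefix and scope could be derived (the `STATIC_PREFIX` constant stands in for the site-wide config value; the name is an assumption, not an actual ckanext-blob-storage config key):

```python
# Assumed stand-in for the site-wide CKAN config value.
STATIC_PREFIX = "lfs"

def lfs_prefix_for(dataset):
    # The dataset id (a UUID) is stable across renames and org moves,
    # so the prefix never needs rewriting.
    return "{}/{}".format(STATIC_PREFIX, dataset["id"])

def scope_for(dataset, sha256):
    # Scopes always take the shape <static-prefix>/<dataset-UUID>/<sha256>.
    return "obj:{}/{}:read".format(lfs_prefix_for(dataset), sha256)
```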
As a compatibility measure, we can:
- Keep `lfs_prefix` for now
- If `lfs_prefix` is set to something that doesn't look like the new prefix format, still go through scope normalization
- Run a migration to move all in-storage objects to new-format containers
- Stop using `lfs_prefix` altogether, as it is no longer needed (although we don't have to drop it, and keeping it could be beneficial at some point, e.g. if an additional change is required).
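The second bullet needs a way to tell old-style prefixes from new-style ones. A sketch, assuming the static prefix value and the exact UUID check shown here (both are illustrative, not the real implementation):

```python
import re

# Assumed site-wide static prefix; in practice this comes from CKAN config.
STATIC_PREFIX = "lfs"

# Standard 8-4-4-4-12 lowercase hex UUID, as used for CKAN dataset ids.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def is_new_style_prefix(lfs_prefix):
    """True if lfs_prefix matches <static-prefix>/<dataset-UUID>;
    anything else should still go through scope normalization."""
    parts = lfs_prefix.split("/")
    return (
        len(parts) == 2
        and parts[0] == STATIC_PREFIX
        and bool(UUID_RE.match(parts[1]))
    )
```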
Tasks
- [x] Change ckanext-blob-storage to use `static-prefix/uuid` as the prefix when uploading ~2d
- [x] Decide whether to keep `lfs_prefix` around. If we do not, we need to flag resources that are in Git LFS in some other way (or rely on `sha256` being set as the indicator)
- [x] Static prefix should be config-based
- [x] Token handling - can be done by registering a new auth handler, by using a (different) scope normalizer logic (probably easiest), or by making some adjustments to ckanext-authz-service
- [x] Upload location - probably an easy change
- [x] Ensure backwards compatibility with already-migrated resources ~1d
  - e.g. by dealing with `lfs_prefix` in the scope normalizer - not needed if we don't need BC, e.g. we can do the migration during downtime
- [x] Write and run a migration script to move resources from the name-based LFS prefix to the UUID-based one ~2d
- [ ] Deployment and testing ~1-2d
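The migration task above could, roughly, look like the following. `storage` is a hypothetical client object with `copy`/`delete` methods; the real script would use the actual cloud SDK (e.g. azure-storage-blob or boto3) and iterate over all LFS-backed resources:

```python
def migrate_resource(storage, resource, dataset, static_prefix="lfs"):
    """Move one resource's blob from the name-based prefix to the
    UUID-based one. `storage` is a hypothetical client, not a real SDK."""
    old_key = "{}/{}".format(resource["lfs_prefix"], resource["sha256"])
    new_key = "{}/{}/{}".format(static_prefix, dataset["id"], resource["sha256"])
    if old_key == new_key:
        return new_key  # already migrated, nothing to do
    storage.copy(old_key, new_key)  # copy first so downloads never break
    storage.delete(old_key)         # then drop the old object
    return new_key
```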
Analysis
What's the problem
Imagine I want to download the blob related to a resource:

1. I get the resource metadata.
2. I go to the ckanext-authz-service endpoint and ask: give me a token to read a resource.
3. The token I ask for will contain the scope `obj:myorg/mydataset/*:read` - to read every resource of myorg/mydataset.
4. I take that token to Giftless and provide it along with the batch request.

How does Giftless know whether it should grant access? It looks at the storage object identified by the request: a `POST` to `/myorg/myrepo/objects/batch` with `{oid: <sha256>}` identifies the object, which can be checked against the scope in the token. In other words, I `POST` to `/{prefix}/objects/batch` with `{oid: <sha256>}`, and Giftless compares that prefix and oid with the provided scopes. A scope accepted by Giftless looks something like `obj:myorg/mydataset/*:read`.

If the check passes, Giftless gives me a token for the storage (a URL).

Where this goes wrong is if I have moved the dataset: the Giftless location still points at the old dataset prefix, while the scope names the new one, so the check fails.
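The mismatch can be illustrated with a toy version of the check (a deliberate simplification of what Giftless actually does, using glob matching):

```python
from fnmatch import fnmatch

def scope_allows(scope, org, repo, oid):
    # scope looks like obj:<org>/<dataset>/*:read
    _, pattern, action = scope.split(":")
    return action == "read" and fnmatch("{}/{}/{}".format(org, repo, oid), pattern)

# Token issued after the dataset moved to neworg, but the object is
# still stored (and requested) under the old org/name prefix:
token_scope = "obj:neworg/mydataset/*:read"
print(scope_allows(token_scope, "neworg", "mydataset", "abc123"))  # True
print(scope_allows(token_scope, "myorg", "mydataset", "abc123"))   # False: access denied
```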
Options
- Flat namespace: `/{sha256}` in a storage space that is purely content-addressed
- Scoped storage with entity UUID: `/{dataset-uuid}/{sha256}` - Preferred
- Relocate data ...
Temporary solution
Quick fix: Scope Normalizer based workaround - DONE in #47
Change from `obj:myorg/myrepo/sha256:read` to `obj:*/*/sha256:read` or even `obj:sha256:read`.
Assumption: a scope normalizer function registered in ckanext-blob-storage for `obj` scopes can mangle requests for `res:<org>/<dataset>/<sha256>:read` into something like `res:*/*/<sha256>`.
If this is true, we can:
- Fix up Giftless to accept such scopes and only check the sha256 (most likely quick)
- Fix up ckanext-blob-storage and any relevant JS code handling downloads to include the sha256 in the scope auth request, and ensure this input format is accepted and the scope is granted
This should work around the problem. It means that we continue to rely on `lfs_prefix` to send the batch request, but not to get the auth token when downloading.
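A sketch of the normalizer rewrite described above (the real hook signature in ckanext-authz-service may differ; this only shows the string transformation itself):

```python
def normalize_obj_scope(scope):
    """Rewrite obj:<org>/<dataset>/<sha256>:read to obj:*/*/<sha256>:read,
    so only the content hash is checked, not the (movable) prefix."""
    entity, subscope, action = scope.split(":")
    parts = subscope.split("/")
    if entity == "obj" and len(parts) == 3:
        subscope = "*/*/{}".format(parts[2])
    return "{}:{}:{}".format(entity, subscope, action)
```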