Closed tomeksabala closed 3 years ago
Hey @tomeksabala! I'm working on similar approaches so it's good to have some implementation to discuss.
ckanext-blob-storage
is a plugabble storage layer and I think it should be agnostic from the logic layer of activities. Example: activity_id
is not an LFS concept, so passing an activity_id
parameter to get_lfs_download_spec
will force to introduce CKAN's concepts into a method whose logic should be only to talk to an LFS server.
In a nutshell, ckanext-blob-storage
demands size
and sha256
to return a download specs and it shouldn't be aware and shouldn't matter about which activity has it or how to get them. I guess (IMHO) that this implementation needs to be moved to an extension that is dealing with activities itself (or to Core logic to some extend).
I'm working on ckanext-versions
this week and gonna implement this feature there. My idea was to implement what you implemented in _get_resource_storage_id
but in ckanext-versions
download blueprint.
Implementation:
activity_id
as an extra parameter to the download blueprintsize
and sha256
values on that activity.Let me know your thoughts! :)
From reading the problem description this feels a lot like something that should be fixed on the authz token generation level, and not elsewhere. I tend to agree with @pdelboca's comments on this, and it makes sense to work towards a common solution. I haven't reviewed the problem deeply but plan to do so and then I can comment from a more informed standpoint.
Thanks for a replay. Your comments are very helpful.
@pdelboca I'm working with the ckanext-versions
too in order to enable dateset level resource versioning in CKAN.
@shevron I feel like the issue is that the authz logic is divided between authz-service and blob-storage extensions. The root cause of my troubles is the storage id generating function _get_resource_storage_id
which is part of blob-storage external-storage/authz.py
.
Maybe a good solution would be to pass the whole resource dict through blob-storage, instead of just resource_id
(which does not include the whole information once we deal with history versions).
Anyway, looking forward for more comments from you. Happy to answer any questions or give more clarifications.| Thanks again.
@tomeksabala I think your last suggestion is probably the "best approach" as it would also be future proof to an extent, however I need to look further to understand how feasible it is. I will have more time for this later this week.
I have just arrived to the same endpoint of @tomeksabala. I will add more info (or at least reframe the problem)
In a granular level the issue is the following: ckanext-blob-storage
can write multiple files into one "folder" (if I edit the resource and upload a new file), but it will always return a normalized storage id for the file matching the latest/current resource since it uses the sha256
of the current resource.
So even when I can upload multiple files to the same org/dataset/*
folder, I would never be able to access all of them if I built the normalized storage id calling to package_show
.
An extra idea: Given that CKAN manages it's permissions at a dataset level, can we generate a normalized scope that will grant access to all files in org/dataset/*
? What would be a case where a user can download only one of the objects in that path?
I will put more thoughts on this issue the next few days.
@pdelboca I think your idea makes sense and I had the same thought. I didn't have much time to actually work on the code (hopefully I will today or tomorrow), but the solutions I see to this boil down to:
(1) Making substantial changes to ckanext-authz-service and it's underlying library (authzzie
) to allow passing more than just the ID of a resource. This feels like a very major change, as the whole design of authzzie
is that from a resource ID in some format and some bound callbacks we can decide what permissions to grant. It wasn't designed to accept more complex input such as the whole dataset dict. So I don't like this one very much.
(2) Follow the path you suggest and normalize auth tokens for org/dataset/*
- in the scope of CKAN, I don't currently see a case where we would want to grant permission to one resource in a DS but not the other. I think this was something I looked into in the past, I don't remember how hard it is but I will try to look into it.
(3) Change the bound ID parser and granting methods of a CKAN resource / dataset to accept a more complex resource ID, that can also optionally include a revision / activity ID. Maybe somethink like:
obj:org/dataset/foobar
- a reference to resource foobar
obj:org/dataset/foobar?activity=bazbaz
- a reference to resource foobar
in a given revision / activity ID. We then also change the permission check callback to accept, parse and handle such extra parameters.
I have not checked how feasible this is, but it can also work.
Would love to hear your additional thoughts, and as I said I will try to find the time today / tomorrow to look deeper into this problem.
I'm looking at this today and I have some extra thoughts. I had another priorities this week, but will try to finally came up with an implementation next week.
I overall think that @tomeksabala approach is a good quick one. The main smell I detect is pushing activity_id
down in a logic that should only know about lfs servers stuff. What I mean: _get_resource_storage_id(...)
shouldn't (IMHO) deal with how to get the lfs attributes in a CKAN's resource dictionary.
Using activity_id
as an extra parameter has some extra issues like: a) duplicate the call to activity_show
because down in the pipe we lost some of the attributes, and b) do some patches to support resource_id/activity_id
.
If the storage management service is in fact a Gift LFS Server, maybe it makes sense the API of the main method: get_resource_download_spec()
to directly receive lfs_prefix
, sha256
, oid
and the necessary IDs to validate permissions. This way we can create a good separation between CKAN and LFS Server logics.
Long story short: cascading down the lfs attributes may be a good way improve to this PR (instead of activity_id
).
Ok, one extra comment. Because with some minor changes I get a functioning implementation using @shevron idea number (2)
download
blueprint (or use pre_resource_download
) to get the resource in the given activity if activity_id
is present in the request.activity_id
, we have the resource_dict
with all the metadata to create the lfs request in _get_resource_download_lfs_objects
to point to the specific file we need (oid
and size
)_get_resource_storage_id
to return lfs_prefix/*
(given that for each dataset, we have access to all resources in it)def pre_resource_download(self, resource, package):
''' Returns the resource in an old activity object if activity_id is
in the request to the download blueprint.
'''
activity_id = toolkit.request.params.get('activity_id', None)
if activity_id:
activity = toolkit.get_action(u'activity_show')( {u'id': activity_id, u'include_data': True})
activity_dataset = activity['data']['package']
activity_resources = activity_dataset['resources']
for r in activity_resources:
if r['id'] == resource_id:
resource = r
break
else:
dataset = toolkit.get_action('package_show')(context, {'id': dataset_id})
for res in dataset['resources']:
if res['id'] == resource_id:
resource = res
return resource
The following is the same function we have right now but hardcoded to return <lfs_prefix>/*
. (Of course this should be properly coded)
def _get_resource_storage_id(organization_id, dataset_id, resource_id):
# type: (str, str, str) -> str
"""Get the exact ID of the resource in storage, as opposed to it's ID in CKAN
A resource in CKAN is identified by <org_id>/<dataset_id>/<resource_id>. However,
the <org_id> and <dataset_id> can change if the dataset is renamed or moved or the
organization is renamed. For this reason, we replace the original CKAN ID with a
"static" ID composed if <lfs_prefix>/<sha256>. <lfs_prefix> in turn is the
<org_id>/<dataset_id> that the dataset had *when it was originally uploaded*, and
does not change over time.
"""
dataset = toolkit.get_action('package_show')(get_user_context(), {'id': dataset_id})
for res in dataset['resources']:
if res['id'] == resource_id and res.get('sha256') and res.get('lfs_prefix'):
return '{}/{}'.format(res['lfs_prefix'], '*')
return '{}/{}/{}'.format(organization_id, dataset_id, resource_id)
This 3 steps should work. I'm implementing an MVP of this in https://github.com/datopian/ckanext-versions/pull/22 and so far it works with the hardcoded *
.
@tomeksabala I think this approach is good to be included as an early implementation and we can keep discussing better approaches later on. Any extra thoughts do you have? I think we can merge but first we need to do some minor changes.
master
given that it is still called external_storage
@pdelboca Thanks a lot for loads of useful comments and information. Sorry for a later response, I was on holidays and only came back to work yesterday. Regarding @shevron idea (2) of passing wildcard permission to all resources in a dataset: it might not be a good idea as we permanently loose ability for more granular access at the resource level. Core CKAN doesn't support this but there are some extensions enabling this feature (my team uses ckanext-restricted).
If I understood your most recent comments @pdelboca we're good to go with solution (3) (passing down more complex resource_id including activity_id). I'll try to rebase this PR on most recent master branch and include your comments to the code today.
Thanks again!
@pdelboca I'm sorry for a delay on this from my end. I needed to focus on other work around my project. The work from this PR is still relevant and I plan to get back to this next week.
@tomeksabala I wen ahead and created a new and clean PR to implement this + some tests. (it was easier to start from a clean new PR from master than fixing the conflicts).
Would you mind take a look at it and test it? https://github.com/datopian/ckanext-blob-storage/pull/57
For some reason it's not working with the latest merges in ckanext-authz-service
(didn't take a look at it) but it's working as expected with ckanext-authz-service==v0.1.5
.
Closing this PR. This was implemented in #57
I was trying to decorate blob-storage
download
blueprints to support accessing giftless files for arbitrary activity_id of resource history. In other words to support the following argactivity_id
:I thought that it's going to be an easy job as
IResourceDownloadHandler.resource_download
accept dataset and resource in a dictionary form. I have created a simple wrapper which replace the resource dictionary with the old version fetched from activity.It wouldn't work due to authentication to giftless which bypasses dataset/resource dicts and work only with dataset_id and resource_id. This means that in the end of the day when we receive granted scopes translated to
sha256
we get the hash of current resource version ("latest") which doesn't match requested objectsha256
which comes from theactivity_id
.I hope I did manage to shade some light on the issue. I have also added some inline comments to this PR. This PR is just a rough PoC. This allowed me to successfully download different versions of a single resource from Activity Stream. I would like to ask for early feedback if my approach is valid or otherwise potential suggestions how this feature could be implemented differently. CC: @shevron @rufuspollock