Support of resource download from past activities (revisions)

datopian / ckanext-blob-storage

CKAN extension to offload blob storage to cloud storage providers (S3, GCS, Azure etc).

http://tech.datopian.com/blob-storage/

MIT License

14 stars 6 forks source link

Support of resource download from past activities (revisions) #55

Closed tomeksabala closed 3 years ago

tomeksabala commented 3 years ago

I was trying to decorate blob-storage download blueprints to support accessing giftless files for arbitrary activity_id of resource history. In other words to support the following arg activity_id:

/dataset/<dataset_id>/resource/<resource_id>/download/<filename>?activity_id=<activity_id>
http://ckan/dataset/xxx/resource/yyy/download/tomek.txt?activity_id=zzz

I thought that it's going to be an easy job as IResourceDownloadHandler.resource_download accept dataset and resource in a dictionary form. I have created a simple wrapper which replace the resource dictionary with the old version fetched from activity.

It wouldn't work due to authentication to giftless which bypasses dataset/resource dicts and work only with dataset_id and resource_id. This means that in the end of the day when we receive granted scopes translated to sha256 we get the hash of current resource version ("latest") which doesn't match requested object sha256 which comes from the activity_id.

I hope I did manage to shade some light on the issue. I have also added some inline comments to this PR. This PR is just a rough PoC. This allowed me to successfully download different versions of a single resource from Activity Stream. I would like to ask for early feedback if my approach is valid or otherwise potential suggestions how this feature could be implemented differently. CC: @shevron @rufuspollock

pdelboca commented 3 years ago

Hey @tomeksabala! I'm working on similar approaches so it's good to have some implementation to discuss.

ckanext-blob-storage is a plugabble storage layer and I think it should be agnostic from the logic layer of activities. Example: activity_id is not an LFS concept, so passing an activity_id parameter to get_lfs_download_spec will force to introduce CKAN's concepts into a method whose logic should be only to talk to an LFS server.

In a nutshell, ckanext-blob-storage demands size and sha256 to return a download specs and it shouldn't be aware and shouldn't matter about which activity has it or how to get them. I guess (IMHO) that this implementation needs to be moved to an extension that is dealing with activities itself (or to Core logic to some extend).

I'm working on ckanext-versions this week and gonna implement this feature there. My idea was to implement what you implemented in _get_resource_storage_id but in ckanext-versions download blueprint.

Implementation:

Pass activity_id as an extra parameter to the download blueprint
Get the activity if required
Send to blob-storage the size and sha256 values on that activity.

Let me know your thoughts! :)

shevron commented 3 years ago

From reading the problem description this feels a lot like something that should be fixed on the authz token generation level, and not elsewhere. I tend to agree with @pdelboca's comments on this, and it makes sense to work towards a common solution. I haven't reviewed the problem deeply but plan to do so and then I can comment from a more informed standpoint.

tomeksabala commented 3 years ago

Thanks for a replay. Your comments are very helpful. @pdelboca I'm working with the ckanext-versions too in order to enable dateset level resource versioning in CKAN.

@shevron I feel like the issue is that the authz logic is divided between authz-service and blob-storage extensions. The root cause of my troubles is the storage id generating function _get_resource_storage_id which is part of blob-storage external-storage/authz.py. Maybe a good solution would be to pass the whole resource dict through blob-storage, instead of just resource_id (which does not include the whole information once we deal with history versions).

Anyway, looking forward for more comments from you. Happy to answer any questions or give more clarifications.| Thanks again.

shevron commented 3 years ago

@tomeksabala I think your last suggestion is probably the "best approach" as it would also be future proof to an extent, however I need to look further to understand how feasible it is. I will have more time for this later this week.

pdelboca commented 3 years ago

I have just arrived to the same endpoint of @tomeksabala. I will add more info (or at least reframe the problem)

In a granular level the issue is the following: ckanext-blob-storage can write multiple files into one "folder" (if I edit the resource and upload a new file), but it will always return a normalized storage id for the file matching the latest/current resource since it uses the sha256 of the current resource.

https://github.com/datopian/ckanext-blob-storage/blob/29ce96548018a01b83dc18f45cb85b678d9c8c9d/ckanext/blob_storage/authz.py#L66-L82

So even when I can upload multiple files to the same org/dataset/* folder, I would never be able to access all of them if I built the normalized storage id calling to package_show.

An extra idea: Given that CKAN manages it's permissions at a dataset level, can we generate a normalized scope that will grant access to all files in org/dataset/*? What would be a case where a user can download only one of the objects in that path?

I will put more thoughts on this issue the next few days.

shevron commented 3 years ago

@pdelboca I think your idea makes sense and I had the same thought. I didn't have much time to actually work on the code (hopefully I will today or tomorrow), but the solutions I see to this boil down to:

(1) Making substantial changes to ckanext-authz-service and it's underlying library (authzzie) to allow passing more than just the ID of a resource. This feels like a very major change, as the whole design of authzzie is that from a resource ID in some format and some bound callbacks we can decide what permissions to grant. It wasn't designed to accept more complex input such as the whole dataset dict. So I don't like this one very much.

(2) Follow the path you suggest and normalize auth tokens for org/dataset/* - in the scope of CKAN, I don't currently see a case where we would want to grant permission to one resource in a DS but not the other. I think this was something I looked into in the past, I don't remember how hard it is but I will try to look into it.

(3) Change the bound ID parser and granting methods of a CKAN resource / dataset to accept a more complex resource ID, that can also optionally include a revision / activity ID. Maybe somethink like:

obj:org/dataset/foobar - a reference to resource foobar
obj:org/dataset/foobar?activity=bazbaz - a reference to resource foobar in a given revision / activity ID.

We then also change the permission check callback to accept, parse and handle such extra parameters.

I have not checked how feasible this is, but it can also work.

Would love to hear your additional thoughts, and as I said I will try to find the time today / tomorrow to look deeper into this problem.

pdelboca commented 3 years ago

I'm looking at this today and I have some extra thoughts. I had another priorities this week, but will try to finally came up with an implementation next week.

I overall think that @tomeksabala approach is a good quick one. The main smell I detect is pushing activity_id down in a logic that should only know about lfs servers stuff. What I mean: _get_resource_storage_id(...) shouldn't (IMHO) deal with how to get the lfs attributes in a CKAN's resource dictionary.

Using activity_id as an extra parameter has some extra issues like: a) duplicate the call to activity_show because down in the pipe we lost some of the attributes, and b) do some patches to support resource_id/activity_id.

If the storage management service is in fact a Gift LFS Server, maybe it makes sense the API of the main method: get_resource_download_spec() to directly receive lfs_prefix, sha256, oid and the necessary IDs to validate permissions. This way we can create a good separation between CKAN and LFS Server logics.

Long story short: cascading down the lfs attributes may be a good way improve to this PR (instead of activity_id).

pdelboca commented 3 years ago

Ok, one extra comment. Because with some minor changes I get a functioning implementation using @shevron idea number (2)

Details

Implement a custom download blueprint (or use pre_resource_download) to get the resource in the given activity if activity_id is present in the request.
If activity_id, we have the resource_dict with all the metadata to create the lfs request in _get_resource_download_lfs_objects to point to the specific file we need (oid and size)
Modify the _get_resource_storage_id to return lfs_prefix/* (given that for each dataset, we have access to all resources in it)

Some pseudo implementation

def pre_resource_download(self, resource, package):
    ''' Returns the resource in an old activity object if activity_id is 
    in the request to the download blueprint.
    '''
    activity_id = toolkit.request.params.get('activity_id', None)

    if activity_id:
        activity = toolkit.get_action(u'activity_show')( {u'id': activity_id, u'include_data': True})
        activity_dataset = activity['data']['package']
        activity_resources = activity_dataset['resources']
        for r in activity_resources:
            if r['id'] == resource_id:
                  resource = r
                  break
    else:
        dataset = toolkit.get_action('package_show')(context, {'id': dataset_id})
        for res in dataset['resources']:
            if res['id'] == resource_id:
                resource = res

    return resource

The following is the same function we have right now but hardcoded to return <lfs_prefix>/*. (Of course this should be properly coded)

def _get_resource_storage_id(organization_id, dataset_id, resource_id):
    # type: (str, str, str) -> str
    """Get the exact ID of the resource in storage, as opposed to it's ID in CKAN

    A resource in CKAN is identified by <org_id>/<dataset_id>/<resource_id>. However,
    the <org_id> and <dataset_id> can change if the dataset is renamed or moved or the
    organization is renamed. For this reason, we replace the original CKAN ID with a
    "static" ID composed if <lfs_prefix>/<sha256>. <lfs_prefix> in turn is the
    <org_id>/<dataset_id> that the dataset had *when it was originally uploaded*, and
    does not change over time.
    """
    dataset = toolkit.get_action('package_show')(get_user_context(), {'id': dataset_id})
    for res in dataset['resources']:
        if res['id'] == resource_id and res.get('sha256') and res.get('lfs_prefix'):
            return '{}/{}'.format(res['lfs_prefix'], '*')

    return '{}/{}/{}'.format(organization_id, dataset_id, resource_id)

This 3 steps should work. I'm implementing an MVP of this in https://github.com/datopian/ckanext-versions/pull/22 and so far it works with the hardcoded *.

pdelboca commented 3 years ago

@tomeksabala I think this approach is good to be included as an early implementation and we can keep discussing better approaches later on. Any extra thoughts do you have? I think we can merge but first we need to do some minor changes.

This PR is not in sync with master given that it is still called external_storage
I also left some feedback about minor changes in the code.
Would be nice to have some tests for this new feature.

tomeksabala commented 3 years ago

@pdelboca Thanks a lot for loads of useful comments and information. Sorry for a later response, I was on holidays and only came back to work yesterday. Regarding @shevron idea (2) of passing wildcard permission to all resources in a dataset: it might not be a good idea as we permanently loose ability for more granular access at the resource level. Core CKAN doesn't support this but there are some extensions enabling this feature (my team uses ckanext-restricted).

If I understood your most recent comments @pdelboca we're good to go with solution (3) (passing down more complex resource_id including activity_id). I'll try to rebase this PR on most recent master branch and include your comments to the code today.

Thanks again!

tomeksabala commented 3 years ago

@pdelboca I'm sorry for a delay on this from my end. I needed to focus on other work around my project. The work from this PR is still relevant and I plan to get back to this next week.

pdelboca commented 3 years ago

@tomeksabala I wen ahead and created a new and clean PR to implement this + some tests. (it was easier to start from a clean new PR from master than fixing the conflicts).

Would you mind take a look at it and test it? https://github.com/datopian/ckanext-blob-storage/pull/57

For some reason it's not working with the latest merges in ckanext-authz-service (didn't take a look at it) but it's working as expected with ckanext-authz-service==v0.1.5.

pdelboca commented 3 years ago

Closing this PR. This was implemented in #57