Q: how can assembler read data from bitstore when it is private?
A: I think we should serve an API that takes URL, JWT and owner as query parameters (or sends URL and owner via the body and the token as a header) and returns a signed URL (that will be used in the source-spec).
# planner
...
acl = 'private' if findability == 'private' else 'public-read'
...
('dump.to_s3', {
    'force-format': False,
    'handle-non-tabular': handle_non_tabular,
    'add-filehash-to-path': True,
    'bucket': os.environ['PKGSTORE_BUCKET'],
    'path': '/'.join(str(p) for p in parts),
    'acl': acl,
})
...
Q: how can assembler list the objects in pkgstore if it's private?
A: it can't.
Solution: Maybe instead of setting the ACL during dump.to_s3 we can dump them as public and add a new processor (probably in dpp-aws) that runs after all other processors, lists all keys for the dataset and changes their ACL to private.
# new processor
import boto3
from datapackage_pipelines.wrapper import process

def modify_datapackage(dp, parameters, stats):
    s3 = boto3.client('s3')
    bucket = parameters['bucket']
    key = parameters['path']  # aka dataset
    acl = parameters['acl']
    # list_objects_v2 returns a dict; the object summaries live under 'Contents'
    objs = s3.list_objects_v2(Bucket=bucket, Prefix=key)
    for obj in objs.get('Contents', []):
        s3.put_object_acl(
            Bucket=bucket,
            Key=obj['Key'],
            ACL=acl)
    return dp

process(modify_datapackage=modify_datapackage)
Serve API for pkgstore signed urls
We need to generate pre-signed URLs and probably serve a new API so that the user/frontend can read private objects.
We can create a new service, or it may be worth refactoring bitstore to accept a service parameter (rawstore/pkgstore) and reuse rawstore/checkurl.
To get the datapackage.json for the showcase page we need to generate a signed URL for it - we can use the bitstore API (or the new service from above).
Besides that, the URLs inside that dp.json for resources are private as well, so we need to generate signed URLs for them too. Again we can reuse the API from above to get signed URLs on demand.
How will assembler be able to read data from rawstore if it's private?
How will assembler be able to list objects in pkgstore if it's private?
Frontend - how will the user be able to download resources from pkgstore? As pre-signed URLs have an expiry time, they cannot be part of the datapackage, but need to be generated on demand.
We should probably use the sign-url API to sign the resource URLs as well.
Use the sign-url API only when needed (i.e. when the dataset is private).
What about other ACLs (e.g. listing objects in the bucket)? We need to make sure these are removed from the bucket by default.
The only missing thing (I think):
Add a new processor to planner/assembler that lists all objects for a dataset and changes their ACLs according to the latest findability setting.
As a Publisher I want my dataset to be private when I publish it so that no one else can access it but me
As a Publisher I want to make my private dataset public (or unlisted) so that I can share it with others
As a Publisher I want to make a public (or unlisted) dataset private so that it is hidden
General
Acceptance Criteria
able to update dataset findability from web (or other) UI
Tasks
data push --findability=private
Questions:
Analysis
Phase I - Change dataset privacy via CLI only
dump.to_s3
data push --findability=private
CLI
We need to know that the dataset is private: https://github.com/datahq/datahub-cli/blob/master/lib/utils/datahub.js#L152
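For illustration, the check could look roughly like this; the assumption (not confirmed by the source) is that findability lives under a datahub block of the datapackage descriptor:

```python
def is_private(descriptor):
    # Assumption: findability is stored under the descriptor's 'datahub'
    # block; adjust to wherever datahub-cli actually keeps it.
    return descriptor.get('datahub', {}).get('findability') == 'private'
```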
Bitstore
Put raw data on S3 under a private ACL (according to the findability sent by the CLI).
Q: how can assembler read data from bitstore when it is private?
A: I think we should serve an API that takes URL, JWT and owner as query parameters (or sends URL and owner via the body and the token as a header) and returns a signed URL (that will be used in the source-spec).
Eg: api.datahub.io/rawstore/checkurl?url=https://rawstore.datahub.io/core/finance-vix&owner=core&jwt=token
Assembler/planner
We need to export processed output on S3 (pkgstore) with a private ACL as well: https://github.com/datahq/planner/blob/master/planner/utilities.py#L24
Q: how can assembler list the objects in pkgstore if it's private?
A: it can't.
Solution: Maybe instead of setting the ACL during dump.to_s3 we can dump them as public and add a new processor (probably in dpp-aws) that runs after all other processors, lists all keys for the dataset and changes their ACL to private.
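The step the planner would append after all other steps could be built like this; note that 'aws.change_acl' is a hypothetical processor name (the real name in dpp-aws may differ):

```python
import os

def change_acl_step(findability, parts):
    """Build the pipeline step that re-ACLs every object of a dataset.

    'aws.change_acl' is a hypothetical dpp-aws processor name; the parameters
    mirror the dump.to_s3 step used earlier in the planner.
    """
    acl = 'private' if findability == 'private' else 'public-read'
    return ('aws.change_acl', {
        'bucket': os.environ.get('PKGSTORE_BUCKET', 'pkgstore'),
        'path': '/'.join(str(p) for p in parts),
        'acl': acl,
    })
```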
Serve API for pkgstore signed urls
We need to generate pre-signed URLs and probably serve a new API so that the user/frontend can read private objects.
We can create a new service, or it may be worth refactoring bitstore to accept a service parameter (rawstore/pkgstore) and reuse rawstore/checkurl.
Frontend
To get the datapackage.json for the showcase page we need to generate a signed URL for it - we can use the bitstore API (or the new service from above).
Besides that, the URLs inside that dp.json for resources are private as well, so we need to generate signed URLs for them too. Again we can reuse the API from above to get signed URLs on demand.
https://github.com/datahq/frontend/blob/master/lib/index.js#L82
Same logic applies when requesting resource URLs (preview, download, other...)
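For example, the frontend could rewrite the resource paths of a private dataset so each one goes through the sign-url API; the endpoint URL and query parameters below are assumptions for illustration, not the real API:

```python
from urllib.parse import quote

# Assumed endpoint; the actual sign-url API path may differ.
SIGN_URL_API = 'https://api.datahub.io/pkgstore/sign-url'

def rewrite_resource_paths(dp, jwt):
    """Point each resource path at the (assumed) sign-url API, which would
    return a freshly generated pre-signed URL on every request."""
    for resource in dp.get('resources', []):
        path = resource.get('path')
        if path:
            resource['path'] = f'{SIGN_URL_API}?url={quote(path, safe="")}&jwt={jwt}'
    return dp
```

Generating the signed URL at request time sidesteps the expiry problem: nothing long-lived is ever written into the datapackage.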
Below is obsolete
~~* new processor in dpp-aws: update_with_signed_url

from datapackage_pipelines.wrapper import process

def modify_datapackage(dp, parameters, stats):
    for resource in dp['resources']:
        resource['path'] = 'api.datahub.io/pkgstore/signed_url?url=' + resource['path']
    return dp

process(modify_datapackage=modify_datapackage)~~