datahubio / datahub-v2-pm

Project management (issues only)

[EPIC] Private Datasets #59

Closed - zelima closed this issue 6 years ago

zelima commented 6 years ago

As a Publisher I want my dataset to be private when I publish it, so that no one else can access it but me

As a Publisher I want to make my private dataset public (or unlisted) so that I can share it with others

As a Publisher I want to make a public (or unlisted) dataset private so that it is hidden

General

Acceptance Criteria

Tasks

Questions:

Analysis

Phase I - Change dataset privacy via CLI only

data push --findability=private

CLI

We need to know that the dataset is private: https://github.com/datahq/datahub-cli/blob/master/lib/utils/datahub.js#L152

// CLI - include findability alongside the owner when requesting upload authorization
const body = {
  metadata: {
    owner: this._ownerid,
    findability: this._findability
  },
  filedata: fileData
}

const res = await this._fetch('/rawstore/authorize', token, {
  method: 'POST',
  body
})

Bitstore

Put the raw data on S3 under a private ACL (according to the findability sent by the CLI)

# Bitstore
def authorize(auth_token, req_payload):
    owner = req_payload.get('metadata', {}).get('owner')
    findability = req_payload.get('metadata', {}).get('findability', 'unlisted')

    # Choose the S3 ACL based on the dataset's findability
    s3headers = {
        'acl': 'private' if findability == 'private' else 'public-read',
        'Content-MD5': file['md5'],
        'Content-Type': file.get('type', 'text/plain')
    }
    post = s3.generate_presigned_post(
        Bucket=config['STORAGE_BUCKET_NAME'],
        Key=s3path,
        Fields=s3headers)
    ...

Q: how can the assembler read data from the bitstore when it is private?

A: I think we should serve an API that takes a URL, a JWT and an owner as query parameters (or send the URL and owner via the body and the token as a header) and returns a signed URL (which will then be used in the source-spec).

E.g.: api.datahub.io/rawstore/checkurl?url=https://rawstore.datahub.io/core/finance-vix&owner=core&jwt=token

import requests

def check_url(token, owner, url):
    # Public objects do not need signing - return the URL as-is
    response = requests.get(url)
    if response.status_code != 403:
        return {'url': url}
    if owner is None:
        return Response(status=400)
    if not services.verify(token, owner):
        return Response(status=401)
    # Private object - build a pre-signed URL for it
    parsed_url = requests.utils.urlparse(url)
    bucket = parsed_url.netloc
    key = parsed_url.path.lstrip('/')
    signed_url = s3.generate_presigned_url(
        ClientMethod='get_object',
        Params={
            'Bucket': bucket,
            'Key': key
        }
    )
    return {'url': signed_url}

Assembler/planner

We need to export the processed output to S3 (pkgstore) with a private ACL as well: https://github.com/datahq/planner/blob/master/planner/utilities.py#L24

# planner
...
acl = 'private' if findability == 'private' else 'public-read'
...
('dump.to_s3', {
    'force-format': False,
    'handle-non-tabular': handle_non_tabular,
    'add-filehash-to-path': True,
    'bucket': os.environ['PKGSTORE_BUCKET'],
    'path': '/'.join(str(p) for p in parts),
    'acl': acl
})
...

Q: how can the assembler list the objects in pkgstore if they are private?

A: it can't.

Solution: maybe instead of setting the ACL in dump.to_s3 we can dump the objects as public and add a new processor (probably in dpp-aws) that runs after all other processors have executed, lists all keys for the dataset, and changes their ACL to private.

# new processor
import boto3
from datapackage_pipelines.wrapper import process

def modify_datapackage(dp, parameters, stats):
    s3 = boto3.client('s3')
    bucket = parameters['bucket']
    prefix = parameters['path']  # aka dataset
    acl = parameters['acl']
    # Walk every object under the dataset prefix and switch its ACL
    objs = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in objs.get('Contents', []):
        s3.put_object_acl(
            Bucket=bucket,
            Key=obj['Key'],
            ACL=acl)
    return dp

process(modify_datapackage=modify_datapackage)

Serve API for pkgstore signed URLs

We need to generate pre-signed URLs and probably serve a new API so that the user/frontend can read private objects.

We can create a new service, or it may be worth refactoring bitstore to accept a service parameter (rawstore/pkgstore) and reuse rawstore/checkurl (a rough sketch of that variant follows the snippet below).

# New service/API, e.g. api.datahub.io/pkgstore

def get_signed_url(token, url):
    authenticate()
    # Split the pkgstore URL into bucket (host) and object key (path)
    parsed_url = requests.utils.urlparse(url)
    bucket = parsed_url.netloc
    key = parsed_url.path.lstrip('/')
    signed_url = s3.generate_presigned_url(
        ClientMethod='get_object',
        Params={
            'Bucket': bucket,
            'Key': key
        }
    )
    return signed_url
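
As a rough sketch of the "accept a service" refactor mentioned above (the route shape, Flask usage and parameter names are assumptions for illustration, not existing bitstore code), a single handler could serve both stores and reuse the check_url logic sketched earlier:

# Hypothetical sketch - not existing bitstore code
from flask import Flask, request, abort

app = Flask(__name__)

@app.route('/<service>/checkurl')
def check_url_for_service(service):
    # Only the two known stores are valid; anything else is a 404
    if service not in ('rawstore', 'pkgstore'):
        abort(404)
    url = request.args.get('url')
    owner = request.args.get('owner')
    token = request.args.get('jwt')
    # Reuse the signing logic from check_url above;
    # only the underlying bucket differs between the two services
    return check_url(token, owner, url)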

Frontend

To get the datapackage.json for the showcase page we need to generate a signed URL for it - we can use the bitstore API (or the new service from above).

Besides that, the resource URLs inside that dp.json are private as well, so we need to generate signed URLs for them too. Again, we can reuse the API from above to get signed URLs on demand.

https://github.com/datahq/frontend/blob/master/lib/index.js#L82

async getPackageFile(ownerid, name, path = 'datapackage.json') {
    const url = urllib.resolve(this.bitstoreUrl,
        [ownerid, name, 'latest', path].join('/')
        )
    // Ask the signed-url API for a pre-signed URL, then fetch the actual file
    const signedUrlRes = await fetch(`https://api.datahub.io/pkgstore/signed_url?url=${url}`)
    const signed_url = await signedUrlRes.text()
    const response = await fetch(signed_url)
    return response
  }

The same logic applies when requesting resource URLs (preview, download, etc.).
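
As a rough illustration (the helper name and endpoint are assumptions mirroring getPackageFile above, not existing frontend code), resource fetches could go through the same signing step:

// Hypothetical helper - mirrors getPackageFile above, not existing frontend code
async getResourceFile(ownerid, name, resourcePath) {
    const url = urllib.resolve(this.bitstoreUrl,
        [ownerid, name, 'latest', resourcePath].join('/')
        )
    // Exchange the pkgstore URL for a pre-signed one before downloading
    const signedUrlRes = await fetch(`https://api.datahub.io/pkgstore/signed_url?url=${url}`)
    const signed_url = await signedUrlRes.text()
    return fetch(signed_url)
  }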


Below is obsolete

* ~~new processor in dpp-aws update_with_signed_url~~

from datapackage_pipelines.wrapper import process

def modify_datapackage(dp, parameters, stats):
    for resource in dp['resources']:
        resource['path'] = 'api.datahub.io/pkgstore/signed_url?url=' + resource['path']
    return dp

process(modify_datapackage=modify_datapackage)



Q: What do we do about URLs expiring after at most 7 days?
A: This is no longer a problem, as pre-signed URLs are generated on demand (on every request).
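
For reference, a minimal boto3 sketch (bucket, key and expiry below are illustrative assumptions): each pre-signed URL carries its own ExpiresIn, capped at 7 days by SigV4, so signing on every request keeps links valid without needing long-lived URLs.

import boto3

s3 = boto3.client('s3')
# Illustrative bucket/key - a fresh short-lived URL is generated per request
signed_url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={'Bucket': 'pkgstore.example', 'Key': 'core/finance-vix/latest/datapackage.json'},
    ExpiresIn=3600  # one hour; SigV4 allows at most 7 days (604800 seconds)
)
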
akariv commented 6 years ago

Review - need to address these issues:

Only missing thing (I think):