As a Publisher I want to permanently delete (purge) a data package so that it no longer takes up storage space.
Acceptance Criteria
[ ] `data purge my-dataset` permanently deletes the data from DataHub.
[ ] Before deleting, the CLI asks me to type the dataset name to confirm.
[ ] Delete button on the showcase or publisher page? (to be decided)
Tasks
[x] Do analysis
[ ] Serve Delete API in specstore
[ ] Authorize
[ ] Delete from DB
[ ] Delete From S3
[ ] Delete from Elasticsearch
[ ] Tests
[ ] Make request from CLI
[ ] Make sure the user is forced to type the dataset name to confirm (see the CLI sketch after this list)
[ ] Grab the token from auth and make the request to specstore
[ ] Tests
[ ] Test purge command via BB test for assembler (optional, but desirable)
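A minimal sketch of the CLI flow, in Python for illustration (the endpoint path and `Auth-Token` header are assumptions, not the actual specstore API):

```python
# Hypothetical sketch of `data purge`: abort unless the user types
# the exact dataset name, then call specstore's delete endpoint.
import sys
import requests

def confirm_purge(dataset_name):
    print(f"This will PERMANENTLY delete '{dataset_name}' and all its revisions.")
    typed = input("Type the dataset name to confirm: ")
    if typed.strip() != dataset_name:
        print("Name does not match - aborting.")
        sys.exit(1)

def purge(owner, dataset_name, token):
    confirm_purge(dataset_name)
    # Endpoint path and auth header name are assumptions.
    resp = requests.delete(
        f"https://api.datahub.io/source/{owner}/{dataset_name}",
        headers={"Auth-Token": token},
    )
    resp.raise_for_status()
```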
Analysis
The Web API should do the least amount of work so that the data appears deleted:
[ ] Does not appear in search
[ ] Links stop working (i.e. return 404)
[ ] Showcase returns 404
[ ] Storage space appears to the user as reclaimed
However, no data needs to be deleted in the API handler; the dataset is just marked as deleted.
Later, a cron job running every hour/day/week will do the actual deletion of data from ES/S3.
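A rough sketch of the handler side, assuming dataset metadata lives in an Elasticsearch index (the index name, id scheme, and field names are assumptions):

```python
# Hypothetical sketch: the API handler only flags the dataset;
# search, showcase, and size reporting all respect this flag.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch()

def mark_deleted(owner, dataset):
    # Partial update adds {deleted: true} plus a timestamp the
    # cleanup cron job can use as a recovery grace period.
    es.update(
        index="datasets",          # index name is an assumption
        id=f"{owner}/{dataset}",   # id scheme is an assumption
        body={"doc": {
            "deleted": True,
            "deleted_at": datetime.now(timezone.utc).isoformat(),
        }},
    )
```

Search queries would then exclude flagged documents (e.g. a `must_not` clause on `deleted`), and the showcase/link handlers return 404 whenever the flag is set.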
Why?
Doing everything in an API handler could take too much time.
It will work for small datasets, but for datasets with lots of files and revisions it will take too long: the handler will time out and we'll never know about it.
People make mistakes. This approach allows us to potentially recover deleted data.
A cron job will allow us to do things more thoroughly, e.g. checking whether other datasets still use a rawstore file, without time pressure.
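A sketch of what that cron job could look like, assuming boto3 for S3 and the same flag as above (the bucket, index, and field names are all assumptions):

```python
# Hypothetical cleanup job, run hourly/daily/weekly from cron.
import boto3
from elasticsearch import Elasticsearch

es = Elasticsearch()
s3 = boto3.client("s3")
BUCKET = "pkgstore"  # bucket name is an assumption

def has_other_users(key):
    # Count non-deleted datasets referencing the same rawstore object
    # (the `resource_keys` field is an assumed schema).
    res = es.count(index="datasets", body={"query": {"bool": {
        "must": [{"term": {"resource_keys": key}}],
        "must_not": [{"term": {"deleted": True}}],
    }}})
    return res["count"] > 0

def purge_marked_datasets():
    # Pick up everything the API handler flagged.
    hits = es.search(
        index="datasets",
        body={"query": {"term": {"deleted": True}}},
    )["hits"]["hits"]
    for hit in hits:
        for key in hit["_source"].get("resource_keys", []):
            # No time pressure here, so we can safely check for
            # other users of the file before touching S3.
            if not has_other_users(key):
                s3.delete_object(Bucket=BUCKET, Key=key)
        es.delete(index="datasets", id=hit["_id"])
```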
Questions
Sounds much better, I just have a couple of questions [name=irakli]
Q: How exactly does marking files as deleted work?
Mark the ES document as "deleted"?
Add `{deleted: true}` and actually update the document in Elasticsearch?
Yep [name=adam]
Mark datapackage.json as "deleted"?
Add `{deleted: true}` and actually update datapackage.json on S3?
Nope [name=adam]
How does the filemanager know that the size has been reduced, and by how much?
Sum the bytes from dp.json?
We mark the records in Filemanager as 'candidates for deletion' and make sure FM methods know to ignore these rows (sketched below). FM already knows the byte size of each file [name=adam]
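A minimal sketch of that accounting, using sqlite3 for illustration (the table and column names are assumptions):

```python
# Hypothetical Filemanager accounting: rows are flagged rather than
# removed, and size queries skip flagged rows, so storage space
# appears reclaimed to the user immediately.
import sqlite3

# Usage: conn = sqlite3.connect("filemanager.db")

def mark_candidates_for_deletion(conn, owner, dataset):
    conn.execute(
        "UPDATE files SET deletion_candidate = 1 "
        "WHERE owner = ? AND dataset = ?",
        (owner, dataset),
    )
    conn.commit()

def storage_used(conn, owner):
    # FM already knows the byte size of each file, so flagged rows
    # are simply excluded from the sum.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(bytes), 0) FROM files "
        "WHERE owner = ? AND deletion_candidate = 0",
        (owner,),
    ).fetchone()
    return total
```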
Q: What about the revisions, how can we mark them as deleted - update all of them? If not, they will still be accessible on the web, right?
Yes, public and unlisted pkgstore links will still work until S3 is deleted. I think that's fine.
Private datasets shouldn't be accessible (as there's no way to get the private links).
Thought: perhaps change to private before 'deleting'?
Q: What if the user wants to push right after deletion. Will it be pushed as revision 1?
E.g.: I have 15 revisions of garbage and the final one looks good, so I want to get rid of everything and re-push one perfect revision.
Good question. I think that we can return an error in this case ('Dataset is marked for deletion and cannot be updated at this time... try again once the dataset is fully purged.')
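A small sketch of that guard in the push handler (Flask-style for illustration; the route, helper, and 409 status are assumptions):

```python
# Hypothetical guard: pushes are rejected while the dataset is
# flagged for deletion instead of silently creating revision 1.
from flask import Flask, jsonify

app = Flask(__name__)

def is_marked_for_deletion(owner, dataset):
    # Placeholder: would look up the soft-delete flag set by the
    # delete handler (e.g. the `deleted` field in Elasticsearch).
    return False

@app.route("/<owner>/<dataset>", methods=["POST"])
def push(owner, dataset):
    if is_marked_for_deletion(owner, dataset):
        return jsonify(error=(
            "Dataset is marked for deletion and cannot be updated; "
            "try again once the dataset is fully purged."
        )), 409
    # ...normal push flow would start here...
    return jsonify(status="queued")
```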
Originally coming from here: https://github.com/datahq/datahub-qa/issues/130