Closed: benjaminran closed this issue 8 years ago
A similar, slightly more dangerous solution: add a flag such as --ignore-missing-files that, when set, skips uploading rows whose data isn't found on disk.
Both of these options do leave it up to the user to keep track of when they've gotten all the files associated with a "master" metadata.tsv uploaded into the system.
In either case, the metadata.tsv that is uploaded along with the selected rows should (I'm open to argument) only contain the metadata for those rows, not the entire original document.
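To make the idea concrete, here is a minimal sketch of how the client could skip rows whose data files are missing and emit a filtered metadata.tsv containing only the selected rows. This is illustrative only: the column name `file_path` and the function names are assumptions, not the actual metadata.tsv schema or client API.

```python
import csv
import os


def filter_present_rows(metadata_path, file_column="file_path"):
    """Yield only the metadata rows whose data file exists on disk.

    Hypothetical sketch: `file_column` is an assumed column name,
    not necessarily what the real metadata.tsv uses.
    """
    with open(metadata_path) as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        for row in reader:
            if os.path.exists(row[file_column]):
                yield row


def write_filtered_metadata(rows, out_path, fieldnames):
    """Write the selected rows to a new metadata.tsv for upload."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

The filtered file would then be what gets uploaded alongside the selected rows, rather than the entire original document.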
Also, I've referenced this issue in the dcc-spinnaker-client repo, since this dcc-spinnaker repo is for the validation / state tracking server instead of for the client.
Thanks. This issue was meant for the client repo; it can be closed here and further discussion can be moved to the client repo.
The client assumes all data listed in the input metadata.tsv is stored locally on disk and processes it in one big pass. For large dataset uploads, the uploader currently has to split the master metadata.tsv into pieces so that they can download one batch of data to the spinnaker client machine's disk, run the client, remove the data, and repeat.
I haven't thought a lot about the best way to fix this, but one good solution would be to add a -r/--rows program option that accepts a list or range of row numbers (e.g. -r 1,2,3 or -r 1-3) and uploads only the files for the specified rows. Then metadata.tsv won't need to be split up. Discussion welcome.
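For reference, parsing a row spec like that is straightforward. A minimal sketch (the function name `parse_rows` is hypothetical, not an existing client API):

```python
def parse_rows(spec):
    """Expand a row spec such as "1,2,3" or "1-3" (or a mix, e.g.
    "1,5-7") into a sorted list of row numbers.

    Hypothetical helper for a -r/--rows option; input validation is
    kept minimal for clarity.
    """
    rows = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            rows.update(range(int(start), int(end) + 1))
        else:
            rows.add(int(part))
    return sorted(rows)
```

The client could then iterate over the parsed row numbers and upload only those entries from the master metadata.tsv, leaving the original file untouched.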