Open wshands opened 7 years ago
@wshands @GPelayo , It was my understanding that the original design was to upload the metadata first and then upload the file to the storage system, supposedly because it was considered a better outcome to have a metadata record with no accompanying file than a file with no accompanying metadata, and therefore no way to trace it in the storage system. This was my understanding, but I may be wrong. You can imagine a lot of garbage accumulating over time and therefore increasing production costs. @briandoconnor might be a better resource to answer that.
Yes. Also, the dcc-metadata-client produces the upload manifest used by the icgc-storage-client, which does the upload in the dockstore-tool-runner... so the metadata upload is wired to happen before the result-file upload?
More discussion needed on this
@wshands , the dockstore-tool-runner first registers the file using the dcc-metadata-client (an ICGC tool that talks to the metadata server of Redwood). Upon successful registration of the data in Redwood, the dockstore-tool-runner then runs the icgc-storage-client tool to do the actual upload.
The metadata describing the result files produced by a pipeline is currently uploaded before the result files themselves are uploaded to the storage system. These steps should be reversed, since the file upload is the step more likely to fail. Under the current order, if the result-file upload fails, we are left with metadata indicating that a result file is in the storage system when it is not: when that metadata is later used to locate files and a download of the missing file is attempted, the download fails, and the browser still displays details for a file that does not actually exist in the storage system.

If instead the result files are uploaded first and that upload fails, the process stops and no metadata is uploaded for the pipeline results. And if the metadata upload fails, which is unlikely, the result files will exist in the storage system but the user and the browser simply will not know about them, and the pipeline will just need to be rerun.
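To make the two failure modes concrete, here is a small illustrative sketch (not the real dockstore-tool-runner code; the step names are hypothetical) that runs the two steps in each order and shows what state is left behind when the more failure-prone file upload fails:

```python
def run_steps(steps):
    """Run (name, succeeds) steps in order; return the names of the
    steps that completed before the first failure stopped the run."""
    done = []
    for name, succeeds in steps:
        if not succeeds:
            break  # a failed step halts everything after it
        done.append(name)
    return done

# Current order: metadata first. If the file upload then fails, the
# metadata registration has already happened, leaving a dangling
# metadata record for files that never arrived in the storage system.
current = run_steps([("register_metadata", True), ("upload_files", False)])
print(current)  # ['register_metadata'] -> metadata with no files

# Proposed order: files first. If the file upload fails, nothing has
# been registered, so no metadata falsely advertises missing files;
# downloads and the browser never see a phantom entry.
proposed = run_steps([("upload_files", False), ("register_metadata", True)])
print(proposed)  # [] -> clean failure, just rerun the pipeline
```

The sketch is only about ordering: in the proposed order the unlikely bad case (files uploaded, metadata step fails) leaves orphaned but harmless files, whereas the current order's likely bad case leaves metadata that actively breaks downloads.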