Closed lukasjelonek closed 1 year ago
External discussions revealed that the genome fasta files are not part of the workflow results. Nevertheless, this data should be available for download.
I see multiple options:
The extraction and generation options will require us to implement the functionality both in the website and in the download tool. The reference to the original source will require the possibility to add links to the datasets on the server side. The dataset is already designed to contain arbitrary links, but at the moment it is not possible to create them via the API or the upload tool, so this feature must be implemented.
I prefer the last option. Although it may require more programming on the server side, it should reduce the overall amount of work compared to either rerunning the whole analysis + upload or implementing the extraction of data in two places.
Tasks
The dataset -> assembly-url mapping is available in <projectvolume>/upload/assemblies/assembly-urls.tsv
As we have all assemblies downloaded, it should be possible to compute the md5 sums and sizes on the local copies.
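Computing the checksums and sizes on the local copies could look roughly like the sketch below. It only uses the Python standard library; the function name `md5_and_size` and the chunk size are assumptions for illustration, not part of the existing tooling.

```python
import hashlib


def md5_and_size(path, chunk_size=1024 * 1024):
    """Return (md5 hex digest, size in bytes) of a local file.

    The file is read in chunks so that large assemblies do not have
    to fit into memory. `chunk_size` is an arbitrary choice.
    """
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Calling `md5_and_size` on each downloaded assembly yields the two values that would accompany the corresponding link.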
The implementation part is completed. Now the data has to be further processed and uploaded.
The genome sequences have been uploaded, but unfortunately with http URLs instead of https URLs. This results in security warnings when trying to download the files in the browser. The ENA ftp site is also available via https, so it should be sufficient to reupload all assembly-urls with the https scheme.
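Switching the scheme before the reupload could be sketched as follows. This is a minimal example under assumptions: the layout of assembly-urls.tsv is not described in this issue, so the position of the URL column (`url_column`) is hypothetical, as are the function names and the `example.org` host used in the usage note.

```python
from urllib.parse import urlsplit, urlunsplit


def to_https(url):
    """Return the URL with an http scheme replaced by https.

    URLs with any other scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def rewrite_rows(rows, url_column=1):
    """Apply to_https to one column of already-split TSV rows.

    `url_column` is an assumed column index, adjust it to the real file.
    """
    return [
        row[:url_column] + [to_https(row[url_column])] + row[url_column + 1:]
        for row in rows
    ]
```

For example, `to_https("http://example.org/a/genome.fasta.gz")` gives `"https://example.org/a/genome.fasta.gz"`, while a URL that already uses https passes through untouched.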
TODO
During the first initialization of the repository, the genome fasta files were accidentally left out. They need to be uploaded as well. A good moment to include them would be when the metadata is included, see #51