Add genome sequences to the repository

lukasjelonek commented 1 year ago

During the first initialization of the repository, the genome fasta files were accidentally left out. They need to be uploaded as well. A good moment to include them would be when the metadata is added, see #51

lukasjelonek commented 1 year ago

External discussions revealed that the genome fasta files are not part of the workflow results. Nevertheless, this data should be available for download.

I see multiple options:

rerun workflow with genome fasta output enabled
extract the fasta sequences from the gff3 files
generate the fasta sequences from the bakta-json files
generate the fasta sequences from the gbff files
reference the original assembly sequences at ENA: http://ftp.ebi.ac.uk/pub/databases/ENA2018-bacteria-661k/Assemblies/

The extraction and generation options will require us to implement the functionality both in the website and in the download tool. Referencing the original source will require a way to add links to the datasets on the server side. The dataset is already designed to contain arbitrary links, but at the moment they cannot be created via the API or the upload tool, so this feature must be implemented.
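For comparison, the gff3 extraction option could be sketched as follows. This assumes the gff3 files embed their sequences after a `##FASTA` directive, which the GFF3 format permits but which has not been verified for these files in this thread. The function names are hypothetical and not part of the website or download tool.

```python
from typing import Iterable, Iterator


def fasta_lines_from_gff3(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that follows the ##FASTA directive of a GFF3 file."""
    in_fasta = False
    for line in lines:
        if in_fasta:
            # Everything after the directive is plain FASTA content.
            yield line
        elif line.strip() == "##FASTA":
            in_fasta = True


def extract_fasta(gff3_text: str) -> str:
    """Return the embedded FASTA section, or an empty string if there is none."""
    return "".join(fasta_lines_from_gff3(gff3_text.splitlines(keepends=True)))
```

The same logic would have to exist twice (website and download tool), which is the duplication the comment above argues against.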

I prefer the last option. Although it may require more programming on the server side, it should reduce the overall amount of work compared to either rerunning the whole analysis and upload, or implementing the data extraction in two places.

lukasjelonek commented 1 year ago

Tasks

[x] Implement "add external link to dataset" on server
[x] Generate mapping of dataset -> genome fasta url
[x] Check if it is possible to add md5 sums to external links. If not, allow results without md5 sums
[x] Implement add link action or parameter to upload cli

lukasjelonek commented 1 year ago

The dataset -> assembly URL mapping is available in <projectvolume>/upload/assemblies/assembly-urls.tsv

As we have all assemblies downloaded, it should be possible to compute the md5 sums and sizes on the local copies.
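A minimal sketch of computing both values for one local assembly file, reading in chunks so large files do not have to fit in memory. The function name and chunk size are illustrative choices, not taken from the project.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def md5_and_size(path: Path, chunk_size: int = 1 << 20) -> Tuple[str, int]:
    """Return the hex md5 digest and the size in bytes of the file at `path`."""
    digest = hashlib.md5()
    size = 0
    with path.open("rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

If the assemblies are stored compressed, this yields the checksum of the compressed file, which is what a download from the remote URL would be checked against.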

lukasjelonek commented 1 year ago

The implementation part is completed. Now the data has to be further processed and uploaded.

[x] combine assembly urls, md5 sums and assembly file sizes to json documents
[x] update all entries
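The combination step might look roughly like this. It assumes the TSV has two tab-separated columns (dataset id, URL) without a header, and the JSON field names `url`, `md5` and `size` are hypothetical; the actual document schema expected by the server is not given in this thread.

```python
import csv
import io
import json
from typing import Dict, Tuple


def build_reference_documents(
    urls_tsv: str, checksums: Dict[str, Tuple[str, int]]
) -> Dict[str, str]:
    """Map each dataset id to a JSON document with url, md5 and size.

    `checksums` maps dataset id -> (md5 hex digest, size in bytes).
    """
    documents = {}
    for row in csv.reader(io.StringIO(urls_tsv), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        dataset_id, url = row[0], row[1]
        md5, size = checksums[dataset_id]
        # Field names are illustrative assumptions, not the server schema.
        documents[dataset_id] = json.dumps(
            {"url": url, "md5": md5, "size": size}, sort_keys=True
        )
    return documents
```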

lukasjelonek commented 1 year ago

The genome sequences have been uploaded, but unfortunately with http URLs instead of https URLs. This results in security warnings when trying to download the files in the browser. The ENA ftp site is also available via https, so it should be sufficient to reupload all assembly URLs with the https scheme.
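Switching the scheme can be done without touching the rest of the URL. A small sketch using only the standard library; the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_https(url: str) -> str:
    """Return `url` with an http scheme replaced by https; other URLs are unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```

Applied to the base URL mentioned earlier, `to_https("http://ftp.ebi.ac.uk/pub/databases/ENA2018-bacteria-661k/Assemblies/")` keeps host and path and only changes the scheme.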

TODO

[x] recreate all assembly reference documents
[x] reupload all entries

ag-computational-bio / bakrep-web

Add genome sequences to the repository #55