ag-computational-bio / bakrep-web

The user interface for bakrep

Add command line tool to download a collection of (partial) bakrep datasets #50

Closed: lukasjelonek closed this issue 11 months ago

lukasjelonek commented 11 months ago

It should be possible to download a collection of datasets to the user's computer via a command-line tool. The tool should be able to download only certain files for each dataset, e.g. only the protein fasta files. It should also be possible to pass the exported TSV file to download all datasets listed in it.

lukasjelonek commented 11 months ago

CLI examples

Download gff3 files for a tsv export

bakrep download -t bakrep-export.tsv -d /tmp/my-download-dir -m tool:bakta,filetype:gff3 

Download all files for a tsv export

bakrep download -t bakrep-export.tsv -d /tmp/my-download-dir -a

Download a specific dataset

bakrep download -l SAMD12345 -d /tmp/my-download-dir -a

Download multiple specific datasets

bakrep download -l SAMD12345,SAMD77777 -d /tmp/my-download-dir -a

How to download

To support resuming failed or canceled downloads, the command-line tool should persistently track which datasets have already been downloaded. On resume or on the next download, it should identify the missing datasets and download only those.
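A minimal sketch of how such persistent tracking could look. The `DownloadState` class, its method names and the JSON state file are assumptions made for illustration; they are not taken from the actual tool.

```python
import json
from pathlib import Path


class DownloadState:
    """Tracks finished dataset ids in a JSON file (hypothetical format)."""

    def __init__(self, state_file):
        self.state_file = Path(state_file)
        self.done = set()
        if self.state_file.exists():
            # Reload the ids recorded by a previous run.
            self.done = set(json.loads(self.state_file.read_text()))

    def mark_done(self, dataset_id):
        # Persist after every dataset so an interruption loses little work.
        self.done.add(dataset_id)
        tmp = self.state_file.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.done)))
        tmp.replace(self.state_file)  # swap in the fully written file

    def missing(self, requested):
        # Keep the requested order, drop what was already downloaded.
        return [d for d in requested if d not in self.done]
```

On the next run, `DownloadState(path).missing(all_ids)` would yield only the datasets that still need to be fetched.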

For now, we will provide a naive download mechanism that looks up the files of each dataset via the bakrep REST API, filters for the required files, and saves them to disk.
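The lookup, filter and save steps could be sketched roughly as follows. The REST calls are deliberately left out: `list_files` and `fetch` are injected placeholders, and the dictionary keys (`name`, `tool`, `filetype`) as well as the rule that every `key:value` pair of the `-m` matcher must match are assumptions, not the documented bakrep API.

```python
from pathlib import Path


def parse_matcher(expr):
    """Turn 'tool:bakta,filetype:gff3' into {'tool': 'bakta', 'filetype': 'gff3'}."""
    criteria = {}
    for part in expr.split(","):
        key, sep, value = part.strip().partition(":")
        if not sep or not key or not value:
            raise ValueError(f"invalid matcher part: {part!r}")
        criteria[key] = value
    return criteria


def filter_files(files, criteria):
    # Assumption: a file is kept only if every key:value pair matches.
    return [f for f in files if all(f.get(k) == v for k, v in criteria.items())]


def download_dataset(dataset_id, target_dir, list_files, fetch, criteria=None):
    """Look up, filter and save the files of one dataset.

    list_files(dataset_id) -> list of dicts and fetch(entry) -> bytes are
    hypothetical stand-ins for the real REST calls.
    """
    entries = list_files(dataset_id)
    if criteria is not None:
        entries = filter_files(entries, criteria)
    out_dir = Path(target_dir) / dataset_id
    out_dir.mkdir(parents=True, exist_ok=True)
    for entry in entries:
        (out_dir / entry["name"]).write_bytes(fetch(entry))
    return [entry["name"] for entry in entries]
```

Passing `criteria=None` corresponds to the "download all files" case, while `parse_matcher("tool:bakta,filetype:gff3")` restricts the download to the matching files.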

I expect that users do not want to download everything into a single directory. Depending on the volume, it may be better to create subsets (subdirectories) with at most n datasets each. The layout may be computed dynamically depending on the TSV input, or statically (the CLI provides a hard-coded directory schema, maybe the schema we use for storage internally).
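As an illustration of the dynamic variant, the following sketch assigns datasets to subdirectories of at most n entries based on their position in the input. The function name and the `batch_` naming scheme are made up for this example.

```python
def assign_subdirectories(dataset_ids, max_per_dir):
    """Map each dataset id to a subdirectory holding at most max_per_dir datasets."""
    if max_per_dir < 1:
        raise ValueError("max_per_dir must be at least 1")
    ids = list(dataset_ids)
    # Zero-pad the batch index so directory names sort naturally.
    last_batch = max(len(ids) - 1, 0) // max_per_dir
    width = len(str(last_batch))
    return {d: f"batch_{i // max_per_dir:0{width}d}" for i, d in enumerate(ids)}
```

The mapping is stable as long as the order of the input does not change between runs; a static schema derived from the dataset id would avoid that dependency.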

In a first version, downloading may be single-threaded. In future versions this can be updated to use asyncio or multithreading.

How to display progress

The CLI shall display the progress in a simple ${processed}/${total} schema that is updated every time a dataset download finishes. Additionally, it may show the id of the dataset that is currently being downloaded. To notify the user that a download is being continued, an additional line stating the number of already downloaded datasets should be printed.
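A rough sketch of such a progress display. The function names, the exact wording of the resume line, and the choice to print the id of the dataset that just finished are assumptions.

```python
import sys


def format_progress(processed, total, current_id=None):
    """Render the processed/total schema, optionally with a dataset id."""
    line = f"{processed}/{total}"
    if current_id:
        line += f" ({current_id})"
    return line


def run_with_progress(pending, already_done, download, out=sys.stdout):
    """Download pending datasets one by one and report after each one."""
    total = already_done + len(pending)
    if already_done:
        # Extra line so the user sees that an earlier run is being continued.
        print(f"Resuming: {already_done} datasets already downloaded", file=out)
    processed = already_done
    for dataset_id in pending:
        download(dataset_id)
        processed += 1
        print(format_progress(processed, total, dataset_id), file=out)
    return processed
```

Here `download` is any callable that fetches a single dataset, which keeps the reporting independent of how the download itself is implemented.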

lukasjelonek commented 11 months ago

Components

DownloadSet

Manages all datasets that need to be downloaded.

Actions

State

BakrepDownloader

Downloads a dataset to a provided location

Actions

Events

State

None

Download validator

Validates the checksums of all files for a dataset

Actions

State

None
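The checksum validation could look roughly like the sketch below. The issue does not state which checksum algorithm is used or how the expected checksums are delivered, so the SHA-256 default and the name-to-digest mapping are assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def validate_dataset(directory, expected, algorithm="sha256"):
    """Return the names of files that are missing or whose checksum differs.

    expected maps file names to hex digests (hypothetical input format).
    """
    failed = []
    for name, digest in expected.items():
        path = Path(directory) / name
        if not path.is_file() or file_digest(path, algorithm) != digest.lower():
            failed.append(name)
    return failed
```

An empty result means all files of the dataset passed validation, so the dataset can be recorded as downloaded.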

lukasjelonek commented 11 months ago

The tool is now available at https://github.com/ag-computational-bio/bakrep-cli.

lukasjelonek commented 11 months ago

Finished