Space management

GGasch commented 1 year ago

Currently, every run of pcalf-dataset re-downloads all the required genomes, which can add up pretty quickly in terms of disk space. A solution could be to add an option to pcalf-dataset that lets the user download the genomes to a specific directory. pcalf-dataset could then check whether the requested genomes are already present in that directory and download only the new ones, saving both time and space.
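As a rough illustration of the check described above, here is a minimal Python sketch. It is not pcalf code: the function name `split_cached`, the `.fna.gz` suffix, and the "one file per accession, named after the accession" layout are all assumptions made for the example.

```python
from pathlib import Path


def split_cached(requested, genome_dir, suffix=".fna.gz"):
    """Split requested accessions into (cached, missing).

    Hypothetical helper: assumes each genome is stored in genome_dir
    as "<accession><suffix>", which may not match pcalf's real layout.
    """
    genome_dir = Path(genome_dir)
    cached, missing = [], []
    for accession in requested:
        target = genome_dir / f"{accession}{suffix}"
        # A genome counts as cached only if its file already exists.
        if target.is_file():
            cached.append(accession)
        else:
            missing.append(accession)
    return cached, missing
```

Only the accessions returned in `missing` would then need to be downloaded; the ones in `cached` could be reused from the shared directory.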

K2SOHIGH commented 1 year ago

Hello.

Indeed, it could be a solution.

There is already a feature to avoid re-downloading some genomes. You can use the --exclude parameter of the pcalf-dataset command. All you need is a file listing the genome accessions you don't want to download.
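If the genomes from earlier runs sit in one directory, a file for --exclude could be generated from it. The sketch below is only an illustration under assumptions: the helper name `write_exclude_file`, the `.fna.gz` suffix, and the one-accession-per-line file format are guesses, not something confirmed by pcalf.

```python
from pathlib import Path


def write_exclude_file(genome_dir, exclude_path, suffix=".fna.gz"):
    """Write the accessions found in genome_dir to exclude_path.

    Hypothetical helper: assumes files are named "<accession><suffix>"
    and that the exclude file holds one accession per line.
    """
    accessions = sorted(
        p.name[: -len(suffix)]
        for p in Path(genome_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
    # One accession per line, with a trailing newline after each entry.
    Path(exclude_path).write_text("".join(f"{a}\n" for a in accessions))
    return accessions
```

The resulting file could then be passed to pcalf-dataset through --exclude.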

GGasch commented 1 year ago

Hi! I forgot about this one, which is a good one, by the way. Maybe the README.md should mention the accession.txt file, which has this information readily available. I think I will do that.

The other reason I was thinking about creating a dedicated directory for genomes was to make them readily available for other analyses. For instance, I scanned genomes with a calcyanin for the CoBaHMA domain, and having all the genomes from various pcalf runs in the same place would help with this kind of scan.

I will try to work out a solution for this one, even though I am not very familiar with Snakemake.

K2SOHIGH / pcalf

Space management #11