CDPHE-bioinformatics / CDPHE-SARS-CoV-2

Workflows and scripts for the assembly and analysis of SARS-CoV-2 whole genome tiled amplicon sequencing.
https://cdphe-bioinformatics.github.io/CDPHE-SARS-CoV-2/
GNU General Public License v3.0

[REQUEST] Do not overwrite existing files by default #68

Open sam-baird opened 2 weeks ago

sam-baird commented 2 weeks ago

Feature Request

Files can be accidentally overwritten when outputs are transferred. Overwriting is sometimes intentional, but usually it is not (for example, when running tests on old data and forgetting to change the output path). Checks should be in place so that not overwriting is the default behavior.

Solution

The -n flag of gsutil cp prevents overwriting existing files. We can control whether existing files are overwritten with an overwrite boolean input variable (default: false), then check for consistency between the overwrite variable and whether the file exists: error out if overwrite is false but the file already exists, or if overwrite is true but the file does not already exist.

When the file already exists, the -n option writes Skipping existing item... to stderr, and we can use this output for the check above. To avoid repetitive code, we can create a bash associative array that maps each source file to its destination, then loop over the array: run gsutil cp, capture stderr in a variable, echo the variable, and perform the check.
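A minimal sketch of that loop for the overwrite = false case, assuming bash 4.3 or later (associative arrays and namerefs). The function name copy_no_clobber and the example paths in the comments are hypothetical, and the match on "Skipping existing item" relies on the stderr message described above rather than on anything verified against a specific gsutil version.

```shell
# copy_no_clobber: hypothetical helper, not part of the existing workflows.
# $1 = NAME of an associative array mapping each source file to its destination.
copy_no_clobber() {
    local -n _map="$1"          # nameref to the caller's array (bash 4.3+)
    local src dest cp_stderr
    for src in "${!_map[@]}"; do
        dest="${_map[$src]}"
        # "2>&1 >/dev/null" captures stderr only and discards stdout.
        if ! cp_stderr=$(gsutil cp -n "$src" "$dest" 2>&1 >/dev/null); then
            echo "$cp_stderr" >&2
            echo "ERROR: gsutil cp failed for $src" >&2
            return 1
        fi
        # Echo the captured stderr so it still shows up in the task log.
        echo "$cp_stderr" >&2
        # With -n, an existing destination is skipped rather than overwritten.
        if [[ "$cp_stderr" == *"Skipping existing item"* ]]; then
            echo "ERROR: $dest already exists and overwrite is false" >&2
            return 1
        fi
    done
}

# Example call (paths are placeholders):
#   declare -A transfers=(
#       ["sample1.fasta"]="gs://my-bucket/run1/sample1.fasta"
#   )
#   copy_no_clobber transfers
```

Keeping the check inside one function means each transfer task only has to declare its own source-to-destination map. This covers only overwrite = false; the overwrite = true case needs an existence check before copying.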

Upstream effects

Set overwrite = false in the input JSON to prevent an accidental overwrite in case a previous analysis run's settings were changed to overwrite = true.

Downstream effects

None.

sam-baird commented 2 weeks ago

loop iterate over the associative array running gsutil cp, redirect stderr to a variable, echo the variable, and do the above check.

This will probably be a little complicated, because a different set of commands and checks is likely needed depending on the overwrite variable. For example, if overwrite is true, we would want to run gsutil ls first to see whether the file exists, then error out if it does not, because the variable was likely set in error.
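One possible shape for the branching, as a sketch only: check existence with gsutil ls up front, compare it against the overwrite variable, and only then choose between gsutil cp and gsutil cp -n. The function name copy_with_overwrite_check is hypothetical. Note that gsutil ls can also fail for reasons other than a missing object (for example, permissions), which this sketch does not distinguish.

```shell
# copy_with_overwrite_check: hypothetical helper, not part of the existing workflows.
# $1 = overwrite ("true" or "false"), $2 = source, $3 = destination
copy_with_overwrite_check() {
    local overwrite="$1" src="$2" dest="$3"
    local exists=false
    # gsutil ls exits non-zero when the URL matches no objects.
    if gsutil ls "$dest" >/dev/null 2>&1; then
        exists=true
    fi
    if [[ "$overwrite" == "true" && "$exists" == "false" ]]; then
        echo "ERROR: overwrite is true but $dest does not exist" >&2
        return 1
    fi
    if [[ "$overwrite" == "false" && "$exists" == "true" ]]; then
        echo "ERROR: overwrite is false but $dest already exists" >&2
        return 1
    fi
    if [[ "$overwrite" == "true" ]]; then
        gsutil cp "$src" "$dest"       # intentional overwrite
    else
        gsutil cp -n "$src" "$dest"    # -n stays on as a second safeguard
    fi
}
```

Checking existence first avoids parsing stderr, at the cost of one extra gsutil call per file.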

sam-baird commented 1 week ago

See the H5 repo for an example of looping over files and destinations.