NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License

Make authentication bash script into snakemake workflow #57

clami66 closed this 2 years ago

clami66 commented 2 years ago

Since authentic.sh is quite complex, takes many inputs and generates many outputs, I have turned it into a separate workflow file instead. This forced me to look into the authentication code and notice a few things that could be improved (e.g. issues #53 and #56).

I'm sure a lot of this can be improved, as I'm not exactly fluent in Snakemake, but I hope it is useful.

clami66 commented 2 years ago

@percyfal please also note that this PR changes how some files are named: it avoids the $REF_ID variable, which (I think) makes it impossible to write Snakemake rules. REF_ID is extracted at runtime and used in the names of some output files, and I did not find a way to make this work, even with checkpoints.
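To illustrate the naming problem, here is a minimal sketch in plain Python (not actual workflow code). Output names built only from the sample and taxid can be listed before anything runs, which is what a Snakemake rule needs; a name containing an ID that is only read from a tool's output at runtime cannot. The function name `expected_outputs`, the directory layout and the suffix list are assumptions made for this sketch.

```python
from pathlib import PurePosixPath

# Hypothetical suffixes, chosen for illustration only.
SUFFIXES = ("bam", "sorted.bam", "sorted.bam.bai", "breadth_of_coverage")


def expected_outputs(sample, taxid, root="results/AUTHENTICATION"):
    """Return output paths that depend only on values known before the job runs."""
    outdir = PurePosixPath(root) / sample / taxid
    # The taxid replaces the runtime-only reference ID as the file prefix.
    return [str(outdir / f"{taxid}.{suffix}") for suffix in SUFFIXES]


if __name__ == "__main__":
    for path in expected_outputs("bar", "632"):
        print(path)
```

Because every path is a pure function of `sample` and `taxid`, the same pattern could in principle be written with wildcards in a rule's output section.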

On the other hand, I'm not sure it is necessary to use $REF_ID at all, since it is unique to each sample and taxid. I think it would be good to reach consensus about this in #56 before deciding whether we should merge this.

BTW, this PR fixes #56

percyfal commented 2 years ago

I'm not sure I understand where $REF_ID is used? Do you mean where you use the suffix _done? Otherwise, what you say makes sense. I'll trigger the tests and wait for the consensus on #56 before merging.

clami66 commented 2 years ago

> I'm not sure I understand where $REF_ID is used? Do you mean where you use the suffix _done? Otherwise, what you say makes sense. I'll trigger the tests and wait for the consensus on #56 before merging.

The output files in AUTHENTICATION usually look like this:

b-an01 [/proj/nobackup/metagenomics/ancient-microbiome-smk/.test]$ ls -ltr results/AUTHENTICATION/bar/632/
total 316

-rw-rw----+ 1 pochonz ps30331     96 feb 17 14:20 ref_64.sorted.bam.bai
-rw-rw----+ 1 pochonz ps30331   1335 feb 17 14:20 ref_64.sorted.bam
...
-rw-rw----+ 1 pochonz ps30331 218894 feb 17 14:20 ref_64.breadth_of_coverage
-rw-rw----+ 1 pochonz ps30331   1292 feb 17 14:20 ref_64.bam

Here ref_64 is the name of the first reference sequence output by MaltExtract (which apparently doesn't always work out, as seen in #56). This PR does without the reference name and replaces ref_64 with the taxid again. In theory we could just rename the files so that no ID is used at all, since they are uniquely identified by the folder structure /bar/632.
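As a rough sketch of the renaming idea (plain Python, not code from this PR), the helper below swaps one file-name prefix for another inside a single output directory. The function name `rename_ref_prefix` and its arguments are hypothetical; it assumes all files of interest share the `<prefix>.` naming pattern seen in the listing above.

```python
from __future__ import annotations

from pathlib import Path


def rename_ref_prefix(directory: Path, old_prefix: str, new_prefix: str) -> list[Path]:
    """Rename files named '<old_prefix>.<rest>' to '<new_prefix>.<rest>' in one directory."""
    renamed = []
    # sorted() materialises the listing first, so renaming does not disturb iteration.
    for path in sorted(directory.iterdir()):
        if path.is_file() and path.name.startswith(old_prefix + "."):
            target = path.with_name(new_prefix + path.name[len(old_prefix):])
            if target.exists():
                raise FileExistsError(f"refusing to overwrite {target}")
            path.rename(target)
            renamed.append(target)
    return renamed
```

For example, `rename_ref_prefix(Path("results/AUTHENTICATION/bar/632"), "ref_64", "632")` would turn `ref_64.sorted.bam` into `632.sorted.bam` and leave unrelated files untouched.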