I have run through the different steps of the pipelines for the 3 datasets, but only on a subset of 5 samples.
Here is how long the steps took on our training instance:
S_aureus
bacqc - 6 min
assembleBAC - 26 min
panaroo - 5 min
iqtree - <1 min
M_tuberculosis
bacqc - 6 min
bactmap - 10 min
pseudogenome -
mask_pseudogenome -
iqtree -
S_pneumo
bacqc - 6 min
assembleBAC - 23 min
bactmap - 6 min
pseudogenome - <1 min
gubbins - <1 min
iqtree - <1 min
funcscan - 3 min
From these timings, it's clear that running the larger workflows on the full dataset is not feasible in a workshop setting, as it would take too long.
My proposal is that they run the workflows on a subset of 5 samples to see what the process looks like, and then analyse the outputs from the preprocessed directory.
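For example, a 5-sample subset samplesheet could be made like this (just a sketch, assuming the pipelines take an nf-core-style samplesheet.csv with a header row; the file names and the profile are placeholders to adapt):

    # keep the header line plus the first 5 samples (one sample per row assumed)
    head -n 6 samplesheet.csv > samplesheet_subset.csv

    # then point the workflow at the subset, e.g. for the mapping step:
    # nextflow run nf-core/bactmap --input samplesheet_subset.csv \
    #     --reference reference.fasta --outdir results_subset -profile singularity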
For the downstream steps like phylogeny, it's probably fine to run them on the full datasets, using the preprocessed data as input (the shell scripts might need tweaking in that case, so that they take the preprocessed directory as input).
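As an illustration, a tweaked phylogeny script could look something like this (a minimal sketch; the preprocessed/ path, the alignment file name, and the IQ-TREE options are assumptions to adjust to the actual directory layout):

    #!/bin/bash
    # use the pre-computed full-dataset alignment from the preprocessed
    # directory instead of the alignment from the 5-sample subset run
    ALIGNMENT="preprocessed/S_pneumo/gubbins/core_alignment.filtered_polymorphic_sites.fasta"

    # maximum-likelihood tree with ultrafast bootstrap; -nt AUTO picks the thread count
    iqtree -s "$ALIGNMENT" -m GTR+G -bb 1000 -nt AUTO -pre results/S_pneumo_full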