- Drep-subwf:
I tried to reduce the number of returned files and intermediate copies.
1) The drep step returns Cdb.csv and Mdb.csv for split_drep, and Sdb.csv for selecting the representative of each cluster.
2) split_drep creates a split_text file that defines the clusters.
ex:
This script also creates the mash files in mash_folder (as the classify_drep step did before).
3) classify_drep uses the text file to create the many_genomes and one_genome folders. It does nothing with the mash files.
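The clustering described above can be sketched roughly as follows. This is only an illustration, not the actual split_drep script: it assumes Cdb.csv has `genome` and `secondary_cluster` columns (as in dRep's output), and the function name `split_clusters` and the toy data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def split_clusters(cdb_handle):
    """Group genomes by secondary cluster and split the groups by size."""
    clusters = defaultdict(list)
    for row in csv.DictReader(cdb_handle):
        clusters[row["secondary_cluster"]].append(row["genome"])
    # Clusters with several members vs. singleton clusters
    many = {c: g for c, g in clusters.items() if len(g) > 1}
    one = {c: g for c, g in clusters.items() if len(g) == 1}
    return many, one


# Toy stand-in for Cdb.csv (hypothetical data, assumed column names)
example_cdb = io.StringIO(
    "genome,secondary_cluster\n"
    "A.fa,1_1\n"
    "B.fa,1_1\n"
    "C.fa,2_1\n"
)
many_genomes, one_genome = split_clusters(example_cdb)
print(many_genomes)
print(one_genome)
```

Clusters in `many_genomes` would go to the many_genomes folder and those in `one_genome` to the one_genome folder.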
- GUNC
One-genome-subwf does not return every file with a GUNC decision (_complete.txt or _empty.txt). Instead, a step generates two reports: one listing complete genomes and one listing genomes that did not pass filtering.
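A minimal sketch of how such a report step could work. It assumes each genome produces a marker file named `<genome>_complete.txt` (passed) or `<genome>_empty.txt` (did not pass); that naming and the function name `gunc_reports` are assumptions for illustration, not the pipeline's actual script.

```python
def gunc_reports(marker_names):
    """Split GUNC marker file names into passed and failed genome lists."""
    complete_suffix = "_complete.txt"
    empty_suffix = "_empty.txt"
    passed, failed = [], []
    for name in sorted(marker_names):
        if name.endswith(complete_suffix):
            passed.append(name[: -len(complete_suffix)])
        elif name.endswith(empty_suffix):
            failed.append(name[: -len(empty_suffix)])
    return passed, failed


# Hypothetical marker files from three one-genome runs
passed, failed = gunc_reports(
    ["G1.fa_complete.txt", "G2.fa_empty.txt", "G3.fa_complete.txt"]
)
print(passed)
print(failed)
```

Only the two lists then need to be passed downstream, rather than one marker file per genome.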
- GTDB-Tk and rRNA input folder
The drep step does not return a folder with dereplicated genomes. It returns Sdb.csv with scores, which are used to identify the representative of each cluster.
Added a step that builds the list of dereplicated genomes and genomes that passed GUNC filtering. The script takes the _splittext file from drep-subwf and identifies the many-genomes clusters and their genomes. Then, using Sdb.csv, it picks the best representative of each cluster. The chosen genomes go into the list of drep-filtered genomes.
The script also adds the genomes from the GUNC report that passed filtering. Finally, this step creates a folder with the drep-filtered + GUNC-passed genomes. This folder is the input to GTDB-Tk and rRNA detection.
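The selection logic above can be sketched as follows. This is an illustration under assumptions, not the actual script: Sdb.csv is assumed to have `genome` and `score` columns, "best" is taken to mean the highest score, and the cluster mapping, the function name `choose_genomes` and the toy data are hypothetical (the real _splittext format is not shown here).

```python
import csv
import io


def choose_genomes(clusters, sdb_handle, gunc_passed):
    """Pick the highest-scoring genome per cluster and add GUNC-passed genomes."""
    scores = {
        row["genome"]: float(row["score"]) for row in csv.DictReader(sdb_handle)
    }
    # One representative per many-genomes cluster (assumed: highest Sdb score)
    reps = [max(members, key=lambda g: scores[g]) for members in clusters.values()]
    return sorted(set(reps) | set(gunc_passed))


# Toy inputs (hypothetical)
example_sdb = io.StringIO("genome,score\nA.fa,80.5\nB.fa,92.1\n")
clusters = {"1_1": ["A.fa", "B.fa"]}
selected = choose_genomes(clusters, example_sdb, gunc_passed=["C.fa"])
print(selected)
```

The resulting list is what would be copied into the folder passed to GTDB-Tk and rRNA detection.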
Issues:
1) The condition for GTDB-Tk does not seem to work. This step was commented out entirely.
2) drep-subwf doesn't work in main.cwl. All steps from drep-subwf were moved into main.cwl.
3) mgyg + drep-subwf still don't work together: the Singularity container mount/build fails.
https://github.com/common-workflow-language/cwltool/pull/1387
singularity \
--quiet \
exec \
--contain \
--ipc \
--pid \
--home \
/tmp/5b6632eee34f5fd7bafb4df0fff87c91/9feb/e97b/tmped6yf0er/tmp-out1ovad92y:/ZyhnaJ \
--bind \
/tmp/5b6632eee34f5fd7bafb4df0fff87c91/9feb/e97b/tmp2apzt19papp1cjnv:/tmp:rw \
--bind \
/tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes:/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes:ro \
--bind \
/hps/nobackup/rdf/metagenomics/toil-jobstore/genomes-pipeline-test/marine-12/files/for-job/kind-CWLJob/instance-yhofsng8/file-ae38c694f6044584b802c6595ec8de87/MGYG000296542.fa:/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes/MGYG000296542.fa:ro \
...... all genomes .......
--bind \
/hps/nobackup/rdf/metagenomics/toil-jobstore/genomes-pipeline-test/marine-12/files/for-job/kind-CWLJob/instance-yhofsng8/file-741f2433a0e94b7889d5ce1f7720011c/MGYG000296143.fa:/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes/MGYG000296143.fa:ro \
--pwd \
/ZyhnaJ \
/hps/nobackup/rdf/metagenomics/singularity_cache/microbiomeinformatics_genomes-pipeline.genome-catalog-update:v1.sif \
generate_extra_weight_table.py \
-o \
extra_weight_table.txt \
-d \
/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes
WARNING: Overriding HOME environment variable with SINGULARITYENV_HOME is not permitted
WARNING: skipping mount of /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes: stat /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes: no such file or directory
FATAL: container creation failed: mount /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes->/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes error: while mounting /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes: mount source /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes doesn't exist
Solution:
The generate_extra_weight_table step now runs without a container.
4) The input for the container is too long: there are too many input files to mount. Reducing the number of files solves the problem:
cat step (container was removed)
generate_extra_weight_table (container was removed)
generate_gunc_report (one intermediate step was added)