EBI-Metagenomics / genomes-catalogue-pipeline

MGnify genome analysis pipeline

fix drep-subwf #6

Closed KateSakharova closed 3 years ago

KateSakharova commented 3 years ago

- Drep-subwf: I tried to reduce the number of returned files and intermediate copies. 1) The drep step returns the Cdb.csv and Mdb.csv files for split_drep, and Sdb.csv to detect the cluster representatives within clusters. 2) split_drep creates a split_text file that defines the clusters, for example:

many_genomes:1_1:CAJJTO01.fa,CAJKGB01.fa,CAJLGA01.fa
many_genomes:2_1:CAJKRE01.fa,CAJKXJ01.fa
one_genome:3_0:CAJKRY01.fa
one_genome:4_0:CAJKXZ01.fa

This script also creates the mash files in mash_folder (as the classify_drep step did before). 3) classify_drep uses the text file to create the many_genomes and one_genome folders. It does nothing with the mash files.
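For illustration, the split_text format shown above could be read like this. This is only a sketch, not the pipeline's actual code: the `Cluster` type and `parse_split_text` function are hypothetical names, and the format is assumed to be `kind:cluster_name:comma-separated genomes`, as in the example lines.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    kind: str           # "many_genomes" or "one_genome"
    name: str           # cluster label, e.g. "1_1"
    genomes: list[str]  # genome file names in the cluster


def parse_split_text(text: str) -> list[Cluster]:
    """Parse split_text content into a list of clusters (hypothetical helper)."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: <kind>:<cluster_name>:<genome1>,<genome2>,...
        kind, name, members = line.split(":", 2)
        clusters.append(Cluster(kind, name, members.split(",")))
    return clusters


example = """\
many_genomes:1_1:CAJJTO01.fa,CAJKGB01.fa,CAJLGA01.fa
many_genomes:2_1:CAJKRE01.fa,CAJKXJ01.fa
one_genome:3_0:CAJKRY01.fa
one_genome:4_0:CAJKXZ01.fa
"""
clusters = parse_split_text(example)
for cluster in clusters:
    print(cluster.kind, cluster.name, len(cluster.genomes))
```

Something like this would be enough for classify_drep to decide which folder each genome goes into.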

- GUNC: One-genome-subwf doesn't return all the files with GUNC decisions (_complete.txt or _empty.txt). There is a step that generates two reports: one for complete genomes and one for genomes that didn't pass filtering.
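A rough sketch of how such a report step could work, under assumptions not confirmed here: each decision file is named `<genome>_complete.txt` or `<genome>_empty.txt`, `_complete` means the genome passed and `_empty` means it was filtered out. The function name `split_gunc_decisions` and the genome names are hypothetical.

```python
def split_gunc_decisions(filenames):
    """Split GUNC decision file names into (passed, failed) genome lists.

    Assumption: "<genome>_complete.txt" marks a passed genome and
    "<genome>_empty.txt" marks a genome that did not pass filtering.
    """
    passed, failed = [], []
    for filename in sorted(filenames):
        if filename.endswith("_complete.txt"):
            passed.append(filename[: -len("_complete.txt")])
        elif filename.endswith("_empty.txt"):
            failed.append(filename[: -len("_empty.txt")])
        # any other file name is ignored
    return passed, failed


# Made-up file names, for illustration only.
passed, failed = split_gunc_decisions(
    ["genomeB_empty.txt", "genomeA_complete.txt", "notes.log"]
)
print("passed:", passed)
print("failed:", failed)
```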

- GTDB-Tk and rRNA input folder: The drep step doesn't return a folder with dereplicated genomes. It returns Sdb.csv with scores, which are used to identify the representative of each cluster. Added a step that builds the list of dereplicated genomes and genomes that passed GUNC filtering. The script takes the _splittext file from drep-subwf and identifies the many-genomes clusters and their genomes. Then, using Sdb.csv, the script picks the best cluster representative. The chosen genomes go into the list of drep-filtered genomes. The script also adds the genomes from the GUNC report that passed filtering. Finally, this step creates a folder with the drep-filtered + GUNC-passed genomes. This folder is the input to GTDB-Tk and rRNA detection.
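The selection logic described above might look roughly like the sketch below. It assumes Sdb.csv has `genome` and `score` columns and that the best representative is simply the highest-scoring genome in a cluster. The function names and the score values are invented for illustration and are not taken from the pipeline.

```python
import csv
import io


def load_scores(sdb_csv_text):
    """Read Sdb.csv content (assumed columns: genome, score) into a dict."""
    reader = csv.DictReader(io.StringIO(sdb_csv_text))
    return {row["genome"]: float(row["score"]) for row in reader}


def drep_filtered_genomes(split_lines, scores, gunc_passed):
    """Pick one representative per many_genomes cluster, then add GUNC-passed genomes."""
    selected = []
    for line in split_lines:
        kind, _name, members = line.strip().split(":", 2)
        if kind != "many_genomes":
            continue  # one_genome clusters are handled through the GUNC report
        genomes = members.split(",")
        # Assumption: the highest score marks the best representative.
        selected.append(max(genomes, key=lambda g: scores[g]))
    # Add genomes that passed GUNC filtering, avoiding duplicates.
    return selected + [g for g in gunc_passed if g not in selected]


# Scores below are made up for the example.
sdb = """genome,score
CAJJTO01.fa,80.5
CAJKGB01.fa,92.1
CAJLGA01.fa,75.0
CAJKRE01.fa,60.0
CAJKXJ01.fa,61.5
"""
split_lines = [
    "many_genomes:1_1:CAJJTO01.fa,CAJKGB01.fa,CAJLGA01.fa",
    "many_genomes:2_1:CAJKRE01.fa,CAJKXJ01.fa",
    "one_genome:3_0:CAJKRY01.fa",
]
result = drep_filtered_genomes(split_lines, load_scores(sdb), ["CAJKRY01.fa"])
print(result)
```

The resulting list is what would be copied into the folder passed to GTDB-Tk and rRNA detection.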

Issues: 1) The condition for GTDB-Tk doesn't seem to work. This step was commented out entirely. 2) drep-subwf doesn't work in main.cwl. All steps from drep-subwf were moved to main.cwl. 3) mgyg + drep-subwf still don't work together. The Singularity container mount/build failed, see https://github.com/common-workflow-language/cwltool/pull/1387

singularity \
        --quiet \
        exec \
        --contain \
        --ipc \
        --pid \
        --home \
        /tmp/5b6632eee34f5fd7bafb4df0fff87c91/9feb/e97b/tmped6yf0er/tmp-out1ovad92y:/ZyhnaJ \
        --bind \
        /tmp/5b6632eee34f5fd7bafb4df0fff87c91/9feb/e97b/tmp2apzt19papp1cjnv:/tmp:rw \
        --bind \
        /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes:/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes:ro \
        --bind \
        /hps/nobackup/rdf/metagenomics/toil-jobstore/genomes-pipeline-test/marine-12/files/for-job/kind-CWLJob/instance-yhofsng8/file-ae38c694f6044584b802c6595ec8de87/MGYG000296542.fa:/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes/MGYG000296542.fa:ro \
...... all genomes .......
        --bind \
        /hps/nobackup/rdf/metagenomics/toil-jobstore/genomes-pipeline-test/marine-12/files/for-job/kind-CWLJob/instance-yhofsng8/file-741f2433a0e94b7889d5ce1f7720011c/MGYG000296143.fa:/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes/MGYG000296143.fa:ro \
        --pwd \
        /ZyhnaJ \
        /hps/nobackup/rdf/metagenomics/singularity_cache/microbiomeinformatics_genomes-pipeline.genome-catalog-update:v1.sif \
        generate_extra_weight_table.py \
        -o \
        extra_weight_table.txt \
        -d \
        /var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes
    WARNING: Overriding HOME environment variable with SINGULARITYENV_HOME is not permitted
    WARNING: skipping mount of /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes: stat /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes: no such file or directory
    FATAL:   container creation failed: mount /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes->/var/lib/cwl/stg26ff51f8-440d-4afe-b649-b38b754ace9f/mgyg_genomes error: while mounting /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes: mount source /tmp/d7e49f9a67cc50f0850ee65330c22031/6c4b/dcca/tmpu4f2nbr4/out/mgyg_genomes doesn't exist

Solution: the generate_extra_weight_table step now runs without a container.

4) The input for the container is too long: there are too many input files to mount. Decreasing the number of files solves the problem.

KateSakharova commented 3 years ago

This subwf was also updated in a later branch, prepare-output, but the subwf still doesn't work as part of the pipeline (the separate steps work).