PacificBiosciences / pb-metagenomics-tools

Tools and pipelines tailored to using PacBio HiFi Reads for metagenomics
BSD 3-Clause Clear License

Error in localrule MAGContigNames #86

Closed pailloufat-stack closed 3 weeks ago

pailloufat-stack commented 3 weeks ago

Hi @dportik ,

I have an issue with the MAGContigNames rule. It is the same problem as in issue #84, but the author solved it without any explanation. As a result, the 8-summary directory is empty. I get:

grep -h '>' /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/8-summary/asm/MAGs*.fa | cut -d'>' -f2 1> /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/2-bam/asm.MAG_contigs.txt 2> /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/logs/asm.MAGContigNames.log
[Thu Sep 26 10:03:15 2024]
Error in rule MAGContigNames:
    jobid: 29
    input: /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/8-summary/asm/asm.HiFi_MAG.summary.txt, /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/8-summary/asm/MAGs
    output: /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/2-bam/asm.MAG_contigs.txt
    log: /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/logs/asm.MAGContigNames.log (check log file(s) for error details)
    shell:
        grep -h '>' /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/8-summary/asm/MAGs*.fa | cut -d'>' -f2 1> /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/2-bam/asm.MAG_contigs.txt 2> /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/logs/asm.MAGContigNames.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

No file matches /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/8-summary/asm/MAGs*.fa, which explains why the grep command failed.
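To illustrate why a pattern that matches no file is enough to fail the whole rule, here is a minimal sketch. It assumes the "bash strict mode" mentioned in the error message behaves like `set -euo pipefail`, and it uses a throwaway directory and a made-up `missing*.fa` pattern rather than the real pipeline paths.

```shell
#!/usr/bin/env bash
# Sketch only: the temp directory and the "missing*.fa" pattern are hypothetical.
workdir=$(mktemp -d)

# Plain bash: the pipeline reports the status of its last command (cut),
# so the grep failure on the unmatched pattern goes unnoticed.
loose_status=0
bash -c 'grep -h ">" "$1"/missing*.fa 2>/dev/null | cut -d">" -f2' _ "$workdir" \
    || loose_status=$?

# Strict mode (assumed here to be -e, -u and pipefail): the non-zero
# status from grep becomes the status of the whole pipeline.
strict_status=0
bash -eu -o pipefail -c 'grep -h ">" "$1"/missing*.fa 2>/dev/null | cut -d">" -f2' _ "$workdir" \
    || strict_status=$?

echo "without pipefail: exit status $loose_status"
echo "with pipefail:    exit status $strict_status"
rm -rf "$workdir"
```

The same command that silently produces an empty file in an interactive shell is therefore reported as an error when it runs inside the rule.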

I suspect something is wrong with the previous GTDBTk rules, but the logs/asm.GTDBTkAnalysis.log file shows no errors:

[2024-09-26 09:39:56] INFO: GTDB-Tk v2.1.1
[2024-09-26 09:39:56] INFO: gtdbtk classify_wf --batchfile /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/6-checkm2/asm/asm.GTDBTk_batch_file.txt --out_dir /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/7-gtdbtk/asm/ -x fa --prefix asm --cpus 8
[2024-09-26 09:39:56] INFO: Using GTDB-Tk reference data version r207: /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/release207_v2
[2024-09-26 09:39:56] INFO: Identifying markers in 7 genomes with 8 threads.
[2024-09-26 09:39:56] TASK: Running Prodigal V2.6.3 to identify genes.
[2024-09-26 09:40:10] INFO: Completed 7 genomes in 13.95 seconds (1.99 seconds/genome).
[2024-09-26 09:40:10] TASK: Identifying TIGRFAM protein families.
[2024-09-26 09:40:14] INFO: Completed 7 genomes in 3.88 seconds (1.80 genomes/second).
[2024-09-26 09:40:14] TASK: Identifying Pfam protein families.
[2024-09-26 09:40:15] INFO: Completed 7 genomes in 0.24 seconds (29.07 genomes/second).
[2024-09-26 09:40:15] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2024-09-26 09:40:15] TASK: Summarising identified marker genes.
[2024-09-26 09:40:15] INFO: Completed 7 genomes in 0.17 seconds (41.21 genomes/second).
[2024-09-26 09:40:15] INFO: Done.
[2024-09-26 09:40:15] INFO: Aligning markers in 7 genomes with 8 CPUs.
[2024-09-26 09:40:15] INFO: Processing 7 genomes identified as bacterial.
[2024-09-26 09:40:20] INFO: Read concatenated alignment for 62,291 GTDB genomes.
[2024-09-26 09:40:20] TASK: Generating concatenated alignment for each marker.
[2024-09-26 09:40:20] INFO: Completed 7 genomes in 0.03 seconds (262.62 genomes/second).
[2024-09-26 09:40:21] TASK: Aligning 120 identified markers using hmmalign 3.1b2 (February 2015).
[2024-09-26 09:40:24] INFO: Completed 120 markers in 3.17 seconds (37.80 markers/second).
[2024-09-26 09:40:24] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2024-09-26 09:42:10] INFO: Completed 62,298 sequences in 1.75 minutes (35,515.07 sequences/minute).
[2024-09-26 09:42:10] INFO: Masked bacterial alignment from 41,084 to 5,036 AAs.
[2024-09-26 09:42:10] INFO: 0 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2024-09-26 09:42:10] INFO: Creating concatenated alignment for 62,298 bacterial GTDB and user genomes.
[2024-09-26 09:42:28] INFO: Creating concatenated alignment for 7 bacterial user genomes.
[2024-09-26 09:42:29] INFO: Done.
[2024-09-26 09:42:29] TASK: Placing 7 bacterial genomes into backbone reference tree with pplacer using 8 CPUs (be patient).
[2024-09-26 09:42:29] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/7-gtdbtk/asm/align/asm.bac120
==> Step 2 of 9: Pre-masking sequences.
[2024-09-26 09:44:32] INFO: Calculating RED values based on reference tree.
[2024-09-26 09:44:33] INFO: 7 out of 7 have an class assignments. Those genomes will be reclassified.
[2024-09-26 09:44:33] TASK: Placing 2 bacterial genomes into class-level reference tree 3 (1/4) with pplacer using 8 CPUs (be patient).
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/7-gtdbtk/asm/classify/interme
==> Step 2 of 9: Pre-masking sequences.
[2024-09-26 09:50:07] INFO: Calculating RED values based on reference tree.
[2024-09-26 09:50:09] TASK: Traversing tree to determine classification method.
[2024-09-26 09:50:09] INFO: Completed 2 genomes in 0.00 seconds (7,423.55 genomes/second).
[2024-09-26 09:50:09] TASK: Calculating average nucleotide identity using FastANI (v1.3).
[2024-09-26 09:50:11] INFO: Completed 26 comparisons in 1.75 seconds (14.82 comparisons/second).
[2024-09-26 09:50:11] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2024-09-26 09:50:11] TASK: Placing 2 bacterial genomes into class-level reference tree 7 (2/4) with pplacer using 8 CPUs (be patient).
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/7-gtdbtk/asm/classify/interme
==> Step 2 of 9: Pre-masking sequences.
[2024-09-26 09:53:14] INFO: Calculating RED values based on reference tree.
[2024-09-26 09:53:15] TASK: Traversing tree to determine classification method.
[2024-09-26 09:53:15] INFO: Completed 2 genomes in 0.00 seconds (4,978.40 genomes/second).
[2024-09-26 09:53:15] TASK: Calculating average nucleotide identity using FastANI (v1.3).
[2024-09-26 09:53:22] INFO: Completed 58 comparisons in 6.80 seconds (8.53 comparisons/second).
[2024-09-26 09:53:22] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2024-09-26 09:53:22] TASK: Placing 2 bacterial genomes into class-level reference tree 4 (3/4) with pplacer using 8 CPUs (be patient).
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/7-gtdbtk/asm/classify/interme
==> Step 2 of 9: Pre-masking sequences.
[2024-09-26 09:59:04] INFO: Calculating RED values based on reference tree.
[2024-09-26 09:59:06] TASK: Traversing tree to determine classification method.
[2024-09-26 09:59:06] INFO: Completed 2 genomes in 0.00 seconds (2,211.60 genomes/second).
[2024-09-26 09:59:07] TASK: Calculating average nucleotide identity using FastANI (v1.3).
[2024-09-26 09:59:18] INFO: Completed 124 comparisons in 11.10 seconds (11.17 comparisons/second).
[2024-09-26 09:59:18] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2024-09-26 09:59:18] TASK: Placing 1 bacterial genomes into class-level reference tree 1 (4/4) with pplacer using 8 CPUs (be patient).
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/7-gtdbtk/asm/classify/interme
==> Step 2 of 9: Pre-masking sequences.
[2024-09-26 10:03:07] INFO: Calculating RED values based on reference tree.
[2024-09-26 10:03:09] TASK: Traversing tree to determine classification method.
[2024-09-26 10:03:09] INFO: Completed 1 genome in 0.00 seconds (5,817.34 genomes/second).
[2024-09-26 10:03:10] TASK: Calculating average nucleotide identity using FastANI (v1.3).
[2024-09-26 10:03:12] INFO: Completed 14 comparisons in 1.84 seconds (7.62 comparisons/second).
[2024-09-26 10:03:12] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2024-09-26 10:03:12] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2024-09-26 10:03:12] INFO: Done.
[2024-09-26 10:03:12] INFO: Removing intermediate files.
[2024-09-26 10:03:12] INFO: Intermediate files removed.
[2024-09-26 10:03:12] INFO: Done.

I have 218 bins after the dereplication step (counted with ll 5-dereplicated-bins/asm | wc -l). The subsequent CheckM2 filtering step ran without errors, but at the end of the job I only got 7 bins, which is quite low, right?
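As a side note on the counting itself, a listing piped into `wc -l` can overcount, because `ls -l` prints a leading "total" line. The sketch below assumes that `ll` is an alias for `ls -l` and that the bins are files ending in `.fa`. Neither is confirmed in this thread, and a temporary directory with three empty files stands in for 5-dereplicated-bins/asm.

```shell
#!/usr/bin/env bash
# Sketch only: a temp directory with made-up file names replaces the real bin folder.
bins_dir=$(mktemp -d)
touch "$bins_dir/bin.1.fa" "$bins_dir/bin.2.fa" "$bins_dir/bin.3.fa"

# Counting listing lines includes the "total ..." header printed by ls -l.
ls_count=$(ls -l "$bins_dir" | wc -l | tr -d ' ')

# Counting the FASTA files directly avoids the extra line
# (the .fa extension is an assumption).
fa_count=$(find "$bins_dir" -maxdepth 1 -type f -name '*.fa' | wc -l | tr -d ' ')

echo "lines from ls -l: $ls_count"
echo "FASTA files:      $fa_count"
rm -rf "$bins_dir"
```

The difference is only one line, so it does not change the picture here: most dereplicated bins did not pass the filtering step.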

cat /home/vipailler/Scripts/Hifi-MAG/pb-metagenomics-tools/HiFi-MAG-Pipeline/logs/asm.AssessCheckm2Bins.log
make_checkm_df: Making checkm2 dataframe.
make_depth_dict: Making depth dictionary.
add_contig_numbers_and_status: Adding contig numbers and assessing Pass/Fail filtering status.
add_contig_numbers_and_status: Done.
get_passing_bins: Identifying bins passing filters.
write_gtdb_batch_file: Writing GTDB batch file.
write_gtdb_batch_file: 7 bins passed filtering.
write_fork_target_file: Writing fork target file.
write_updated_tsv_file: Writing updated tsv file.

Is this low number of bins what leads to an incomplete taxonomic assignment by GTDB-Tk? Would you have an explanation? Best

bak1121 commented 3 weeks ago

Hi @pailloufat-stack , I solved this issue by adding a forward slash (/) at line 558 in Snakefile-hifimags.smk, changing
"grep -h '>' {input.mag_dir}*.fa | cut -d'>' -f2 1> {output} 2> {log}"
to
"grep -h '>' {input.mag_dir}/*.fa | cut -d'>' -f2 1> {output} 2> {log}"
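The effect of the missing slash can be reproduced outside the pipeline. In this minimal sketch, `mag_dir` plays the role of `{input.mag_dir}`, and the temporary directory, the file name `bin.1.fa` and the contig names are all made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: directory layout and FASTA content are hypothetical.
demo=$(mktemp -d)
mkdir "$demo/MAGs"
printf '>contig_1\nACGT\n>contig_2\nGGCC\n' > "$demo/MAGs/bin.1.fa"
mag_dir="$demo/MAGs"

# Without the slash the pattern is ".../MAGs*.fa": it looks for files NEXT TO
# the MAGs directory whose names start with "MAGs", and none exist.
# "|| true" keeps the sketch running even if strict mode is enabled.
before=$(grep -h '>' "$mag_dir"*.fa 2>/dev/null | cut -d'>' -f2 || true)

# With the slash the pattern is ".../MAGs/*.fa": it matches the FASTA files
# INSIDE the directory, and the contig names are extracted.
after=$(grep -h '>' "$mag_dir"/*.fa | cut -d'>' -f2)

echo "without slash: '${before}'"
echo "with slash:"
echo "$after"
rm -rf "$demo"
```

This matches the failing command in the error message, where the expanded path ends in MAGs*.fa rather than MAGs/*.fa.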

pailloufat-stack commented 3 weeks ago

Hi @bak1121 , Indeed, that solved the problem. Thanks