bonsai-team / matam

Mapping-Assisted Targeted-Assembly for Metagenomics
GNU Affero General Public License v3.0
19 stars 9 forks source link

matam_assembly.py crashes at ovgraphbuild stage 'non-zero return code: 136' #98

Closed tpriest0 closed 4 years ago

tpriest0 commented 4 years ago

In order to try and obtain long 16S rRNA reads that can be associated with my metagenomic bins, I decided to use bbmap and matam. I used bbmap to map all raw reads against a metagenomic bin (idfilter=100). I then used the mapped reads as an input for matam, to try and assemble a 16S rRNA. The code I used to run matam was:

matam_assembly.py -d ~/databases/silva/matam/SILVA_128_SSURef_NR95 -i bin_mapped_reads.fq --cpu 8 --max_memory 10000 -v

The initial steps seem successful (e.g. reads compared to database and an output containing 16S rRNA reads is produced) but then it crashes at the 'ovgraphbuild' stage. This is the log file contents:

INFO - === MATAM assembly === INFO - CMD: /bioinf/home/tpriest/software/miniconda/envs/matam/opt/matam-1.6.0/scripts/matam_assembly.py --verbose --cpu 8 --max_memory 10000 --best 10 --evalue 1.00e-05 --score_threshold 0.90 --coverage_threshold 0 --min_identity 1.00 --min_overlap_length 50 --min_read_node 1 --min_overlap_edge 1 --quorum 0.51 --read_correction auto --contig_coverage_threshold 20 --min_scaffold_length 500 --out_dir /scratch/tpriest/fram/flavo_bins/mapped/matam_assembly --ref_db /bioinf/home/tpriest/databases/silva/matam/SILVA_128_SSURef_NR95 --input_fastq /scratch/tpriest/fram/flavo_bins/mapped/fram_B_bin_69_1_mapped_reads.fq INFO - === Input === INFO - Input file: /scratch/tpriest/fram/flavo_bins/mapped/fram_B_bin_69_1_mapped_reads.fq INFO - Input file reads nb: 559254 reads INFO - === Reads mapping against ref db ===

Program: SortMeRNA version 2.1b, 03/03/2016 Copyright: 2012-16 Bonsai Bioinformatics Research Group: LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe 2014-16 Knight Lab: Department of Pediatrics, UCSD, La Jolla, Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. Contact: Evguenia Kopylova, jenya.kopylov@gmail.com Laurent Noé, laurent.noe@lifl.fr Hélène Touzet, helene.touzet@lifl.fr

Computing read file statistics ... done [2.09 sec] size of reads file: 201370704 bytes partial section(s) to be executed: 1 of size 201370704 bytes Parameters summary: Number of seeds = 2 Edges = 4 (as integer) SW match = 2 SW mismatch = -3 SW gap open penalty = 5 SW gap extend penalty = 2 SW ambiguous nucleotide = -3 SQ tags are not output Number of threads = 8

Begin mmap reads section # 1: Time to mmap reads and set up pointers [0.41 sec]

Begin analysis of: /bioinf/home/tpriest/databases/silva/matam/SILVA_128_SSURef_NR95.clustered.fasta Seed length = 18 Pass 1 = 18, Pass 2 = 9, Pass 3 = 3 Gumbel lambda = 0.605719 Gumbel K = 0.333508 Minimal SW score based on E-value = 77 Loading index part 1/1 ... done [20.25 sec] Begin index search ... done [25.66 sec] Freeing index ... done [1.50 sec] Total number of reads mapped (incl. all reads file sections searched): 29 Writing alignments ... done [3.75 sec] Writing aligned FASTA/FASTQ ... done [0.00 sec] INFO - Reads mapping completed in 55.6605 seconds wall time INFO - Identified as marker: 29 / 559254 reads (0.01%) INFO - === Alignment filtering === INFO - Good alignments filtering completed in 0.1081 seconds wall time INFO - === Overlap-graph building ===

PARAM: References: /bioinf/home/tpriest/databases/silva/matam/SILVA_128_SSURef_NR95.clustered.fasta PARAM: Sam file: /scratch/tpriest/fram/flavo_bins/mapped/matam_assembly/workdir/fram_B_bin_69_1_mapped_reads.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.sam PARAM: Output basename: /scratch/tpriest/fram/flavo_bins/mapped/matam_assembly/workdir/fram_B_bin_69_1_mapped_reads.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50 PARAM: ASQG output: 0 PARAM: CSV output: 1 PARAM: Min Overlap: 50 PARAM: Id Threshold: 1 PARAM: NoIndel: 0 PARAM: Debug: 0 PARAM: Verbose: 1 PARAM: Test: 0

TIME: Reference fasta file read in 2.2197 seconds. INFO: 76956 reference sequences were loaded

TIME: References names loaded from the SAM file in 0.000211 seconds. INFO: 12 references are present in the SAM file

TIME: SAM file reading finished in 0.000504 seconds. INFO: 38 bam records were read, representing 29 reads INFO: 38 bam record were mapped on a reference, representing 29 mapped reads

Floating point exception INFO - CMD: /bioinf/home/tpriest/software/miniconda/envs/matam/opt/matam-1.6.0/ovgraphbuild/bin/ovgraphbuild -i 1.0 -m 50 --csv --output_basename /scratch/tpriest/fram/flavo_bins/mapped/matam_assembly/workdir/fram_B_bin_69_1_mapped_reads.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50 -r /bioinf/home/tpriest/databases/silva/matam/SILVA_128_SSURef_NR95.clustered.fasta -s /scratch/tpriest/fram/flavo_bins/mapped/matam_assembly/workdir/fram_B_bin_69_1_mapped_reads.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.sam -v CRITICAL - The last command returns a non-zero return code: 136 Non-zero return code

loic-couderc commented 4 years ago

Hi @tpriest0,

The Overlap-graph building step failed because there is not enough reads that pass the first step of MATAM (i.e there is only 29 reads in your fastq files coming from 16S rRNA).

INFO - Identified as marker: 29 / 559254 reads (0.01%)

With metagenomics data, we expect about 1% of all reads to be 16S rRNA.

tpriest0 commented 4 years ago

Hi @loic-couderc

Thank you for your quick reply. I expected this was the reason, due to me trying to build 16S rRNA genes for single bins and not whole metagenomes. Is there a minimum number of sequences that is required for graph building in the matam pipeline?

ppericard commented 4 years ago

Hi @tpriest0,

Could you explain a bit more what you're trying to do? What type of data are you working with? And what are you trying to do with bbmap?

The way MATAM works is that it will identify and align all rRNA reads from the input dataset using SortMeRNA (this is very sensitive so we don't expect to miss a lot of rRNA reads), and then the downstream steps will try to assemble those rRNA reads.

Usually in whole metagenomic sequencing, we expect less than 1% of all reads to be rRNA, but we had some good results with MATAM even with just a few thousand rRNA reads. In your case, 29 reads are probably too few. The only case it might work is if those reads would come from the exact same rRNA sequence and covering it from 5' to 3'.

tpriest0 commented 4 years ago

Hi @ppericard

My original plan was to try and assemble longer 16S rRNA genes that are associated with specific metagenomic bins that I have, in order to place them in a phylogenetic tree. I mapped all raw reads against a bin and then use the mapped reads as an input for MATAM. I wasn't expecting this to work in every case as a large number of bins lack any rRNA sequences but for the few that do have some I was hoping to reconstruct longer genes. As a result of this, the number of rRNA reads for each input would likely be quite low.

I am now thinking that perhaps I need to approach this from a different angle. Using matam to obtain longer rRNA reads from whole metagenomes works well in my experience, but then how to associate these with specific bins? I think this is an open question in metagenomics that as yet, does not have a clear answer.

ppericard commented 4 years ago

Maybe you could try and do the reverse. Try extracting all the rRNA reads from your complete metagenomic dataset, using SortMeRNA, or directly using MATAM (the first step is a filtering/alignment with SortMeRNA). Then assemble those reads using MATAM, and finally, you could map the assembled contigs to your metagenomic bins. The problem with mapping first against metagenomics bins is that you're using a mapper that is not specifically designed for rRNA reads which means you might not get all the rRNA reads that are in your original dataset. Using SortMeRNA first guarantees that you will get almost every rRNA reads from the complete dataset.

tpriest0 commented 4 years ago

Hi @ppericard

That's not a bad idea. Thank you for the suggestion, I will give that a go.