bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License
396 stars 190 forks source link

assembly sop #6

Open karoraw1 opened 6 years ago

karoraw1 commented 6 years ago

My current understanding of how the metaWRAP pipeline should work involves doing QC on the demultiplexed libraries (I receive the data demultiplexed from the sequencing people), concatenating the outputs, and assembling.

I concatenate by repeatedly running this command on the output of the QC module:

cat $Trim_/Site-D1-DNA/final_pure_reads_1.fastq >> $Trim_/thrashLibs_1.fastq & \
cat $Trim_/Site-D1-DNA/final_pure_reads_2.fastq >> $Trim_/thrashLibs_2.fastq;

cat $Trim_/Site-D2-DNA/final_pure_reads_1.fastq >> $Trim_/thrashLibs_1.fastq & \
cat $Trim_/Site-D2-DNA/final_pure_reads_2.fastq >> $Trim_/thrashLibs_2.fastq;

However, when I assemble the huge pair of files produced by the above, I inconsistently get this error, depending on the input files:

Verification of expression 'irsl.eof() && irsr.eof()' failed in function 'void hammer::CorrectPairedReadFiles(const KMerData&, size_t&, size_t&, size_t&, size_t&, const string&, const string&, std::ofstream*, std::ofstream*, std::ofstream*, std::ofstream*, std::ofstream*)'. In file '/spades/src/projects/hammer/hammer_tools.cpp' on line 188. Message 'Pair of read files /home-3/karoraw1@jhu.edu/scratch/THRASH_LIBS/TRIM/thrashLibs_1.fastq and /home-3/karoraw1@jhu.edu/scratch/THRASH_LIBS/TRIM/thrashLibs_2.fastq contain unequal amount of reads'.
Verification of expression 'irsl.eof() && irsr.eof()' failed in function 'void hammer::CorrectPairedReadFiles(const KMerData&, size_t&, size_t&, size_t&, size_t&, const string&, const string&, std::ofstream*, std::ofstream*, std::ofstream*, std::ofstream*, std::ofstream*)'. In file '/spades/src/projects/hammer/hammer_tools.cpp' on line 188. Message 'Pair of read files /home-3/karoraw1@jhu.edu/scratch/THRASH_LIBS/TRIM/thrashLibs_1.fastq and /home-3/karoraw1@jhu.edu/scratch/THRASH_LIBS/TRIM/thrashLibs_2.fastq contain unequal amount of reads'.
hammer: /spades/src/projects/hammer/hammer_tools.cpp:188: void hammer::CorrectPairedReadFiles(const KMerData&, size_t&, size_t&, size_t&, size_t&, const string&, const string&, std::ofstream*, std::ofstream*, std::ofstream*, std::ofstream*, std::ofstream*): Assertion `irsl.eof() && irsr.eof()' failed.

== Error ==  system call for: "['/home-3/karoraw1@jhu.edu/SPAdes-3.9.0-Linux/bin/hammer', '/home-3/karoraw1@jhu.edu/scratch/THRASH_LIBS/DNA_ASSEMBLY/metaspades/corrected/configs/config.info']" finished abnormally, err code: -6

This is the assembly job script:

thrash_dna_assemble.sh.txt

I assume this result is produced because of orphaned read pairs and am in the process of trying to resync the pairs using BBMap's repair.sh, but I'm not sure this is the right approach.

My questions are as follows:

Thanks! Keith

ursky commented 6 years ago

Hello Keith!

You are correct in you approach to the pipeline. Ideally, the reads from your samples should be QCed separately, and then concatenated into two giant forward and reverse read files for reassembly.

The errors you report absolutely stem from either orphaned reads, or the order of the reads being different between the forward and reverse reads being different. Unfortunately without actually seeing the files I cannot troubleshoot this for you.

It looks like you're doing everything correctly, so I am very curious what is going on. If I remember the algorithm correctly, the forward and reverse reads produced from metaWRAP's Read_QC module are necessarily correctly synced. You can easily test this by trying to assemble the samples individually and seeing if the error is thrown.

The more likely explanation is that something went wrong when you concatenated the files. I would be careful how you use & - it can sometimes produce unexpected results. You might have messed up the order of the files, or started to append the next file before the previous one finished. Its difficult to say without actually seeing the files. Start by checking the number of forward and reverse reads, and then maybe manually inspect the start and end of the files to see if they make sense. I emphasize if you did everything correctly then they should be correctly paired and you should not have to run repair on them.

Of course if you did everything correctly and it still doesn't work, then I have some debugging to do on my end...

karoraw1 commented 6 years ago

I deleted and re-concatenated the libraries one file at a time, and assembly seems now to be working fine. I realized I should have been using double ampersands, if at all.

I will close this non-issue, but I was wondering your opinion on the various possible combinations of flags to provide to the assembly module.

If time and resources are available, is it best to go with both --use-megahit and --use-metaspades?

ursky commented 6 years ago

The module can only assembly with metaSPAdes OR MegaHit. If resources permit, metaSPAdes usually gives a superior assembly. The main downside is that the memory and time scaling is pretty bad with metaSPAdes compared to MegaHit.