alekseyzimin / masurca

GNU General Public License v3.0

No pre-processing of Mate-pair/jump libraries? #11

Closed EarlyEvol closed 6 years ago

EarlyEvol commented 6 years ago

Does Masurca have a module to detect the biotin stuffer sequence in Nextera mate-pair libraries and split the reads? Does "IMPORTANT! Do not use third party tools to pre-process the Illumina data before providing it to MaSuRCA" apply to MP libraries? I'm guessing that Masurca uses k-mer coverage data to compute lots of stuff and is sensitive to different trimming parameters. Since MP data is pretty biased and requires processing to be useful, I'm guessing it doesn't get used for this purpose. I have tried running Masurca with PE, PacBio and MP (unprocessed) data and the assemblies were very poor compared to using just PE and PacBio data (N50 of ~30 vs. 220 kb, respectively), so I'm guessing there is at least something wrong with my raw MP data. Right now I'm running it again including the split MP data. This splitting was preceded by a trimming step recommended for bbtools splitnextera (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/split-nextera-guide/).

Related question: Does the MP data just come into play for the assembly? Can I use an existing output directory and rerun the assembly process including the MP data? I deleted the PE, MP, PB masurca directory and don't remember if there were MP.cor files :(

Thanks, Earl

(also, thanks for developing this assembler!)

alekseyzimin commented 6 years ago

Masurca has a module to remove erroneous reads from Illumina MP libraries, but it assumes that MP libraries are supplied after Illumina post-processing, that is, the reads are already split. They can be innies --->...<---- or outties <----...-----> but they should be split at the Nextera adapters.
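
To make "split at the Nextera adapters" concrete, here is a minimal Python sketch of the idea. It is not MaSuRCA's code and not what bbtools splitnextera does internally. It assumes the junction adapter is `CTGTCTCTTATACACATCT` (plus its reverse complement) and only handles exact matches; the function names and the `min_len` cutoff are made up for illustration. Real tools also tolerate mismatches and carry the quality strings along.

```python
# Assumption: the Nextera mate-pair junction adapter sequence. Check it
# against the documentation for your library prep kit before relying on it.
JUNCTION = "CTGTCTCTTATACACATCT"

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


JUNCTION_RC = reverse_complement(JUNCTION)


def clip_at_junction(read, min_len=25):
    """Keep only the part of a read that precedes the junction adapter.

    Returns the read unchanged if no adapter is found, and None if the
    clipped read is shorter than min_len (hypothetical cutoff).
    """
    hits = [i for i in (read.find(JUNCTION), read.find(JUNCTION_RC)) if i != -1]
    if not hits:
        return read
    clipped = read[:min(hits)]
    return clipped if len(clipped) >= min_len else None


def flip_orientation(read1, read2):
    """Reverse-complement both mates, turning an innie pair into an
    outtie pair (or the other way round)."""
    return reverse_complement(read1), reverse_complement(read2)
```

For example, `clip_at_junction("ACGT" * 10 + JUNCTION + "TTTT")` gives back only the 40 bases in front of the adapter.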

EarlyEvol commented 6 years ago

Ah thanks for the reply. Is it a problem that I did quality and Illumina adapter trimming before the Nextera sequence read splitting? Is Masurca sensitive to MP trimming parameters?

alekseyzimin commented 6 years ago

Not really a problem; MPs are only used for scaffolding. You can use your trimmed MP reads. The warning applies mostly to Illumina PE reads.
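
For illustration, a small Python sketch that writes out how the three read sets might be listed in the DATA section of a MaSuRCA config file. The `PE=` / `JUMP=` / `PACBIO=` layout (two-letter prefix, mean, standard deviation, forward file, reverse file) follows the sample config as I understand it, so compare it with the sample config shipped with your version. Every path, prefix, and insert size below is a hypothetical placeholder, and `data_section` is just a helper name invented for this example.

```python
def data_section(pe, jump, pacbio):
    """Build the text of a DATA block.

    pe and jump are (prefix, mean, stdev, forward_file, reverse_file)
    tuples; pacbio is a path to the long reads. All values used in the
    example below are made-up placeholders.
    """
    lines = [
        "DATA",
        "PE= {} {} {} {} {}".format(*pe),      # untrimmed paired-end reads
        "JUMP= {} {} {} {} {}".format(*jump),  # mate pairs, already split
        "PACBIO={}".format(pacbio),
        "END",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(data_section(
        pe=("pe", 300, 45, "/path/to/pe_R1.fastq", "/path/to/pe_R2.fastq"),
        jump=("mp", 5000, 750, "/path/to/mp_split_R1.fastq", "/path/to/mp_split_R2.fastq"),
        pacbio="/path/to/pacbio.fasta",
    ))
```

The PE entry points at the raw reads, since the pre-processing warning applies there, while the JUMP entry points at the split (and optionally trimmed) MP reads.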

EarlyEvol commented 6 years ago

Awesome. Thanks.