cerebis / meta-sweeper

Parametric sweep of simulated microbial communities and metagenomic sequencing.
GNU General Public License v3.0

Longer read-pairs (250bp) can lead to read-through problems. Add a warning. #45

Closed: cerebis closed this issue 7 years ago

cerebis commented 7 years ago

This is something I have already worked out, but it's amusing enough to mention as an issue. It has been resolved in commit 375f4ef20ca69cdb88c2d6e7e7bc1d06765f8679.

I had been seeing badly mapping reads since updating simForward. I regressed the work far enough to include the old read-generation methods and still saw the same results -- the reads were identical for the old and new error-free modes, with error simulation differing by the expected degree. Still, this didn't explain the mapping!

As it turns out, I was previously running simForward with 150bp sequencing. This is the key! The default fragment size was short enough that for 250bp sequencing (MiSeq mode) there was frequent read-through, which then produces soft and hard clipping in BWA MEM.

So, I've basically tested my own methods -- inadvertently -- for read-through issues...

I have added some statistics tracking and a sanity check at the end of the run, so as to let users know if they've suffered read-throughs, etc. An upfront check could be added, based on the requested fragment mean and read length.
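For illustration, an upfront check of this kind might look like the sketch below. It is not the code from the commit. The function names, the `frag_sd` parameter, the 5% threshold, and the assumptions that fragment lengths are normally distributed and that the junction sits near the fragment midpoint are all hypothetical.

```python
from statistics import NormalDist


def read_through_fraction(frag_mean, frag_sd, read_len):
    """Estimate the fraction of fragments in which a read can reach the junction.

    Assumptions (not from meta-sweeper): fragment lengths are normally
    distributed and the junction sits at the fragment midpoint, so a read
    reaches it whenever frag_len / 2 < read_len, i.e. frag_len < 2 * read_len.
    """
    return NormalDist(mu=frag_mean, sigma=frag_sd).cdf(2 * read_len)


def upfront_check(frag_mean, frag_sd, read_len, max_fraction=0.05):
    """Return a warning string if read-through is expected to be common, else None."""
    frac = read_through_fraction(frag_mean, frag_sd, read_len)
    if frac > max_fraction:
        return ("WARNING: about {:.0%} of fragments are short enough for "
                "{}bp reads to cross the junction".format(frac, read_len))
    return None


if __name__ == "__main__":
    # Hypothetical settings: 450bp mean fragment, 50bp standard deviation.
    for read_len in (150, 250):
        print(read_len, upfront_check(450, 50, read_len))
```

With these made-up settings the 150bp run passes silently and the 250bp run is flagged before any reads are generated.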

koadman commented 7 years ago

looks like massive whitespace changes in that commit so i can't easily find the changes relevant to this issue. To make sure I understand, are you talking about fragments shorter than 250nt suffering readthrough into the adapter? or fragments between 250 & 500nt having overlapping reads? Or something else entirely?

cerebis commented 7 years ago

The insert size of the modelled ligation products.

I inadvertently increased the read length from 150bp to 250bp and didn't change the criteria governing the length of the ligation product. Since the products were around 450bp long, the junction point was only about 225bp from either end and thus within easy reach of both reads.
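The arithmetic can be spelled out with a small sketch. The helper name `bases_past_junction` is hypothetical and not part of simForward. It assumes the forward read starts at one end of the fragment and the reverse read at the other.

```python
def bases_past_junction(frag_len, junction_pos, read_len):
    """Count how many bases of each read lie beyond the ligation junction.

    The forward read covers fragment positions [0, read_len) and the reverse
    read covers [frag_len - read_len, frag_len), both capped at frag_len.
    junction_pos is the junction's distance from the forward end.
    """
    covered = min(read_len, frag_len)
    fwd = max(0, covered - junction_pos)
    rev = max(0, covered - (frag_len - junction_pos))
    return fwd, rev


if __name__ == "__main__":
    # 450bp ligation product with the junction at the midpoint (225bp).
    print(bases_past_junction(450, 225, 150))  # (0, 0): both reads stop 75bp short
    print(bases_past_junction(450, 225, 250))  # (25, 25): both reads are chimeric
```

At 150bp neither read reaches the junction, while at 250bp each read carries 25bp from the far side of it, which is the portion an aligner would have to clip.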

That then led to many chimeric reads and lots of clipping in BWA. Imposing a strict near-complete alignment criterion then meant many of them were rejected.

That make sense? Too much waffling on my part.

koadman commented 7 years ago

I see, so the current code is able to simulate a read through the ligation junction? (sorry just want to double-check since i can't see the code changes). That would be great if so -- there are definitely important test cases for how it impacts assembly and whether some clever analysis can fix it. It's not always easy to control the fragment size in the lab protocols.

cerebis commented 7 years ago

Ok, so this has been dealt with. Read-through events are tracked and simply reported at the end of the run. No special indications are made in the output file.

koadman commented 7 years ago

sounds good! perhaps a separate issue of very minor importance, but there is a different kind of readthrough we might consider: imagine the library prep creates a 150nt fragment, which we then sequence with 250nt reads. The first 150nt go through the fragment, but the next 65 or so go into the adapter sequence, and anything beyond that comes out as random gibberish or sometimes as a big string of A's.
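A rough sketch of how that kind of adapter read-through could be modelled, purely as an illustration. The function name, the placeholder adapter (an arbitrary 65nt string, not a real kit sequence), and the choice between poly-A and random filler are assumptions, not anything in meta-sweeper.

```python
import random

# Placeholder adapter: an arbitrary 65nt string, NOT a real adapter sequence.
PLACEHOLDER_ADAPTER = "ACGT" * 16 + "A"


def read_into_adapter(fragment, read_len, adapter=PLACEHOLDER_ADAPTER, rng=None):
    """Return a read of read_len bases taken from the start of fragment.

    If the fragment is shorter than the read, the read continues into the
    adapter; anything beyond the adapter is filled with poly-A (rng is None)
    or with random bases drawn from rng.
    """
    if len(fragment) >= read_len:
        return fragment[:read_len]
    read = fragment + adapter
    missing = read_len - len(read)
    if missing > 0:
        if rng is None:
            read += "A" * missing
        else:
            read += "".join(rng.choice("ACGT") for _ in range(missing))
    return read[:read_len]


if __name__ == "__main__":
    fragment = "G" * 150  # stand-in for a 150nt library fragment
    read = read_into_adapter(fragment, 250)
    # 150nt of fragment, 65nt of adapter, then 35nt of poly-A filler.
    print(len(read), read[150:215] == PLACEHOLDER_ADAPTER, read[215:] == "A" * 35)
    noisy = read_into_adapter(fragment, 250, rng=random.Random(1))
    print(len(noisy))
```

Passing an `rng` switches the tail from a run of A's to random bases, covering both outcomes described above.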