Open sjackman opened 7 years ago
The 8 inferred molecules shown in this IGV screenshot should in fact each be two separate molecules. There is a misassembly, shown in red, at 109,000 bp (scaffold 4). The regions to the left and to the right of the red bar are not in fact proximal, so no molecule should span the misassembly. These 8 molecules incorrectly support the misassembly, which makes it harder to detect.
There is about a 30 kbp gap between the two reads on either side of the misassembly for each molecule. The non-uniform density of the reads across each molecule is an indication that each molecule should in fact be two molecules.
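To illustrate the gap-based reasoning above, here is a minimal sketch that splits the reads of one barcode on one scaffold into separate molecules wherever consecutive reads are further apart than a chosen distance. This is not how Lariat or Long Ranger infers molecules; the function name `split_molecule` and the `max_gap` parameter are hypothetical, and the 25 kbp value in the usage example is only an assumption chosen to sit below the roughly 30 kbp gap described here.

```python
def split_molecule(read_starts, max_gap):
    """Group read start positions into molecules.

    A new molecule is started whenever the distance to the previous
    read exceeds max_gap (hypothetical threshold, in bp).
    """
    molecules = []
    for pos in sorted(read_starts):
        if molecules and pos - molecules[-1][-1] <= max_gap:
            # Close enough to the previous read: same molecule.
            molecules[-1].append(pos)
        else:
            # First read, or gap too large: start a new molecule.
            molecules.append([pos])
    return molecules


if __name__ == "__main__":
    # Illustrative positions only, not taken from the real data.
    starts = [80_000, 82_000, 85_000, 115_000, 118_000]
    for mol in split_molecule(starts, max_gap=25_000):
        print(mol[0], mol[-1], len(mol))
```

A single fixed threshold is the simplest choice. A density-aware approach would need the per-molecule read spacing, which is not shown in this thread.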
The six reads aligned at 85 kbp map to a region of 16 consecutive C nucleotides `CCCCCCCCCCCCCCCC`, with soft clipping at either side of the homopolymer run. The mapping quality of these six reads is 60, which is unexpectedly high, as there is a second scaffold that also contains the sequence `CCCCCCCCCCCCCCCC`. What mapping quality does Lariat assign to a read that maps ambiguously without its barcode, but is placed uniquely using its barcode? All six reads have poor alignment scores of `AS:f` < -140, so I'll filter them out based on alignment score.
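A filter on alignment score could look roughly like the following sketch, which reads SAM text records and drops those whose `AS` tag is below a cutoff. The function names `alignment_score` and `filter_by_score` and the `min_score` parameter are hypothetical. The cutoff of -140 in the usage example is taken from the scores quoted here and would need tuning for other data. Records without an `AS` tag are kept, which is an assumption rather than anything Lariat prescribes.

```python
def alignment_score(sam_line):
    """Return the AS tag of a SAM record as a float, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional TAG:TYPE:VALUE fields start at the 12th column.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[0] == "AS":
            return float(parts[2])
    return None


def filter_by_score(sam_lines, min_score):
    """Yield header lines and records whose AS is at least min_score."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # pass header lines through unchanged
            continue
        score = alignment_score(line)
        # Assumption: keep records that carry no AS tag at all.
        if score is None or score >= min_score:
            yield line


if __name__ == "__main__":
    import sys
    # Usage: samtools view -h in.bam | python filter_as.py | samtools view -b -o out.bam
    for kept in filter_by_score(sys.stdin, min_score=-140.0):
        sys.stdout.write(kept)
```

Parsing the text form keeps the sketch free of dependencies; it handles both `AS:f` (float, as Lariat writes it here) and `AS:i` (integer) tags because the value is converted with `float`.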
On further inspection, I've discovered that only 1 of the 8 cases is two separate molecules that are 30 kbp apart. The other 7 cases are a single read being rescued by Lariat and incorrectly mapped somewhere within 50 kbp of the end of the molecule, extending the molecule by up to 50 kbp in that direction. These misaligned reads are fairly easily filtered out by their poor alignment score (in my case, 5 alignments with `AS:f` around -140, one at -46.5, and one at -30).
https://github.com/10XGenomics/lariat/blob/fca47561ae43f47b0f71f98f7f17598d508af440/go/src/inference/lariat.go#L1367