Please make the 50 kbp molecule inference parameter configurable

10XGenomics / lariat

Linked-Read Alignment Tool

https://support.10xgenomics.com/genome-exome/software/pipelines/latest/algorithms/overview

MIT License


Issue #5

Open sjackman opened 7 years ago

sjackman commented 7 years ago

if i == 0 || (i > 0 && position_list[i].pos-position_list[i-1].pos > 50000) {

https://github.com/10XGenomics/lariat/blob/fca47561ae43f47b0f71f98f7f17598d508af440/go/src/inference/lariat.go#L1367
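The quoted condition appears to start a new molecule whenever two consecutive read positions are more than 50,000 bp apart. A minimal sketch of what a configurable threshold could look like follows. It is not Lariat's code: `splitMolecules`, the plain `[]int` positions, and the `max_molecule_gap` flag are hypothetical names chosen for illustration.

```go
package main

import (
	"flag"
	"fmt"
)

// splitMolecules groups sorted read positions into molecules. A new molecule
// starts at the first read and whenever the gap to the previous read exceeds
// maxGap. Hypothetical helper, not part of Lariat.
func splitMolecules(positions []int, maxGap int) [][]int {
	var molecules [][]int
	for i, pos := range positions {
		if i == 0 || pos-positions[i-1] > maxGap {
			molecules = append(molecules, nil)
		}
		last := len(molecules) - 1
		molecules[last] = append(molecules[last], pos)
	}
	return molecules
}

func main() {
	// Assumed flag name; the default keeps the current hard-coded 50 kbp.
	maxGap := flag.Int("max_molecule_gap", 50000,
		"maximum distance in bp between adjacent reads of one molecule")
	flag.Parse()

	// Illustrative positions with a 30 kbp gap between 79,000 and 109,000.
	positions := []int{70000, 75000, 79000, 109000, 112000}
	for _, m := range splitMolecules(positions, *maxGap) {
		fmt.Println(m)
	}
}
```

With the default of 50000 the five example positions stay in one molecule. Running with `-max_molecule_gap=20000` splits them at the 30 kbp gap into two molecules.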

sjackman commented 7 years ago

[Image: psitchensiscpmt_2 scaffold4 spanners bam]

The 8 inferred molecules shown in this IGV screenshot should in fact each be two separate molecules. There is a misassembly, shown in red, at 109,000 bp on scaffold 4. Because of the misassembly, the regions to the left and right of the red bar are not actually proximal, so no molecule should span it. These 8 molecules incorrectly support the misassembly, which makes it harder to detect.

For each molecule, there is a gap of about 30 kbp between the two reads on either side of the misassembly. The non-uniform read density across each molecule is an indication that it should in fact be two molecules.

sjackman commented 7 years ago

[Image: psitchensiscpmt_2 scaffold4 spanners bam 2]

The six reads aligned at 85 kbp map to a region of 16 consecutive C nucleotides, CCCCCCCCCCCCCCCC, with soft clipping on either side of the homopolymer run. The mapping quality of these six reads is 60, which is unexpectedly high, because a second scaffold also contains the sequence CCCCCCCCCCCCCCCC. What mapping quality does Lariat assign to a read that maps ambiguously without its barcode, but is placed uniquely using its barcode? All six reads have poor alignment scores of AS:f < -140, so I'll filter them out based on alignment score.

sjackman commented 7 years ago

On further inspection, I've discovered that only 1 of the 8 cases is two separate molecules that are 30 kbp apart. The other 7 cases are each a single read rescued by Lariat and incorrectly mapped somewhere within 50 kbp of the end of the molecule, extending the molecule by up to 50 kbp in that direction. These misaligned reads are fairly easily filtered out by their poor alignment scores (in my case, 5 alignments with AS:f around -140, one at -46.5, and one at -30).
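The alignment-score filter described above could be sketched roughly as follows. The `alignment` struct, the `filterByScore` helper, and the cutoff of -25 are assumptions for illustration only. A real pipeline would read the AS:f tag from the BAM records, and the appropriate cutoff depends on the data.

```go
package main

import "fmt"

// alignment is a hypothetical, simplified stand-in for a BAM record.
type alignment struct {
	name  string
	pos   int
	score float64 // value of the AS:f tag
}

// filterByScore keeps only alignments whose score is at least minScore.
func filterByScore(alns []alignment, minScore float64) []alignment {
	var kept []alignment
	for _, a := range alns {
		if a.score >= minScore {
			kept = append(kept, a)
		}
	}
	return kept
}

func main() {
	// Illustrative records; the three poor scores mirror the values
	// reported in this comment, the fourth is made up as a good alignment.
	alns := []alignment{
		{"read1", 85000, -140.0},
		{"read2", 86000, -46.5},
		{"read3", 87000, -30.0},
		{"read4", 110000, -5.0},
	}
	// Assumed cutoff, chosen only so that the poor alignments are dropped.
	for _, a := range filterByScore(alns, -25.0) {
		fmt.Println(a.name, a.pos, a.score)
	}
}
```

Applying the filter before molecule inference, rather than after, would keep a rescued but misaligned read from extending a molecule in the first place.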