Insertions and duplications

ggstatgen commented 5 years ago

Hi

I was hoping I could get some help with the interpretation of the results in the vcf file. I assume SMRTsv2 vcf follow the specs of the official VCF 4.2 document, however I find that document quite terse so I thought I'd double check with you.

1) How to interpret INS calls?

My understanding is that POS indicates the reference coordinate where the INS fragment inserts. SEQ is the sequence of the INS, which is SVLEN long. Is this right? Am I correct in that SMRT-sv does not attempt reporting the origin of the inserted sequence? I blasted the fasta sequence in one of the INS calls I got, and this originated in one of the extra-chromosomal contigs packaged with hg38, However I could not find this info explicitly mentioned in the vcf. Is this because the origin is not always clear? Finally, and if the above is true, what is the field END indicating in case of INS calls? In your calls from the Audano et al paper, the bed file does not include this field. Instead, it seems to treat insertions as [start-1,start] 0nt events.

2) Duplications. SMRT-sv2 does not explicitly call dups. Are these implicitly included in the INS set?

Thanks so much for your help!

paudano commented 5 years ago

First, I apologize for taking so long to respond. My SPAM filters decided that GitHub notifications were junk.

Yes, POS is the reference coordinates where the insertions occurs. For symbolic SVs (those where ALT is "<INS>" or "<DEL>"), the position is the base before the insertion, so it really inserts at POS (i.e. the reference base at POS is the first base shifted downstream by the insertion).

SMRT-SV does not report the origin of the inserted sequence. This is far too complicated of a problem for a variant caller, and if automated, should be part of a separate pipeline. What we did was map inserted sequences back to the reference to see where thy came from (we used BLASR instead of BLAST, but we are switching to minimap2). For transposable elements, you may find candidate donor sites doing this, but you'd have to search SV insertions as well since it could come from a segment of DNA not in the reference. Many of the remaining SVs will be tandem duplications.

It is not surprising that you found insertion donors from unplaced and unlocalized reference contigs (chrUn... and chr*_random). Those are rich in segmental duplications. So you either found a duplication that was real, you anchored an unplaced/unlocalized contig in a chromosome, or more likely, it's a mapping error mediated by segmental duplications. I assume the location of the insertion is in a high-identity segmental duplication. SMRT-SV relies on read mapping, and it's only as confident at the loci if aligners can properly assign reads.

For insertions, you can ignore the END field. For deletions, the end field is the end of the deletion (which could also be found by adding SVLEN to POS). BED files give reference bases, so POS and END for an insertion is just the point of insertion.

Duplications are not called by SMRT-SV, but you can use the alignments SMRT-SV produces to find them. I use "bedtools makewindows" to tile a uniform sized window over the reference, then I pull reads from each window and count the depth. From that, you can get the mean window size (remove 0 depth windows to avoid biases from centromeres and unmappable regions), and find contiguous windows with 2.8+ to 3+ s.d. from the mean.

With long reads, what is interpreted as a duplication and an insertion will be different from short reads. For example, if you map insertions back to the reference, you should find that many of them intersect DUP calls from an Illumina caller, but you are able to sequence-resolve them with long reads. If you call duplications from long-read alignments, what you will find are events that are so large that long reads do not uniquely map.

Let me know if I can help you further with these questions.

ggstatgen commented 5 years ago

Hi Peter

Thanks so much for the extended reply and sorry for the delay in answering - I was away for a while.

One follow up question if I may re the DUP calling algorithm you suggest - what coverage have you used in your tests when using this strategy? Would you say it would work with good sensitivity only with coverage in the 50X-100X range you have in the 15 samples of your paper or have you had experience with successfully calling dups from say 20-30X?

paudano commented 5 years ago

I expect 30x would work, but allelic variation will make it less reliable below 30x. You might get some descent results with 20x, but with more noise. I don't yet have an analysis to share on lower coverage data. Test it and see if the results are similar, i.e. almost all large duplicates should intersect high-identity segmental duplications (Fig. S3C in the Cell paper). DUPs were compared UCSC " Segmental Dups" track, and I used field "fracMatchIndel" to determine SD identity.

Note: Some of the shorter reads will map to transposable elements and create apparent duplications as well, but larger reads containing the same repeat may be anchored elsewhere. So, you could end up with two representations of these (one as a DUP and one as an INS), so you would need to be careful about how small DUPs over these repeats were interpreted.

EichlerLab / smrtsv2

Insertions and duplications #36