EichlerLab / smrtsv2

Structural variant caller
MIT License

Extract alt allele sequence #22

Closed rmccoy7541 closed 5 years ago

rmccoy7541 commented 5 years ago

Hello, I am wondering whether you have any advice on how to extract the sequence of the alternative allele (rather than the symbolic <INS>) using the information in the VCFs output by smrtsv2. For example, take the first VCF entry of your recent long-read sequencing study:

chr1 59599 NA19434_chr1-59599-INS-308 A <INS> 5 . SVTYPE=INS;SVLEN=308;END=59599;MERGE_SOURCE=NA19434;MERGE_SAMPLES=NA19434;MERGE_AC=1;MERGE_AF=0.07;MERGE_VARIANTS=NA19434_chr1-59599-INS-308;MERGE_VARIANTS_RO=1.00;CONTIG_SUPPORT=3;CONTIG_DEPTH=7;CONTIG=NA19434_chr1-20000-80000-ctg7180000000004;CONTIG_START=3817;CONTIG_END=4125;REPEAT_TYPE=AluY_simple;BKPTID=NA19434_chr1-59599-INS-308;PUBLISHED_ID=NA19434_chr1-59599-INS-308

I tried extracting the inserted sequence using samtools faidx GCA_003709735.1_NA19434_EEE_SV-Pop.1_genomic.fna QVRF01000001.1:3817-4125

However, aligning this with the sequence flanking the insertion suggests that the insertion point doesn't line up perfectly:

CLUSTAL O(1.2.4) multiple sequence alignment
chr1:58799-60399              -------------------TGGTCTTTTCCTCCAGACAAGCTCCTTTGGGTCATCAGGAT  41
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      GGCAGAGAAATCAAACACATGGTCTTTTCCTCTAGACAAGCTCCTTTGGGTCATCAGGAT  60

chr1:58799-60399              TTCTTCAACAATAAAATGTAATAATTCCAAATGTTTGTAACAGAATGGGTAGGACTTTCT  101
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      TTCTTcaacaataaaatgtaataattccAAATGCTTGTAACAGAATGGGTAGGACTTTCT  120

chr1:58799-60399              TCACTTATTTAAATACTCCCTTTTTTATGCAACTGAGTTTTCATCAACAAGTACAAGCTT  161
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      TCACTTATTTAAATACTCCCTTTTTTATGCAACTGAGTTTTCATCAACAAGTACAAGCTT  180

chr1:58799-60399              GTGAAGGAGTACTTTAAAATGCAATTTCTCTCTATTTTTGTGGGGGCTAATATTTTATTT  221
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      GTGAAGGAGTActttaaaatgcaatttctcTCTATTTT-GTGggggctaatattttattt  239

chr1:58799-60399              CTCATATTGACAATTTATTATGCTGTTTTT-AAAAAGttcattcatcaagtatttcttga  280
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      ctcatattgaCAATTTATTATGCTGTTTTTAGAAAAGTTCATTCATCAAGTATTTCTTGA  299

chr1:58799-60399              gctttttctatgagacaggcactgttttaggcaagtaattatgcactgaacaatgcaaaa  340
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      GCTTTTTCTATGAGacaggcactgttttaggcaaGTAATTATGCACTGAACAATGCAAAA  359

chr1:58799-60399              agtttccctgcactcatggactttaattttacatttatgaaaagctacaaatattagaat  400
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      AGTTTCCCTGCACTCATGgactttaattttacatttatgaaaAGCTACAAATATTAGAAT  419

chr1:58799-60399              aagtaaaataCTGCCTGGAGGCTAAAGCATATTTTGATCACTTATTCCCTAATTCTTTTA  460
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      AAGTAAAATACTGCCTGGAGGCTAAAGCATATTTTGATCACTTATTCCCTAATTCTTTTC  479

chr1:58799-60399              GAAGAGAACTCACCTGTCGGTTAGCTGAACCACTGCCAGTGATATCCAACTATACATTCA  520
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      GAAGAGAACTCACCTGTCGGTTAGCTGAACCACTGCCAGTGATATCCAACTATACATTCA  539

chr1:58799-60399              ATCCCACCATACCTCATTATCACACCTATTCACTCACAAGCTTAAACTCTTAACTTTTCT  580
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      ATCCCACCATACCTCATTATCACACCTATTCACTCACAAGCTTAAACTCTTAACTTTTCT  599

chr1:58799-60399              CCACATATCAGTGACTATTTCCTACAGCTTTTCTTTTACTTTCCATGTTTGCAGTGACAA  640
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      CCACATATCAGTGACTATTTCCTACAGCTTT-CTTTTACTTTCCATGTTTGCAGTGACAA  658

chr1:58799-60399              TATACATAAACAGTGTATGAAAACTCAAGTAAAATCTACTCTCTCAGGTGTTCATAATGT  700
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      TATACATAAACAGTGTATGAAAACTCAAGTAAAATCTACTCTCTCAGGTGTTCATAATGC  718

chr1:58799-60399              ATCAATGTATATTGCTTTAAGCCTGAAGGTAACCTAAGTAAAGATGTACCATGTTCCACC  760
QVRF01000001.1:3817-4125      ------------------------------------------------------------  0
QVRF01000001.1:3000-5000      ATCAATGTATATTGCTTTAAGCCTGAAGGTAACCTAAGTAAAGATGTACCATGTTCCACC  778

chr1:58799-60399              AATGCTTCTTTTGATCATCATTTTATCCTGTTTTTTCTTTAGGATTCTT-----------  809
QVRF01000001.1:3817-4125      ---------------------------------------taggattctttcttttttttt  21
QVRF01000001.1:3000-5000      AATGCTTCTTTTGATCATCATTTTATCCTGTTTTTtctttaggattctttcttttttttt  838
                                                                     **********           
chr1:58799-60399              ------------------------------------------------------------  809
QVRF01000001.1:3817-4125      ttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagcggcgcgat  81
QVRF01000001.1:3000-5000      ttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagcggcgcgat  898

chr1:58799-60399              ------------------------------------------------------------  809
QVRF01000001.1:3817-4125      ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctccca  141
QVRF01000001.1:3000-5000      ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctccca  958

chr1:58799-60399              ------------------------------------------------------------  809
QVRF01000001.1:3817-4125      agtagctgggactacaggcgccgccactacgcccggctaattttttgtttttagtagaga  201
QVRF01000001.1:3000-5000      agtagctgggactacaggcgccgccactacgcccggctaattttttgtttttagtagaga  1018

chr1:58799-60399              ------------------------------------------------------------  809
QVRF01000001.1:3817-4125      cggggtttcaccgttttagccgggatggtctcgatctcctgacttcgtgatcctcccgcc  261
QVRF01000001.1:3000-5000      cggggtttcaccgttttagccgggatggtctcgatctcctgacttcgtgatcctcccgcc  1078

chr1:58799-60399              ---------------------------------------------------------TCT  812
QVRF01000001.1:3817-4125      tcggctccaaagtgctgggattacaggcgtgagccaccgcgcccggcc------------  309
QVRF01000001.1:3000-5000      tcggctccaaagtgctgggattacaggcgtgagccaccgcgcccggccaggattcTTTCT  1138

chr1:58799-60399              TATTCCTTCCCCTGACCCTTCTTTTATTCTCCAAATTTCTTTCCAATTCATCTTTGTTCT  872
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      TATTCCTTCCCCTGACCCTTCTTTTATTCTCCAAATTTCTTTCCAATTCATCtttgttct  1198

chr1:58799-60399              TCCCTTTCCTTTTTACTCTCTTTAAACATTCTATGGACTCTGCCTCCTTCACACTGATAT  932
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      tccctttcctttttactCTCTTTAAACATTCTATGGACTCTGCCTCCTTCACACTGATAT  1258

chr1:58799-60399              TGAACGCCCATAGTTTCATATTTTGGATTGCGATTGTTTTATTTTAAAATGGCAAATGTT  992
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      TGAACGCCCATAGTTTCATATTTTGGATTgcgattgttttattttaaaatggcaaatgtt  1318

chr1:58799-60399              CATGTTATAAAGAGAATTTTTCAGTCTTTAGACTAATAGGTTCATGTAGTTTGGGATTTT  1052
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      CATGTTATAAAGAGAATTTTTCAGTCTTTAGACTAATAGGTTCATGTAGTTTGGGATTTT  1378

chr1:58799-60399              CCTCTTTAAGAAAATTAATTATCACTCACACTCCAAGACAAACACCATTTCAGTAGCAAT  1112
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      CctctttaagaaaattaattatcaCTCACACTCCAAGACAAACACCATTTCAGTAGCAAT  1438

chr1:58799-60399              ATGAATTTCAGTAGTAATAGGAATCTCCAAATATGACAAAGTAATTCAGACATTAATTGC  1172
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      ATGAATTTCAGTAGTAATAGGAATCTCCAAATATGACAAAGTAATTCAGACAttaattgc  1498

chr1:58799-60399              TTTTGTTTTGGAATTGCTCTTATAAGATGAAATATCACTTTCATGATGAGAGTCCTAGAG  1232
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      ttttgttttggaATTGCTCTTATAAGATGAAATATCACTTTCATGATGAGAGTCCTAGAG  1558

chr1:58799-60399              TGCTTGGTTTATATATTGTATCTTAGTTTTAACAGGATAAAACACTTGATCCTAAGCAGT  1292
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      TGCTTGGTTTATATATTGTATCTTAGTTTTAACAGGATAAAACACTTGATCCTAAGCAGT  1618

chr1:58799-60399              AAACATGATTCTTCAGCTTCAACTTCATTTCTTTATAAATAACTATTTATGAATTGGTGT  1352
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      AAACATGATTCTTCAGCTTcaacttcatttctttataaataactatTTATGAATTGGTGT  1678

chr1:58799-60399              TGAGCTTAGTAAGTCACCAAACACCTTCTGCTCAGCAGCATAAAGGACATTTCCATGAAA  1412
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      TGAGCTTAGTAAGTCACCAAACACCTTCTGCTCAGCAGCATAAAGGACATTTCCATGAAA  1738

chr1:58799-60399              CCTCCCAGGGATAATCTTATTTACTCTATAATGTTTCCCGGGTTCAATTCCTCTCCCAAA  1472
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      CCTCCCAGGGATAATCTTA-TTACTCTATAATGTTTCCCGGGTTCAATTCCTCTCCCAAA  1797

chr1:58799-60399              ATTCTTTGTTCTTAAGCCCCTATGATCTGGGTGATCTAAATATGGGTAAGAAGTCCAGGG  1532
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      ATTCTTTGTTCTTAAGCCCCTATGATCTGGGTGATCTAAATATGGGTAAGAAGTCCAGGG  1857

chr1:58799-60399              ATAGCACTATGAATGAAGTGAAAATAGTAAAACATAGTTAAAAATGTAcagatgctctct  1592
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      ATAGCACTATGAATGAAGTGAAAATAGTAAAACATAGTTAAAAATGTACAGATGCTCTCT  1917

chr1:58799-60399              gacttataa---------------------------------------------------  1601
QVRF01000001.1:3817-4125      ------------------------------------------------------------  309
QVRF01000001.1:3000-5000      GACTTATAATAGGGTTACGTCCTGATAAATccatcataagtcaaaaatgcatttaatatt  1977

chr1:58799-60399              ------------------------  1601
QVRF01000001.1:3817-4125      ------------------------  309
QVRF01000001.1:3000-5000      ccTAATGTACCTCACATCATAGTT  2001
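A quick check on the coordinates used in the samtools faidx call above: samtools regions are 1-based and inclusive, so 3817-4125 spans 309 bases, one more than SVLEN=308 (the alignment also ends at 309). The sketch below only does that arithmetic. It assumes, without confirmation from the smrtsv2 documentation, that CONTIG_START/CONTIG_END are 0-based, half-open coordinates; the helper names are hypothetical.

```python
def faidx_region(contig, start0, end0):
    """Convert a 0-based, half-open interval to a 1-based, inclusive region string.

    ASSUMPTION: the input coordinates are 0-based, half-open (BED-like).
    """
    return f"{contig}:{start0 + 1}-{end0}"


def region_length(region):
    """Number of bases covered by a 1-based, inclusive 'contig:start-end' region."""
    coords = region.rsplit(":", 1)[1]
    start, end = (int(x) for x in coords.split("-"))
    return end - start + 1


as_used = "QVRF01000001.1:3817-4125"                  # region from the post
shifted = faidx_region("QVRF01000001.1", 3817, 4125)  # under the 0-based assumption

print(as_used, region_length(as_used))  # QVRF01000001.1:3817-4125 309
print(shifted, region_length(shifted))  # QVRF01000001.1:3818-4125 308
```

If the assumption holds, the shifted region matches SVLEN, but a one-base shift alone would not explain every difference in a repeat-rich alignment like this one.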

Ultimately, I am hoping to be able to format the inserted sequence (for all insertions) as an alternative allele to be used as input to programs like BayesTyper. Thanks for your help!

paudano commented 5 years ago

The sequence is on the EBI FTP: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20181025_EEE_SV-Pop_1/VariantCalls_EEE_SV-Pop_1/EEE_SV-Pop_1.ALL.sites.20181204.bed.gz

That BED file contains one record per SV, and the SEQ column is the SV sequence (inserted, deleted, or inverted bases).

This SV is an AluY, so the inserted sequence will align to any number of other regions within these contigs and the human reference.
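For the stated goal of building explicit alternative alleles, a minimal sketch of turning the SEQ column into a VCF-style REF/ALT pair might look like the following. Only the SEQ column is confirmed above; the other column names (#CHROM, POS, ID, SVTYPE), the tab-delimited header line, and the helper names are assumptions for illustration. The anchor base must come from the reference genome and is hard-coded here.

```python
import gzip
import io


def open_bed(path):
    """Open a plain or gzip-compressed BED file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_sv_bed(handle):
    """Yield one dict per record, keyed by the names in the first (header) line."""
    header = None
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if header is None:
            header = fields
            continue
        yield dict(zip(header, fields))


def insertion_alt(ref_base, inserted_seq):
    """Return (REF, ALT) for an insertion anchored on the preceding reference base."""
    ref_base = ref_base.upper()
    return ref_base, ref_base + inserted_seq.upper()


# Tiny in-memory stand-in for the real file; column names other than SEQ are assumed.
example = io.StringIO(
    "#CHROM\tPOS\tEND\tID\tSVTYPE\tSVLEN\tSEQ\n"
    "chr1\t59599\t59600\texample-INS-4\tINS\t4\tacgt\n"
    "chr1\t70000\t70004\texample-DEL-4\tDEL\t4\tTTTT\n"
)

insertions = [rec for rec in read_sv_bed(example) if rec["SVTYPE"] == "INS"]
ref, alt = insertion_alt("A", insertions[0]["SEQ"])
print(insertions[0]["ID"], ref, alt)  # example-INS-4 A AACGT
```

With the real file, `open_bed("EEE_SV-Pop_1.ALL.sites.20181204.bed.gz")` would replace the in-memory example, and the header should be inspected first to confirm the actual column names and coordinate convention.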