eldariont / svim

Structural Variant Identification Method using Long Reads
GNU General Public License v3.0
149 stars 19 forks source link

[BUG] Unsorted positions on sequence #61

Open Irallia opened 2 years ago

Irallia commented 2 years ago

Bug description

Running SVIM on some long read PacBio bam files, I get an error from tabix, that the resulting sequences are unsorted.

[E::hts_idx_push] Unsorted positions on sequence #1: 2345914 followed by 2345913
tbx_index_build failed: MtSinai_PacBio/variants.vcf.gz

How to reproduce:

Running SVIM on MtSinai PacBio data BAM: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/PacBio_minimap2_bam/HG002_PacBio_GRCh38.bam Reference genome (mentioned in readme): GRCh38/hg38 - ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

SVIM call:

svim alignment --sample HG002 --partition_max_distance 1000 --cluster_max_distance 0.5 --min_sv_size 40 --segment_gap_tolerance 20 --segment_overlap_tolerance 20 --read_names --max_sv_size 1000000 MtSinai_PacBio/ ../data/long_reads/HG002_PacBio_GRCh38.bam ../data/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna &>> 2022_01_20_svim_output.MtSinai_PacBio.log

bgzip call:

bgzip -c MtSinai_PacBio/variants.vcf > MtSinai_PacBio/variants.vcf.gz

tabix call:

tabix -p vcf MtSinai_PacBio/variants.vcf.gz

resulting error:

[E::hts_idx_push] Unsorted positions on sequence #1: 2345914 followed by 2345913
tbx_index_build failed: MtSinai_PacBio/variants.vcf.gz

Grep positions in vcf

$ less MtSinai_PacBio/variants.vcf | grep -n -sw 2345914
5386:chr1       2345914 svim.BND.216    N       ]chr1:113504903]N       1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=m141223_012647_42163R_c100748432550000001823169907081526_s1_p0/57386/7145_37078     GT:DP:AD        ./.:.:.,.
277551:chr1     113504903       svim.BND.14706  N       N[chr1:2345914[ 1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=m141223_012647_42163R_c100748432550000001823169907081526_s1_p0/57386/7145_37078     GT:DP:AD        ./.:.:.,.
2545447:chr6    5537590 svim.INS.2345914        C       CGGAAGGCTCCTCTTCGATGGGTTGGGAGATGGGGTGTCCGGAACCTGGGATCACTTACACTCTCGTAGTTGGAGATGTCCA      1       PASSSVTYPE=INS;END=5537590;SVLEN=81;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=m150212_124930_42177R_c100777662550000001823160908051502_s1_p0/46250/4456_23453     GT:DP:AD     ./.:.:.,.

$ less MtSinai_PacBio/variants.vcf | grep -n -sw 2345913
5387:chr1       2345913 svim.INS.4708   G       GGCGGATGATGTATATATGATGGCTTGATGGTCGTTCGCTGCTGTGCTAGGCACAGCATAGAGGCTCAGCATCGAGGCTATATCTCA 1       PASS    SVTYPE=INS;END=2345913;SVLEN=86;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=m150210_124418_42163R_c100779642550000001823165208251543_s1_p0/129717/19664_36011       GT:DP:AD     ./.:.:.,.
2545446:chr6    5537511 svim.INS.2345913        C       CCTCATACCCCTCCCCTCCCCCCCCCTCCCCATCTCTGAGTTGGAGATGCTCCCGCCCGGGCCTCCTCGTCGAGTTGGAGAAGCTGTACCCGCGG 1   PASS     SVTYPE=INS;END=5537511;SVLEN=94;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=m150130_022735_42156_c100779802550000001823165208251520_s1_p0/101304/0_3889    GT:DP:AD ./.:.:.,.

-n: with line numbers, -sw: only whole words

Irallia commented 2 years ago

For another dataset:

$ svim alignment --sample HG002 --partition_max_distance 1000 --cluster_max_distance 0.5 --min_sv_size 40 --segment_gap_tolerance 20 --segment_overlap_tolerance 20 --read_names --max_sv_size 1000000 10X_Genomics/ ../data/long_reads/NA24385_phas5ed_possorted_bam.bam ../data/reference/hg19.reordered.fa &>> 2022_01_20_svim_output.10X_Genomics.log
$ bgzip -c 10X_Genomics/variants.vcf > 10X_Genomics/variants.vcf.gz
$ tabix -p vcf 10X_Genomics/variants.vcf.gz
[E::hts_idx_push] Unsorted positions on sequence #1: 1453683 followed by 1453682
tbx_index_build failed: 10X_Genomics/variants.vcf.gz

And grep gives me:

$ less 10X_Genomics/variants.vcf | grep -n -sw 1453683
1403:chr1       1453683 svim.BND.671    N       N]chr2:195764355]       1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=HISEQ-002:393:C7A29ANXX:5:1107:4573:15814   GT:DP:AD        ./.:.:.,.
1286653:chr2    195764355       svim.BND.725457 N       N]chr1:1453683] 1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=HISEQ-002:393:C7A29ANXX:5:1107:4573:15814   GT:DP:AD        ./.:.:.,.
2566115:chr4    180906971       svim.BND.1453683        N       ]chr13:78951248]N       1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=HISEQ-002:393:C7A29ANXX:4:1209:12410:91014  GT:DP:AD        ./.:.:.,. 
5882736:chr12   58346928        svim.INV.1453683        AGT...ACT 0       PASS    SVTYPE=INV;END=58524222;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=HISEQ-002:393:C7A29ANXX:5:2213:9562:27964    GT:DP:AD        ./.:.:.,.

$ less 10X_Genomics/variants.vcf | grep -n -sw 1453682
1404:chr1       1453682 svim.DEL.80     ACA...ACA   A       1    PASS     SVTYPE=DEL;END=1454098;SVLEN=-416;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=HISEQ-002:393:C7A29ANXX:2:1308:14440:38213       GT:DP:AD        ./.:.:.,.
2566114:chr4    180906772       svim.BND.1453682        N       N]chr2:105418649]       1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=HISEQ-002:393:C7A29ANXX:4:1206:7043:15405    GT:DP:AD        ./.:.:.,.
5882734:chr12   58346909        svim.INV.1453682        GGG...CCC      0       PASS    SVTYPE=INV;END=58347496;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=HISEQ-002:393:C7A29ANXX:4:1316:12304:42436       GT:DP:AD        ./.:.:.,.
Irallia commented 2 years ago

I tried to sort the file with picard:

$ picard SortVcf -I MtSinai_PacBio/variants.vcf -O MtSinai_PacBio_variants.vcf -Xms1g -Xmx100g --TMP_DIR ../tmp/picard/ &>> 2022-01-26_picard_output.MtSinai_PacBio.log

but this results in an error aswell:

Runtime.totalMemory()=5727846400
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 819567: unparsable vcf record with allele TGA...CCT, for input source: file:///.../MtSinai_PacBio/variants.vcf
        at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887)
        at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:678)
        at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:640)
        at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:443)
        at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384)
        at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
        at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
        at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:375)
        at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:354)
        at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:315)
        at picard.vcf.SortVcf.sortInputs(SortVcf.java:164)
        at picard.vcf.SortVcf.doWork(SortVcf.java:98)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

here is the variant:

$ sed '819567q;d' MtSinai_PacBio/variants.vcf
chr2    108832254   svim.DEL.17471  TGA...CCT   T   1   PASS    SVTYPE=DEL;END=109522633;SVLEN=-690379;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=m150131_015113_42163R_c100780292550000001823166508251570_s1_p0/103911/7208_10342    GT:DP:AD    ./.:.:.,.

This could be another problem, as picard runs through for the 10X Genomics dataset.