bioinformatics-centre / BayesTyper

A method for variant graph genotyping based on exact alignment of k-mers

terminate called after throwing an instance of 'std::out_of_range' #30

Closed by Parsoa 4 years ago

Parsoa commented 4 years ago

I get the following error after running BayesTyper on a VCF file generated by Paragraph:

bayesTyper cluster -v /share/hormozdiarilab/Codes/data/HGSV/Unified/HG00514_HG00733.merged_nonredundant.unified.paragraph.sorted.vcf.gz -s samples.tsv -g /share/Data/ReferenceGenomes/Hg38/hg38.fa -p 16

[08/05/2020 12:22:37] You are using BayesTyper (v1.5)

[08/05/2020 12:22:37] Seeding pseudo-random number generator with 1588965757 ...
[08/05/2020 12:22:37] Setting the kmer size to 55 ...

[08/05/2020 12:22:37] Parsed information for 1 sample(s)

[08/05/2020 12:22:37] Parsing reference genome ...
[08/05/2020 12:22:48] Parsed 58 reference genome chromosomes(s) (2638915210 nucleotides)

[08/05/2020 12:22:48] Parsing decoy sequence(s) ...
[08/05/2020 12:22:48] Parsed 0 decoy sequence(s) (0 nucleotides)

[08/05/2020 12:22:59] Setting the number of inference units to 1 across 11882 variants ...

[08/05/2020 12:23:13] Maximum resident set size: 11.6129 Gb

[08/05/2020 12:23:13] Parsing variants in unit 1 ...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 40917102) > this->size() (which is 40799484)
[1]    30690 abort       cluster -v  -s samples.tsv -g  -p 16

I tried looking for variants at or near position 40917102 on any chromosome, and the only one I found is a deletion on chr8 at 40917103:

chr8    40917103        DEL00004034;DEL00004141 G       <DEL>   30      .       END=40922258;SVTYPE=DEL;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60     GT      ./.     ./.

These coordinates fall within the boundaries of the chromosome, so I don't see why I'm getting this error.

jonassibbesen commented 4 years ago

Thank you for writing. Yes, this error suggests that the coordinates lie beyond the end of the chromosome. Is this human data? If so, there might be something wrong with either your reference or how BayesTyper parses it. The BayesTyper log says that it parsed only 2638915210 nt, but we would expect more from a human genome. To rule out the reference, could you run samtools faidx on the genome and check the resulting index to confirm the chromosome lengths?
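For readers who want to automate that check, here is a minimal sketch in Python that compares VCF coordinates against the lengths recorded in a `.fai` index (tab-separated, with the sequence name in column 1 and its length in column 2). The function names `read_fai_lengths` and `find_out_of_range` are hypothetical helpers written for this illustration; they are not part of BayesTyper or samtools. The sketch also treats an `END=` entry in the INFO column as the last base of the variant, which is an assumption that fits symbolic deletions like the one quoted above.

```python
def read_fai_lengths(fai_lines):
    """Map sequence name -> length from the lines of a .fai index."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def find_out_of_range(vcf_lines, lengths):
    """Return (chrom, pos, reason) for records that do not fit the reference."""
    problems = []
    for line in vcf_lines:
        # Skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        last = pos
        # Assumption: INFO/END, when present, marks the last affected base
        if len(fields) > 7:
            for entry in fields[7].split(";"):
                if entry.startswith("END="):
                    last = max(last, int(entry[4:]))
        if chrom not in lengths:
            problems.append((chrom, pos, "chromosome not in index"))
        elif last > lengths[chrom]:
            problems.append((chrom, pos, "extends beyond chromosome end"))
    return problems
```

To run it on real files, pass open file handles: `read_fai_lengths(open("genome.fa.fai"))` for the index and `gzip.open("variants.vcf.gz", "rt")` for a compressed VCF (both file names are placeholders). Any record it reports is a candidate for the kind of `substr` failure shown in the log.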

Parsoa commented 4 years ago

You are correct. It seems the reference file I had is bogus. I switched to a new reference and that error is gone, but I now get a new one:

[13/05/2020 22:12:32] Parsing variants in unit 1 ...
bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyper/VariantFileParser.cpp:230: bool VariantFileParser::constructVariantClusterGroups(InferenceUnit*, uint, const Chromosomes&): Assertion `(total_num_variants != num_variants) == variants_infile_fstream.good()' failed.
[1]    33028 abort       cluster -v  -s samples.tsv -g  -p 16

Should I open a new issue?

jonassibbesen commented 4 years ago

BayesTyper does not support bgzip-compressed files (only gzip). This error normally arises when a bgzip-compressed file is used.
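Because bgzip and gzip files usually share the `.gz` extension, it can help to inspect the file header. The sketch below, in Python, relies on the BGZF layout: each block is a gzip member with the FEXTRA flag set and an extra subfield tagged `BC`. The names `is_bgzf` and `recompress_as_gzip` are hypothetical helpers for this illustration, not part of BayesTyper or htslib. The recompression step relies on Python's `gzip` module reading multi-member streams, which BGZF files are; whether BayesTyper accepts the result has not been verified here beyond the statement above that plain gzip is supported.

```python
import gzip
import shutil
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the leading bytes look like a BGZF (bgzip) block header."""
    # gzip magic (1f 8b), deflate method (08), and the FEXTRA flag bit set
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    xlen = struct.unpack("<H", header[10:12])[0]
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for SI1='B', SI2='C' with a 2-byte payload
    i = 0
    while i + 4 <= len(extra):
        si1, si2 = extra[i], extra[i + 1]
        slen = struct.unpack("<H", extra[i + 2:i + 4])[0]
        if si1 == 66 and si2 == 67 and slen == 2:
            return True
        i += 4 + slen
    return False


def recompress_as_gzip(src: str, dst: str) -> None:
    """Decompress src (gzip or bgzip) and rewrite it as a plain gzip file."""
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

A typical use would be to read the first few dozen bytes of the VCF with `open(path, "rb").read(64)`, pass them to `is_bgzf`, and call `recompress_as_gzip` only when it returns `True`.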

Parsoa commented 4 years ago

You are correct; I had indeed passed a bgzip file. Thanks.