ConvertAllele and Variant normalization

bioinformatics-centre / BayesTyper

A method for variant graph genotyping based on exact alignment of k-mers

ConvertAllele and Variant normalization #19

Open malmarri opened 4 years ago

malmarri commented 4 years ago

Hi,

Firstly thank you for creating such a great resource. I have two questions (using v1.5):

1) I am trying to genotype SVs identified by Manta (only SVs). When following the steps and using bayesTyperTools convertAllele, almost 20% of variants are skipped, for example:

Skipped 1219 unsupported allele(s):

    - 307 <INS> alternative allele(s)
    - 912 translocation alternative allele(s)

In your new version I understand that there is added support for insertions, is there a way to rescue these skipped insertions?

2) At the variant normalization step using bcftools norm a significant number of variants suffer from errors and are not normalized, for example:

Non-ACGTN reference allele at chr3:52803269

Do you have any recommendations for this? I tried the bcftools norm --check-ref ws to fix 'bad sites'.

Best wishes, Mo

jonassibbesen commented 4 years ago

Hi Mo,

Thank you for your interest in our tool. BayesTyper does not support translocations, so these will always be filtered by convertAllele. For the insertions, the inserted sequence needs to be present in either the INFO field or given as a separate fasta file (--alt-file). BayesTyper needs the sequence of the insertion in order to be able to genotype it. For Manta specifically, you can use the --keep-partial option, which allows partial insertions (insertions where only the left and right sides are known and not the whole sequence) to be added as well. The left and right sides of the partial insertion are connected with N's.

Regarding the normalization, it sounds like either you are running on a VCF still containing symbolic alleles (like <INS> or <DEL>) or the sequences in your file contain nucleotides that are not A, C, G, T or N.
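
To locate the two cases described above before running bcftools norm, one option is to scan the VCF for offending alleles. The following is a minimal sketch, not part of BayesTyper or bcftools: the function name `find_problem_alleles` and the example records are made up for illustration, and the check assumes a plain tab-separated VCF with REF in column 4 and ALT in column 5.

```python
import re

# Plain sequence alleles: only A, C, G, T or N (case-insensitive).
PLAIN_ALLELE = re.compile(r"[ACGTN]+", re.IGNORECASE)


def find_problem_alleles(vcf_lines):
    """Yield (chrom, pos, field, allele, reason) for alleles that are
    symbolic, breakends, or contain characters other than A, C, G, T, N."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a complete variant record
        chrom, pos, ref, alts = cols[0], cols[1], cols[3], cols[4]
        alleles = [("REF", ref)] + [("ALT", a) for a in alts.split(",")]
        for field, allele in alleles:
            if allele in (".", "*"):
                continue  # missing / spanning-deletion placeholders
            if allele.startswith("<") and allele.endswith(">"):
                yield chrom, pos, field, allele, "symbolic"
            elif "[" in allele or "]" in allele:
                yield chrom, pos, field, allele, "breakend"
            elif not PLAIN_ALLELE.fullmatch(allele):
                yield chrom, pos, field, allele, "non-ACGTN"


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tACGT\t.\tPASS\t.",
    "chr1\t200\t.\tT\t<INS>\t.\tPASS\t.",
    "chr3\t300\t.\tR\tA\t.\tPASS\t.",
]
hits = list(find_problem_alleles(example))
for hit in hits:
    print(*hit)
```

For a real file, the same function can be fed an open file handle (for example `open("calls.vcf")`, a hypothetical path), since it only iterates over lines.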

Please let me know if you have any other questions.

Best,

Jonas

malmarri commented 4 years ago

Thank you for your quick reply Jonas, I very much appreciate it. I just have a few more questions:

1) If I'm only attempting to genotype structural variation, do I still need SNPs and INDELs from these samples included in the candidate_variants file? Or are the SNPs and INDELs in the variation prior sufficient?

2) Is the variation prior necessary if I have a large high-coverage human dataset (around 1000 samples) from over 30 diverse human populations? My dataset captures the (at least common) variation from most human populations, so I was wondering whether the prior is needed in this case.

3) Counting kmers for just one sample using kmc results in a huge file, 150GB, and the subsequent bloom step creates a 50GB file. Creating this for the whole dataset requires a very large amount of space. I just want to ask if this is normal (or I might have been doing something wrong), and if it is, do you have any recommendations on this?

Thanks again, Mo

jonassibbesen commented 4 years ago

Hi, Sorry for the delayed reply.

1) I would recommend using the SNVs and indels from the samples if possible. These variants are important for correctly matching kmers in the sequencing reads to the SVs. Given that the prior only contains common SNVs, there are likely going to be many SNVs close to SVs that are only in your samples.

2) The prior is not strictly necessary, but is used to potentially increase sensitivity if your candidate set does not contain all putative variants. You are right that in your case using the prior might not provide as big of an advantage.

3) Yes, the kmc output can get quite big. I am a bit surprised that the bloom filter is also that big. Normally it is closer to ~20 GB in my experience using high-coverage data (~35x). One trick you can use is to filter singleton kmers (only observed once) by removing the -ci1 option. This should result in far fewer kmers and thus smaller file sizes. It might result in lower accuracy, but for high-coverage data it should not be by much.
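
The effect of the singleton filtering described above can be illustrated with a toy example. This is only a sketch of the idea and not how KMC works internally: the reads are made up, reverse complements are ignored, and the names `count_kmers` and `drop_below` are hypothetical helpers. The `min_count` argument plays the role of a minimum-count threshold such as KMC's -ci.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_below(counts, min_count):
    """Keep only k-mers observed at least min_count times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Made-up reads: the second read differs in its last base, which mimics
# a sequencing error and produces a k-mer that is seen only once.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
counts = count_kmers(reads, k=5)
kept = drop_below(counts, min_count=2)
print(len(counts), "distinct k-mers;", len(kept), "seen at least twice")
```

In real high-coverage data, most singleton k-mers come from sequencing errors, which is why dropping them shrinks the output considerably while costing little accuracy.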

Please let me know if you have any other questions.

Best,

Jonas