AlgoLab / malva

genotyping by Mapping-free ALternate-allele detection of known VAriants
https://algolab.github.io/malva/
GNU General Public License v3.0

MALVA with low coverage data #6

Open OliverPStuart opened 3 years ago

OliverPStuart commented 3 years ago

Hi there,

We'd like to try using MALVA on our own low-coverage WGS data (~1x). We've noticed that the MALVA release we're using (version 1.3.1; build h3889886_0) only genotypes sites where a sample has coverage of at least 2. Is there a way to change this default so that sites with coverage of 1 are genotyped as well? There's nothing obvious in the provided flags, but maybe it's possible to modify the original code.

mpre commented 3 years ago

Hi Oliver, as you correctly understood, MALVA filters out kmers occurring only once and considers them as errors. There's no easy way to avoid this using the version distributed through conda.

If you use the version available here on GitHub, you can edit line 107 of the MALVA bash script in the root directory and add the -ci1 flag after ${KMC_BIN}. This lowers KMC's minimum k-mer count from its default of 2 to 1.

Please keep in mind that MALVA relies on high coverage to call genotypes, so the results you get after setting that flag to 1 might be inaccurate.
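To make the effect of the minimum-count filter concrete, here is a minimal Python sketch. It is not KMC or MALVA code: `count_kmers` and `apply_min_count` are hypothetical helpers, the reads are made up, and the toy counter ignores canonical k-mers (KMC by default counts a k-mer together with its reverse complement). It only illustrates why k-mers seen once disappear with a threshold of 2 and survive with a threshold of 1.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads (toy counter)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def apply_min_count(counts, min_count):
    """Keep only k-mers seen at least min_count times.

    Hypothetical helper, meant to mimic the idea behind KMC's -ci option.
    """
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Made-up reads: the first locus is covered once, the second twice.
reads = ["ACGTAC", "TTGCA", "TTGCA"]
counts = count_kmers(reads, k=4)

kept_default = apply_min_count(counts, min_count=2)  # like the default threshold
kept_all = apply_min_count(counts, min_count=1)      # like adding -ci1

print(sorted(kept_default))
print(sorted(kept_all))
```

With a threshold of 2, only the k-mers from the doubly covered read remain; with a threshold of 1, the singleton k-mers are kept too, along with any sequencing errors that happen to produce singletons.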

OliverPStuart commented 3 years ago

Thank you. I've given this a try and it does change the behaviour somewhat (i.e. the outputs are different), but there are no genotypes in the output called from low-frequency (n=1) k-mers. Is there anything in the design of MALVA that would create a case where a genotype is not called even when a k-mer is found that corresponds to it?

I appreciate that our use case is definitely not what MALVA was designed for (coverage and organism) so I'm interested to get a better handle on how MALVA operates so we can decide if it suits our project.

ldenti commented 3 years ago

Hi Oliver, a quick question: when you say "there are no genotypes in the output", do you mean the variants are called 0 instead of 1?

MALVA uses allele frequencies in the population and k-mer coverages to compute the likelihood of each possible genotype of a variant, and then assigns the most likely one. It may be that the a priori probabilities used (i.e., by default, the frequencies of each allele in the considered population) are forcing MALVA to call a variant 0 because the coverage for the alternate allele is not high enough.

Can you please send a variant from your input VCF file that has been miscalled by MALVA?
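The interplay between population priors and k-mer coverage described above can be illustrated with a small Bayesian sketch in Python. This is not MALVA's actual model: the Hardy-Weinberg priors, the binomial likelihood, the `error_rate` value, and the function names are all assumptions made for illustration. It shows only the general mechanism by which a single alternate-allele k-mer can be outweighed by a low prior.

```python
from math import comb


def genotype_posteriors(ref_count, alt_count, alt_freq, error_rate=0.01):
    """Posterior over 0/0, 0/1, 1/1 for one biallelic site (toy model)."""
    n = ref_count + alt_count
    # Assumed Hardy-Weinberg priors from the population alternate-allele frequency.
    priors = {
        "0/0": (1 - alt_freq) ** 2,
        "0/1": 2 * alt_freq * (1 - alt_freq),
        "1/1": alt_freq ** 2,
    }
    # Assumed probability that one observed k-mer supports the alternate allele.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    # Prior times binomial likelihood of the observed counts.
    unnorm = {
        g: priors[g]
        * comb(n, alt_count)
        * p_alt[g] ** alt_count
        * (1 - p_alt[g]) ** ref_count
        for g in priors
    }
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


def call_genotype(ref_count, alt_count, alt_freq, error_rate=0.01):
    """Return the genotype with the highest posterior probability."""
    post = genotype_posteriors(ref_count, alt_count, alt_freq, error_rate)
    return max(post, key=post.get)


# One alternate k-mer, rare allele: the prior dominates.
rare_low_cov = call_genotype(ref_count=0, alt_count=1, alt_freq=0.001)
# One alternate k-mer, common allele: the same evidence is enough.
common_low_cov = call_genotype(ref_count=0, alt_count=1, alt_freq=0.3)
# Rare allele, but plenty of evidence on both alleles.
rare_high_cov = call_genotype(ref_count=15, alt_count=15, alt_freq=0.001)

print(rare_low_cov, common_low_cov, rare_high_cov)
```

In this toy model a single alternate k-mer at a rare site is still called 0/0, while the same observation at a common site, or ample coverage at the rare site, yields 0/1. Whether this matches what happens in a given MALVA run depends on its real likelihood model and the allele frequencies in the input VCF.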