ANGSD / angsd

Program for analysing NGS data.
227 stars 50 forks source link

[question] Applying the genotype likelihood framework to SVs #422

Open clairemerot opened 3 years ago

clairemerot commented 3 years ago

ANGSD suite is very useful and relevant to study SNPs variation in low-coverage data and I am wondering about how to best take advantage of the existing framework to also study other kinds of variants like structural variants.

I noticed that, as an input, one can give a vcf or a beagle. Let's say one manage to get a set of genotype likelihoods for biallelic SVs (small indels or larger deletions, inversions). It seems that I am getting that from vg call for instance, in which the likelihood of genotype is ponderated by depth support (but not read quality).

Would it make sense to try transforming this vcf of GL into a beagle (using the 1st position of the SVs as POS and dummy major minor (reference/alternative for instance) to use some ANGSD tools like SFS/Fst, or PCAangsd for PCA? Are there any caveats that I am unaware of that would be a problem either technical or statistical?

This is a relatively new direction, so I'd be happy to hear other people thoughts on the question, particularly from the ANGSD team... I believe that if this works, it will be very useful for the community as papers genotyping SVs are flourishing but there is no framework to really deal with them neither in a probabilistic framework nor at population level.

Thanks a lot for your help and attention. And many thanks for maintaining this very useful tool!!! Best Claire

nspope commented 3 years ago

Hi Claire, shouldn't be a problem using genotype likelihoods from biallelic SVs for the algorithms implemented in angsd. You would have to trick angsd by recoding as SNPs like you mention. GATK's HaplotypeCaller is a good option to get GLs for short SVs.

I would think carefully about what you are calculating before combining SNPs and SVs in the same dataset though-- I think it's probably fine for things like admixture or PCA but would be dubious for estimating population genetic parameters that are scaled by mutation rate (e.g. various estimators of theta).

clairemerot commented 3 years ago

Thanks a lot for your insights! Oh yes I agree I wouldn't touch thetas with this kind of data. It will also be a problem to know the invariants sites given the non-coded size of the SVs. I'll try to proceed and format the files as best as I can to get into PCA and Fst, maybe MAF. i can keep in touch about what works and what doesn't. Cheers

clairemerot commented 3 years ago

Some follow-up: Converting genotypes likelihoods of SVs (from vcf) into beagle has been successful. I had to make minor adjustments to make it readable by angsd (chr_pos for instance + dummy value for REF/ALT alleles), and to normalize GL with a home-made script (otherwise low values led to division by zero when read by PCangsd). Looking at the GL of SVs, I feel that's interesting to keep part of the uncertainty (and deal with missing data), and we have a nicer dataset than when calling GT.

A PCA with pcangsd ran smoothly and provide expected results.

When trying Fst, I am facing an issue to do the Saf. It requires an ancestral genome for polarisation (-anc) which, if provided, makes a conflict because of my dummy alleles in the beagle. Instead, I have tried to provide a sites files (-sites) with the same dummy alleles and -doMajorMinor 3 BUT it seems that an input -beagle and -doMajorMinor are incompatible from lines 110 -113 from abcMajorMinor.cpp scrip

if(inputtype==INPUT_BEAGLE&&doMajorMinor){
    fprintf(stderr,"\t-> Potential problem: Cannot estimate the major and minor based on posterior probabilities\n");
    exit(0);
  }

I'm not sure why this error appears but this prevents -doSaf to run. If one of you have insight or help on that point, I'd be interested! Alternatively, my plan is to re-work on my dummy alleles to make them match the reference/ancestral genome.

[edit] A few days later... It does work!! I just needed to use -doSaf 4 & -doMaf 4 (instead of -doSaf 1 and -doMaf 1). Sorry about that!!

Thanks for your help and this great tool! Best regards Claire