fgvieira / ngsLD

Calculation of pairwise Linkage Disequilibrium (LD) under a probabilistic framework
GNU General Public License v2.0
43 stars 7 forks source link

ANGSD VCF reformatting #52

Closed NicMAlexandre closed 9 months ago

NicMAlexandre commented 10 months ago

Hello,

I am using the vcf output from ANGSD which contains the PL and PP fields.

  1. Do you have any tips for formatting this file as input for ngsLD?

  2. I have 11 million SNPS, 7 Million above 1% MAF, and 6 million > 5% MAF. What is your recommendation for filtering by allele frequency to reduce the run time and memory.

NicMAlexandre commented 10 months ago

Here is a typical line:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 12443

Scaffold1.1 34783 . A G . PASS NS=386;AF=0.372944 GT:GL:PL:GP 0/1:-31.6,0,-90.2:316,0,902:1.68258e-32,1,0

fgvieira commented 10 months ago

The easiest way is to also output the beagle file on angsd.

As for filtering, low freq SNPs are not very informative, so you can remove them from the LD analysis (both 1% and 5% are ok); but keep in mind that those SNPs will not be included in the output file! Meaning that, if you are planning to do LD pruning afterwards, those sites won't show up in the list of unlinked sites.

If you are interested in those sites, you can either filter them on the pruning step (but ngsLD will be slower) or use the list of excluded/linked sites and remove them from your dataset.

NicMAlexandre commented 10 months ago
  1. I see, I can exclude variants less than 5% MAF and then include them in my list of unlinked sites afterwards. That is good to know about the output!

  2. Can you expand on this I'm having trouble understanding "If you are interested in those sites, you can either filter them on the pruning step (but ngsLD will be slower) or use the list of excluded/linked sites and remove them from your dataset." Are you saying either I can 1) include the lower MAFs and run ngsLD, which will be slower from the larger number of sites, or 2) I can just exclude those sites and just add them to my list of unlinked sites after getting a list of pruned variants.

  3. As for the format, the problem is that I don't have access to bams because they take up too much space on our server, and ANGSD doesn't convert VCF files to GLF format, so I have to work directly with VCFs. I can convert the VCF to beagle format, although I'm not sure if ANGSD beagle output is slightly different from other beagle formats (I will confirm).

fgvieira commented 10 months ago
  1. Yes, that would work.
  2. I meant that you have two options:
    1. run ngsLD on your full dataset (slower) with extended output, prune linked sites but ignoring LD between sites with low MAF (untested eg: prune_graph -i example.ld --header --weight-field "r2" --weight-filter "dist < 150000 && r2 > 0.2 && maf1 > 0.05 && maf2 > 0.05" --out out.keep), use list of unlinked sites.
    2. filter low MAF sites, run ngsLD, prune linked sites, add low MAF sites to list of unliked sites
  3. You can use angsd to read the GL from a VCF and create a beagle file (http://popgen.dk/angsd/index.php/Input#BCF.2FVCF_files), but I have not tested it. And remember to check if your VCF uses the GP, PL or the GL tag (and use the corresponding option as in https://github.com/ANGSD/angsd/blob/8a58e179d23a3230adf2fa4115e16ecf057a9695/multiReader.cpp#L62-L64).
NicMAlexandre commented 9 months ago

Thank you! 1 and 2 are all sorted out!

I am still trying to work on getting the correct Geno file, but ANGSD ran successfully. Here is an example beagle file for two individuals and two positions:

marker allele1 allele2 Ind0 Ind0 Ind0 Ind1 Ind1 Ind1 CM060204.1_15474 0 1 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 CM060204.1_16136 0 2 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000

fgvieira commented 9 months ago

If Ind0 has tow heterozygotes and Ind1 has two homozygotes major, then it looks fine. 😄

NicMAlexandre commented 9 months ago

Yes they do, excellent! Since we are limiting comparisons to about 50Kb or 100Kb, I am going to split each job by scaffold which should speed up the process.

NicMAlexandre commented 9 months ago

It is working thanks a bunch! I will close this!!