Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
47 stars 11 forks source link

SNP loading values from PCs using --selection_e #73

Open HomereAMK opened 1 year ago

HomereAMK commented 1 year ago

I’m having difficulty finding a way to obtain SNP loading values from a PCA axis using pcangsd. I’m curious to know if the SNPs contributing the most to the 1st axis are not the same that contribute the most to the 2nd, and where they the most contributing SNPS are localized in my species genome. I ran pcangd on a beagle file with these parameters and in particular the --selection_e and --snp_weights flags:

  pcangsd -b "$OUTPUTFOLDER/6feb23_scaffold8_InvReg.beagle.gz" \
  --selection \
  --selection_e 10 \
  --pcadapt \
  --sites_save \
  --snp_weights \
  --pcadapt \
  --maf_save \
  -e 10 \
  --threads 40 \
  -o "$OUTPUTFOLDER/11apr23RNLOUe10_scaffold8_InvReg_pcangsd"

I got these outputs:

-rw-r--r-- 1 hmon dp_00007 374K Apr 11 15:24 scaffold8/11apr23RNLOUe10_scaffold8_InvReg_pcangsd.sites
-rw-r--r-- 1 hmon dp_00007 1.5M Apr 11 15:24 scaffold8/11apr23RNLOUe10_scaffold8_InvReg_pcangsd.maf.npy
-rw-r--r-- 1 hmon dp_00007  15M Apr 11 15:24 scaffold8/11apr23RNLOUe10_scaffold8_InvReg_pcangsd.pcadapt.zscores.npy
-rw-r--r-- 1 hmon dp_00007  15M Apr 11 15:24 scaffold8/11apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy
-rw-r--r-- 1 hmon dp_00007  15M Apr 11 15:24 scaffold8/11apr23RNLOUe10_scaffold8_InvReg_pcangsd.selection.npy
-rw-r--r-- 1 hmon dp_00007 3.4M Apr 11 15:24 scaffold8/11apr23RNLOUe10_scaffold8_InvReg_pcangsd.cov
-rw-r--r-- 1 hmon dp_00007  488 Apr 11 15:23 scaffold8/11apr23RNLOUe10_scaffold8_InvReg_pcangsd.args 

But I'm not sure on how to proceed next to get the loadings per SNPs per PCs... Thanks! Homère

Rosemeis commented 1 year ago

Hi Homère,

The SNP weights (loadings) are stored in the "11apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy" file. :-) It is in binary NumPy format, so you can open the file in Python or R:

import numpy as np
W = np.load("11apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy")
library(RcppCNPy)
W <- npyLoad("11apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy")

Best, Jonas

HomereAMK commented 1 year ago

Hi Jonas,

Thanks for your speedy answer, much appreciated! I have a follow-up question regarding the SNP order in the output files. Specifically, I'm wondering if the order of SNPs in the Beagle file is preserved in the output of the pcangsd job. Additionally, is the SNP order maintained in the initial ANGSD job where I used the -sites flag with a list of SNPs like the following:

head $OUTPUTFOLDER/scaffold8/scaffold8_InvReg_awk.txt
scaffold8 32001789 T C
scaffold8 32001826 C T
scaffold8 32001877 C A
scaffold8 32001892 C T
scaffold8 32001902 T C
scaffold8 32001906 T C
scaffold8 32001912 C T
scaffold8 32001942 A C
scaffold8 32001953 A G
scaffold8 32002065 A G

I ask because when I plot the 1apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy file (SNPs against their loading values for PC1), I observe an unusual distribution where the most contributing SNPs are clustered at the beginning of the region of interest, assuming that the order of SNPs is maintained throughout ANGSD and pcangsd jobs... image

#the r code used 
Reg08_PCloading <- read_tsv("~/Desktop/Data/11apr23RNLOUe10_scaffold8_InvReg_pcangsd_snploadings.tsv")
ggplot(Reg08_PCloading, aes(x = 1:nrow(Reg08_PCloading), y = V1)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, nrow(Reg08_PCloading), by = 5000)) +
  ggtitle("Manhattan plot PC1 loadings")

Best, Homère

Rosemeis commented 1 year ago

Hmm, yes that does look a bit funky. The order of SNPs will be the same as in the Beagle file. If there has been some filtering (MAF etc.), then some sites have been removed but the overall ordering is not changed.