Open HomereAMK opened 1 year ago
Hi Homère,
The SNP weights (loadings) are stored in the "11apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy" file. :-) It is in binary NumPy format, so you can open the file in Python or R:
import numpy as np
W = np.load("11apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy")
library(RcppCNPy)
W <- npyLoad("11apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy")
Best, Jonas
Hi Jonas,
Thanks for your speedy answer, much appreciated! I have a follow-up question regarding the SNP order in the output files. Specifically, I'm wondering if the order of SNPs in the Beagle file is preserved in the output of the pcangsd job. Additionally, is the SNP order maintained in the initial ANGSD job where I used the -sites flag with a list of SNPs like the following:
head $OUTPUTFOLDER/scaffold8/scaffold8_InvReg_awk.txt
scaffold8 32001789 T C
scaffold8 32001826 C T
scaffold8 32001877 C A
scaffold8 32001892 C T
scaffold8 32001902 T C
scaffold8 32001906 T C
scaffold8 32001912 C T
scaffold8 32001942 A C
scaffold8 32001953 A G
scaffold8 32002065 A G
I ask because when I plot the 1apr23RNLOUe10_scaffold8_InvReg_pcangsd.weights.npy file (SNPs against their loading values for PC1), I observe an unusual distribution where the most contributing SNPs are clustered at the beginning of the region of interest, assuming that the order of SNPs is maintained throughout ANGSD and pcangsd jobs...
#the r code used
Reg08_PCloading <- read_tsv("~/Desktop/Data/11apr23RNLOUe10_scaffold8_InvReg_pcangsd_snploadings.tsv")
ggplot(Reg08_PCloading, aes(x = 1:nrow(Reg08_PCloading), y = V1)) +
geom_point() +
scale_x_continuous(breaks = seq(0, nrow(Reg08_PCloading), by = 5000)) +
ggtitle("Manhattan plot PC1 loadings")
Best, Homère
Hmm, yes that does look a bit funky. The order of SNPs will be the same as in the Beagle file. If there has been some filtering (MAF etc.), then some sites have been removed but the overall ordering is not changed.
I’m having difficulty finding a way to obtain SNP loading values from a PCA axis using pcangsd. I’m curious to know if the SNPs contributing the most to the 1st axis are not the same that contribute the most to the 2nd, and where they the most contributing SNPS are localized in my species genome. I ran pcangd on a beagle file with these parameters and in particular the --selection_e and --snp_weights flags:
I got these outputs:
But I'm not sure on how to proceed next to get the loadings per SNPs per PCs... Thanks! Homère