Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt

Mary-00 commented 5 years ago

Dear Developers, With reference to https://github.com/bcm-uga/pcadapt/issues/31, I’m working with a list of SNP variants related to a complex disease. I would like to find which SNPs are outlier between my population and each population of 1000 genome project. To this end, I calculated Fst, deltaDAF, and used pcadapt (version 4.0.3) to find outlier SNPs between my population and each population of 1000 genome. The below code was used:

vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno_pcadapt <- read.pcadapt ("file2.pcadapt", type="pcadapt")
x <- pcadapt (input = geno_pcadapt, K = 2)
plot (x, option = "screeplot")
res <- pcadapt (geno_pcadapt, K = 2, LD.clumping = list(size=200, thr = 0.1))
p.adj <- p.adjust (res$pvalue, method=”fdr”)
alpha <- 0.05
outliers <- which (p.adj < alpha)

While the results of Fst and deltaDAF are highly correlated (based on Pearson’s coefficient), the pcadapt returned so different result that doesn’t match with Fst and delatDAF. I’m not sure how I shall interpret these results or there may be anything wrong with pcadapt. I also test Outflank, but it gave me strange output even for Fst; as its author was not responsive, I leave this analysis. Anyway, could you please kindly suggest me how I can interpret these results?

Almost all plots is similar to https://github.com/bcm-uga/pcadapt/issues/31; however, please let me know if you need more information.

Best

privefl commented 5 years ago

Do you still have only a few thousands SNPs? Then, did you try to add null SNPs as I recommended in the other issue?

Mary-00 commented 5 years ago

Yes, I have about 2400 SNPs. Sorry, what's your mean with null SNP and please kindly tell me how I shall add it?

privefl commented 5 years ago

If you go back to https://github.com/bcm-uga/pcadapt/issues/31, I point to https://github.com/bcm-uga/pcadapt/issues/24 that is a similar issue as yours I think. At the end, I show how to add null SNPs because it is important for pcadapt.

mblumuga commented 5 years ago

If I get it right, you repeat the following operation find SNPs differentiated between "your" population and a population of choice of the 1000 genome

If this is correct, you should use K=1 because PC1 will be the axis corresponding to differentiation between "your" population and a population of choice of the 1000 genome

Mary-00 commented 5 years ago

Right, but, now I also calculated deltaDAF and find high correlation between Fst and deltaDAF. Yes, I used K=1 for pcadapt. I'm following privefl's comment and adding null SNP.

Mary-00 commented 5 years ago

Hi, Privefl, I followed your instruction and used the below command:

vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno <- read.pcadapt ("file2.pcadapt", type="pcadapt")

adding Null SNP
mat <- pcadapt::bed2matrix(geno)
N <- nrow(mat)
M <- 50e3
mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {rbinom(N, size = 2, prob = af)})
mat2 <- cbind(mat, mat_null)
geno <- read.pcadapt(mat2, type = "lfmm")

x <- pcadapt (input = geno, K = 1)
res <- pcadapt (geno, K = 1, LD.clumping = list(size=200, thr = 0.1))
p.adj <- p.adjust (res$pvalue, method=”fdr”)
alpha <- 0.05
outliers <- which (p.adj < alpha)

with this code, pcadapt returned all my SNP (2400) as outlier!!. Could you please kindly check the code and let me know your suggestion?

Many thanks

privefl commented 5 years ago

Does the histogram of p-values looks good? Can you share your vcf file?

Mary-00 commented 5 years ago

Thanks a lot for your prompt feedback. Here is p-value histogram

pval4

Hope the problem solved here; honestly, I should consult with our head for sharing.

privefl commented 5 years ago

Seems good.

Mary-00 commented 5 years ago

Hi privefl, Thanks, but I get all my SNPs as outliers! that is not good. Please kindly share me if you have another suggestion/advice to solve the issue or I share the vcf file with you?

privefl commented 5 years ago

I'm just saying that the p-values looks calibrated, which is good.

For your SNPs, maybe they are all outliers given how you chosed this subset in the first place. Or maybe, pcadapt is not fit for your particular problem. You should run it on the entire chip if possible.

mblumuga commented 5 years ago

Nope, your procedure does not seem good to me @Mary-00. You have simulated null SNPS using mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {rbinom(N, size = 2, prob = af)}) I do not expect this procedure to be correct.

What you should do instead is to include both SNP variants related to a complex disease and random SNPs in the genome (or all other SNPs). Doing that, you will see that the SNPs related to the disease will not be systematically considered as outliers.

Mary-00 commented 5 years ago

Hi Thank you for all your feedback. I used the code as privefl suggested.

Actually, the selected variants were resulted from the whole genome sequencing, not chip; so the all variants are located at the various VCF files for each chromosome. This is also true for 1000 genome population, which the corresponded variants are in the various VCF files for each chromosome; for using all variants for pcadapt, I should merged the vcf files of all chromosomes in my population and 1000 genome population, then merged them and feed a very huge file into pcadapt, yes? If pcadapt can handle such a big file? So, using random SNP may be more useful than all variants, could you please kindly share me your idea about it?

In the case of using random SNP, I should extract the random SNP with matched (similar) allele frequency and LD value with my SNP list from my population and 1000 genome population, am I correct? As far as I searched, I could not find any tool/script for extracting such a random SNPs (my mean random SNP with similar allele frequency and LD value with my SNP list). Could you please kindly share me if you have any suggestion/solution to get such a random SNP and how I should add them to my real variants?

I would like to stay with pcadapt in my work, hop the problem solved.

Many thanks in advance

privefl commented 5 years ago

I would use PLINK to subset variants. You can provide a list of variant IDs as well as randomly keeping a subset proportion as well as performing some pruning.

mblumuga commented 5 years ago

I should merged the vcf files of all chromosomes in my population and 1000 genome population, then merged them and feed a very huge file into pcadapt, yes? Yes, pcadapt can handle very large genotype files. If you can use all polymorphic variants, it is even better than a random subsample. Will be away from github for some time.

Mary-00 commented 5 years ago

Thanks for your feedback. First, I prefer trying random SNP as privefl suggested;however, I'm a bit confused what exactly shall I do. Could you please advice me how I can get random SNP considering similar allele frequency and LD value to my snp list from various populations of 1000 genome?

Thanks

privefl commented 5 years ago

I don't know how to do this precisely. I would just subset to SNPs with MAF > 5% and then get a random proportion.

Mary-00 commented 5 years ago

OK, so I considered just MAF as you suggested; however, I found that my population has many variants with allele frequency (AF) of lower than 0.05, unlike the 1000 genome populations; actually, these variants are really different, so how we don't consider them?

Regarding the random SNP selection, the number of random SNP should be similar to my SNP number, around 2400, is it right?

privefl commented 4 years ago

I see that I haven't followed up on this. Did you manage to solve the issue?

bcm-uga / pcadapt

Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt #39