Closed Mary-00 closed 3 years ago
Do you still have only a few thousands SNPs? Then, did you try to add null SNPs as I recommended in the other issue?
Yes, I have about 2400 SNPs. Sorry, what's your mean with null SNP and please kindly tell me how I shall add it?
If you go back to https://github.com/bcm-uga/pcadapt/issues/31, I point to https://github.com/bcm-uga/pcadapt/issues/24 that is a similar issue as yours I think. At the end, I show how to add null SNPs because it is important for pcadapt.
If I get it right, you repeat the following operation find SNPs differentiated between "your" population and a population of choice of the 1000 genome
If this is correct, you should use K=1 because PC1 will be the axis corresponding to differentiation between "your" population and a population of choice of the 1000 genome
Right, but, now I also calculated deltaDAF and find high correlation between Fst and deltaDAF. Yes, I used K=1 for pcadapt. I'm following privefl's comment and adding null SNP.
Hi, Privefl, I followed your instruction and used the below command:
vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno <- read.pcadapt ("file2.pcadapt", type="pcadapt")
adding Null SNP
mat <- pcadapt::bed2matrix(geno)
N <- nrow(mat)
M <- 50e3
mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {rbinom(N, size = 2, prob = af)})
mat2 <- cbind(mat, mat_null)
geno <- read.pcadapt(mat2, type = "lfmm")
x <- pcadapt (input = geno, K = 1)
res <- pcadapt (geno, K = 1, LD.clumping = list(size=200, thr = 0.1))
p.adj <- p.adjust (res$pvalue, method=”fdr”)
alpha <- 0.05
outliers <- which (p.adj < alpha)
with this code, pcadapt returned all my SNP (2400) as outlier!!. Could you please kindly check the code and let me know your suggestion?
Many thanks
Does the histogram of p-values looks good? Can you share your vcf file?
Thanks a lot for your prompt feedback. Here is p-value histogram
Hope the problem solved here; honestly, I should consult with our head for sharing.
Seems good.
Hi privefl, Thanks, but I get all my SNPs as outliers! that is not good. Please kindly share me if you have another suggestion/advice to solve the issue or I share the vcf file with you?
I'm just saying that the p-values looks calibrated, which is good.
For your SNPs, maybe they are all outliers given how you chosed this subset in the first place. Or maybe, pcadapt is not fit for your particular problem. You should run it on the entire chip if possible.
Nope, your procedure does not seem good to me @Mary-00. You have simulated null SNPS using
mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {rbinom(N, size = 2, prob = af)})
I do not expect this procedure to be correct.
What you should do instead is to include both SNP variants related to a complex disease and random SNPs in the genome (or all other SNPs). Doing that, you will see that the SNPs related to the disease will not be systematically considered as outliers.
Hi Thank you for all your feedback. I used the code as privefl suggested.
Actually, the selected variants were resulted from the whole genome sequencing, not chip; so the all variants are located at the various VCF files for each chromosome. This is also true for 1000 genome population, which the corresponded variants are in the various VCF files for each chromosome; for using all variants for pcadapt, I should merged the vcf files of all chromosomes in my population and 1000 genome population, then merged them and feed a very huge file into pcadapt, yes? If pcadapt can handle such a big file? So, using random SNP may be more useful than all variants, could you please kindly share me your idea about it?
In the case of using random SNP, I should extract the random SNP with matched (similar) allele frequency and LD value with my SNP list from my population and 1000 genome population, am I correct? As far as I searched, I could not find any tool/script for extracting such a random SNPs (my mean random SNP with similar allele frequency and LD value with my SNP list). Could you please kindly share me if you have any suggestion/solution to get such a random SNP and how I should add them to my real variants?
I would like to stay with pcadapt in my work, hop the problem solved.
Many thanks in advance
I would use PLINK to subset variants. You can provide a list of variant IDs as well as randomly keeping a subset proportion as well as performing some pruning.
I should merged the vcf files of all chromosomes in my population and 1000 genome population, then merged them and feed a very huge file into pcadapt, yes? Yes, pcadapt can handle very large genotype files. If you can use all polymorphic variants, it is even better than a random subsample. Will be away from github for some time.
Thanks for your feedback. First, I prefer trying random SNP as privefl suggested;however, I'm a bit confused what exactly shall I do. Could you please advice me how I can get random SNP considering similar allele frequency and LD value to my snp list from various populations of 1000 genome?
Thanks
I don't know how to do this precisely. I would just subset to SNPs with MAF > 5% and then get a random proportion.
OK, so I considered just MAF as you suggested; however, I found that my population has many variants with allele frequency (AF) of lower than 0.05, unlike the 1000 genome populations; actually, these variants are really different, so how we don't consider them?
Regarding the random SNP selection, the number of random SNP should be similar to my SNP number, around 2400, is it right?
I see that I haven't followed up on this. Did you manage to solve the issue?
Dear Developers, With reference to https://github.com/bcm-uga/pcadapt/issues/31, I’m working with a list of SNP variants related to a complex disease. I would like to find which SNPs are outlier between my population and each population of 1000 genome project. To this end, I calculated Fst, deltaDAF, and used pcadapt (version 4.0.3) to find outlier SNPs between my population and each population of 1000 genome. The below code was used:
While the results of Fst and deltaDAF are highly correlated (based on Pearson’s coefficient), the pcadapt returned so different result that doesn’t match with Fst and delatDAF. I’m not sure how I shall interpret these results or there may be anything wrong with pcadapt. I also test Outflank, but it gave me strange output even for Fst; as its author was not responsive, I leave this analysis. Anyway, could you please kindly suggest me how I can interpret these results?
Almost all plots is similar to https://github.com/bcm-uga/pcadapt/issues/31; however, please let me know if you need more information.
Best