Closed apfuentes closed 6 years ago
We are updating the package and the new version does not require to simulate individuals for Pooled-seq data, it is simpler to use. Please read the answer I provide to another user: #10, it is explained how to install and run the 4.0 version.
Great, thanks! Is the current development version stable and safe to use? Is there an estimated date for the release of the new version?
Also, would the other functions of pcadapt
(like PCA, score and scree plots, get.pc, communality stats, component-wise analysis) still apply to Pool-seq data?
It is safe to use, yes. Right now, it assumes that each pool corresponds to a population. It does not account for situations in which you have biological replicates. It is the same as using a PCA with K=10 axes (to distinguish 11 populations).
To answer to your 2nd question, other functions of pcadapt do not apply for Pool-seq data in the 4.0 version because we assume that K=(number of pool -1).
However, it is true that there might be situation where looking at the scree plot and the score plot is meaningful with Pool-seq data. Would you accept to send me your data or part of your data (subset of SNPs) so that I can have a look and see how we can improve the package?
Hi @mblumuga Thanks for your prompt response. For sure, please indicate me what is the best way to send you some data (privately), ~530 MB.
Regarding functionalities for Pool-seq, it would be very useful to identify outlier loci that are common to some populations but not to others (outliers that would be driving structuring among populations=pools), and being able to identify such loci for different PCs (substructuring?). So, maybe the scree and score plots could help on that?
Thanks!
Can you send me an email to michael.blum@univ-grenoble-alpes.fr. I will send you a link to upload your file.
Hi @mblumuga
I want to share that I tried the new version of pcadapt
on my whole Pool-seq dataset and it run fine! It only took ~1h to finish and it used ~12 GB RAM :)
I used the commands described in the tutorial to identify outlier loci in Pool-seq data:
plot(-log10(res$pvalues),pch=19,cex=.5)
padj <- p.adjust(res$pvalues,method="BH")
alpha <- 0.1
outliers <- which(padj < alpha)
length(outliers)
Some questions:
pcadapt
version includes these analyses for Pool-seq data?
Thanks.
Outliers correspond to markers with atypical differences of allele frequencies.
Does the list of outliers correspond to Axes 1-2? Or overall? Overall
Would it be possible that the new pcadapt version includes these analyses for Pool-seq data? Yes, it should be available in the version 4.0, I agree. We are working on it right now. I will get back to you very soon I hope when these features are included.
Fantastic news, THANKS A TON!!
The new version is available on https://github.com/privefl/pcadapt/. Score plot, scree plot, and get.pc functions available for Pooled-seq data. Will be glad to read your remarks if you have any.
Wonderful! Thanks for letting me know it. I will give it a try soon.
Hi,
I would like to confirm that pcadapt
worked well and run fast on my dataset of 7 million SNPs for 16 pools obtained with Pool-seq (whole genome resequencing of pooled DNA). Thanks a lot for improving the package for pool data :smile:
I have some questions: 1) How can the score plot be customized? In the tutorial for pool data it is not mentioned that a score plot can be obtained, but I generated one using the same code as in individual genotype data:
plot(res, option = "scores", pop = poplist.names)
poplist.names
is a vector with pool names (in my case a pool is a population).
I would like to make these changes:
Is there any way I could make these modifications directly using the ggplot package on the PCAdapt results?
2) How to obtain SNP names printed in the get.pc() output?
When I run get.pc(res, outliers)
to identify to which PC an outlier is associated to, I obtain this output:
I would like to obtain the SNP names (not the indexes) that in my data frame correspond to the column names with the format Super_Scaffold0-9370530
.
3) Are these functions available for pool data?
4) What is the difference between get.pc(res, outliers) and component-wise genome scans? I understand that by default the P-values are calculated for all the SNPs over all the retained PCs, and the component-wise genome scan calculates P-values for each PC. In which cases one is preferred over the other? I am interested on identifying outlier SNPs that discriminate among pools.
5) What numerical value generated by pcadapt
is equivalent to cos2
in a Correspondence Analysis (CA)?
I would like to evaluate the degree of association between rows (pools) and a particular axis (in other words, the quality of representation of the rows in each dimension). In a Correspondence Analysis this is equivalent to the squared cosine (cos2
) or the squared correlations between rows/columns and a particular axis (PC). The values of the cos2
are comprised between 0 and 1. The sum of the cos2 for rows on all the CA dimensions is equal to one (explained here).
6) Input data for pool analysis
I just wanted to confirm if the relative allele frequency required as input data for pool analysis is equals to REF allele counts/(REF + ALT counts)
for each pool and SNP.
Thanks a lot!
Glad that pcadapt worked well and your pool data containing 7 million SNPs for 16 pools.
1) How can the score plot be customized? Not possible right now, you have PC scores in res$scores (res<-pcadapt(...)). So res$scores[,1] is the 1st PC, res$scores[,2] is the 2nd PC... You can then use ggplot as usual using these 2 vectors.
2) How to obtain SNP names printed in the get.pc() output? get.pc()[,1] give you the indices, if you have the vector containing all names, you can access to the desired names.
3) Are these functions available for pool data?
Q-Q plot Histogram of the test statistic and of the p-values Component-wise genome scans Yes pool.data <- system.file("extdata", "pool3pops", package = "pcadapt") filename <- read.pcadapt(pool.data, type = "pool") res <- pcadapt(filename) plot(res,option="qqplot") plot(res,option="stat.distribution") res <- pcadapt(filename,method="componentwise")
4) What is the difference between get.pc(res, outliers) and component-wise genome scans? I understand that by default the P-values are calculated for all the SNPs over all the retained PCs, >and the component-wise genome scan calculates P-values for each PC. Correct In which cases one is preferred over the other? I prefer the default option because you get a single list of P-values. Interpreting K list of P-values might be tedious.
5) What numerical value generated by pcadapt is equivalent to cos2 in a Correspondence Analysis (CA)? I do not know.
6) Input data for pool analysis I just wanted to confirm if the relative allele frequency required as input data for pool analysis is equals to REF allele counts/(REF + ALT counts) for each pool and SNP.
Correct
Thanks very much for your response and clear explanations, much appreciate it! Best wishes,
Hi,
I run pcadapt
on 3 pools (each pool obtained following the Pool-seq approach) for which two PCs were retained. Then I checked the singular.values
of those PCs to calculate the percentage of the variance explained by each PC using the formula round(((res$singular.values)^2) / length(res$maf),digits = 2)
. To my surprise the sum of the two PCs is greater than 100% (PC1: 97.39%, PC2: 93.47%). Maybe this is a bug?
Code used:
> summary(res)
Length Class Mode
scores 6 -none- numeric
singular.values 2 -none- numeric
loadings 14157356 -none- numeric
zscores 14157356 -none- numeric
af 7078678 -none- numeric
maf 7078678 -none- numeric
chi2.stat 7078678 -none- numeric
stat 7078678 -none- numeric
gif 1 -none- numeric
pvalues 7078678 -none- numeric
pass 7078678 -none- logical
> res$scores
[,1] [,2]
[1,] 0.6481888 0.4965057
[2,] 0.1058922 -0.8096008
[3,] -0.7540810 0.3130951
> res$singular.values
[1] 26256.92 25722.40
> pct_pc <- round(((res$singular.values)^2) / length(res$maf),digits = 2)
> pct_pc
[1] 97.39 93.47
Thanks.
The formula to get the percentage of variance explained is more complex because allele frequencies are not scaled for Pool-seq data. This being said, percentage of variance explained is not useful here because the number of data points (pool) is too small w.r.t. to the number of variables to have a good estimate of these quantities.
OK, thanks very much for the clarification. Would still be valid to use pcadapt for the identification of SNPs differentiating these 3 populations?
Also, I would be interested on using the formula to estimate the percentage of variance explained per PC in Pool-seq data. Would it be possible to share it, please? Thanks
Would still be valid to use pcadapt for the identification of SNPs differentiating these 3 populations?
Sure
Also, I would be interested on using the formula to estimate the percentage of variance explained per >PC in Pool-seq data. Would it be possible to share it, please? Thanks
When I find it, I will give it to you.
Hi, I would like to run
pcadapt
on my Pool-seq data (7 million SNPs, 11 populations). Before using the whole dataset, I did a test run on a subset of loci, so I could have a sense of runtime and how much RAM memory is required. I first tried with 500.000 SNPs, 11 populations, 1000 ind. genotypes simulated per population, however I obtained this error message:Same error appears when simulating the genotype of 500 individuals per population, but no error (program works) with 200 individuals simulated/pop.
Questions:
pcadapt
,R
, or Unixulimits
? How could it be fixed?pcadapt
analyze a data.frame of 11 rows (populations) and 7 million columns (loci)?This is the code I used:
Thanks for your help!