Error using read.pcadapt(pool.data, type = "pool") - long vectors not supported yet

apfuentes commented 6 years ago

Hi, I would like to run pcadapt on my Pool-seq data (7 million SNPs, 11 populations). Before using the whole dataset, I did a test run on a subset of loci, so I could have a sense of runtime and how much RAM memory is required. I first tried with 500.000 SNPs, 11 populations, 1000 ind. genotypes simulated per population, however I obtained this error message:

Error in which(is.na(tmp)) :
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138

Same error appears when simulating the genotype of 500 individuals per population, but no error (program works) with 200 individuals simulated/pop.

Questions:

Is this an issue with pcadapt, R, or Unix ulimits? How could it be fixed?
Can pcadapt analyze a data.frame of 11 rows (populations) and 7 million columns (loci)?
Why is necessary to simulate 1000 individuals per population, as recommended in the tutorial? Could 200 or 500 individuals/pop. be enough? How this could be tested?

This is the code I used:

# Load libraries.
library(pcadapt)
library(data.table)

# Load data.
pool.data <- fread("/path/11pops_500K.AF", data.table=FALSE, header=TRUE, sep="\t", drop=c(1), stringsAsFactors=FALSE)

# Simulate individual genotypes from population allele frequencies.
filename <- read.pcadapt(pool.data,type="pool",local.env = TRUE, pop.sizes = c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000))

Thanks for your help!

mblumuga commented 6 years ago

We are updating the package and the new version does not require to simulate individuals for Pooled-seq data, it is simpler to use. Please read the answer I provide to another user: #10, it is explained how to install and run the 4.0 version.

apfuentes commented 6 years ago

Great, thanks! Is the current development version stable and safe to use? Is there an estimated date for the release of the new version?

Also, would the other functions of pcadapt (like PCA, score and scree plots, get.pc, communality stats, component-wise analysis) still apply to Pool-seq data?

mblumuga commented 6 years ago

It is safe to use, yes. Right now, it assumes that each pool corresponds to a population. It does not account for situations in which you have biological replicates. It is the same as using a PCA with K=10 axes (to distinguish 11 populations).

To answer to your 2nd question, other functions of pcadapt do not apply for Pool-seq data in the 4.0 version because we assume that K=(number of pool -1).

However, it is true that there might be situation where looking at the scree plot and the score plot is meaningful with Pool-seq data. Would you accept to send me your data or part of your data (subset of SNPs) so that I can have a look and see how we can improve the package?

apfuentes commented 6 years ago

Hi @mblumuga Thanks for your prompt response. For sure, please indicate me what is the best way to send you some data (privately), ~530 MB.

Regarding functionalities for Pool-seq, it would be very useful to identify outlier loci that are common to some populations but not to others (outliers that would be driving structuring among populations=pools), and being able to identify such loci for different PCs (substructuring?). So, maybe the scree and score plots could help on that?

Thanks!

mblumuga commented 6 years ago

Can you send me an email to michael.blum@univ-grenoble-alpes.fr. I will send you a link to upload your file.

apfuentes commented 6 years ago

Hi @mblumuga I want to share that I tried the new version of pcadapt on my whole Pool-seq dataset and it run fine! It only took ~1h to finish and it used ~12 GB RAM :)

I used the commands described in the tutorial to identify outlier loci in Pool-seq data:

plot(-log10(res$pvalues),pch=19,cex=.5)
padj <- p.adjust(res$pvalues,method="BH")
alpha <- 0.1
outliers <- which(padj < alpha)
length(outliers)

Some questions:

Does the list of outliers correspond to Axes 1-2? Or overall?
An alpha = 0.1 is used to identify outliers. Does this alpha correspond to the False Discovery Rate?
Would it be possible that the new pcadapt version includes these analyses for Pool-seq data?
- Test of the correlation between SNPs and the PCs; which SNPs contribute the most to some dimensions?
- Proportion of variance explained by each PC
- Proportion of the variance of a SNP explained by a given number of PCs
- Projection of the populations to specified PCs
- Possibility to evaluate if the loadings are clustered in a single or several genomic regions

Thanks.

mblumuga commented 6 years ago

Outliers correspond to markers with atypical differences of allele frequencies.

Does the list of outliers correspond to Axes 1-2? Or overall? Overall

Would it be possible that the new pcadapt version includes these analyses for Pool-seq data? Yes, it should be available in the version 4.0, I agree. We are working on it right now. I will get back to you very soon I hope when these features are included.

apfuentes commented 6 years ago

Fantastic news, THANKS A TON!!

mblumuga commented 6 years ago

The new version is available on https://github.com/privefl/pcadapt/. Score plot, scree plot, and get.pc functions available for Pooled-seq data. Will be glad to read your remarks if you have any.

apfuentes commented 6 years ago

Wonderful! Thanks for letting me know it. I will give it a try soon.

apfuentes commented 6 years ago

Hi, I would like to confirm that pcadapt worked well and run fast on my dataset of 7 million SNPs for 16 pools obtained with Pool-seq (whole genome resequencing of pooled DNA). Thanks a lot for improving the package for pool data :smile:

I have some questions: 1) How can the score plot be customized? In the tutorial for pool data it is not mentioned that a score plot can be obtained, but I generated one using the same code as in individual genotype data:

plot(res, option = "scores", pop = poplist.names)

poplist.names is a vector with pool names (in my case a pool is a population).

I would like to make these changes:

Include the % of the variation explained by each PC to the x and y axis
Modify the title
Set colors that reflect pool geographic location (corresponding to the order of rows in the file - from north to south = left to rigth, not alphabetic order as it seems to be set by default)
Choose a different color scheme and use different point shapes (e.g. empty or filled square, triangle, circle, etc...easier for prints in black and white)

Is there any way I could make these modifications directly using the ggplot package on the PCAdapt results?

2) How to obtain SNP names printed in the get.pc() output? When I run get.pc(res, outliers) to identify to which PC an outlier is associated to, I obtain this output: outliers_file_pcadapt I would like to obtain the SNP names (not the indexes) that in my data frame correspond to the column names with the format Super_Scaffold0-9370530.

3) Are these functions available for pool data?

Q-Q plot
Histogram of the test statistic and of the p-values
Component-wise genome scans

4) What is the difference between get.pc(res, outliers) and component-wise genome scans? I understand that by default the P-values are calculated for all the SNPs over all the retained PCs, and the component-wise genome scan calculates P-values for each PC. In which cases one is preferred over the other? I am interested on identifying outlier SNPs that discriminate among pools.

5) What numerical value generated by pcadapt is equivalent to cos2 in a Correspondence Analysis (CA)? I would like to evaluate the degree of association between rows (pools) and a particular axis (in other words, the quality of representation of the rows in each dimension). In a Correspondence Analysis this is equivalent to the squared cosine (cos2) or the squared correlations between rows/columns and a particular axis (PC). The values of the cos2 are comprised between 0 and 1. The sum of the cos2 for rows on all the CA dimensions is equal to one (explained here).

6) Input data for pool analysis I just wanted to confirm if the relative allele frequency required as input data for pool analysis is equals to REF allele counts/(REF + ALT counts) for each pool and SNP.

Thanks a lot!

mblumuga commented 6 years ago

Glad that pcadapt worked well and your pool data containing 7 million SNPs for 16 pools.

1) How can the score plot be customized? Not possible right now, you have PC scores in res$scores (res<-pcadapt(...)). So res$scores[,1] is the 1st PC, res$scores[,2] is the 2nd PC... You can then use ggplot as usual using these 2 vectors.

2) How to obtain SNP names printed in the get.pc() output? get.pc()[,1] give you the indices, if you have the vector containing all names, you can access to the desired names.

3) Are these functions available for pool data?

Q-Q plot Histogram of the test statistic and of the p-values Component-wise genome scans Yes pool.data <- system.file("extdata", "pool3pops", package = "pcadapt") filename <- read.pcadapt(pool.data, type = "pool") res <- pcadapt(filename) plot(res,option="qqplot") plot(res,option="stat.distribution") res <- pcadapt(filename,method="componentwise")

4) What is the difference between get.pc(res, outliers) and component-wise genome scans? I understand that by default the P-values are calculated for all the SNPs over all the retained PCs, >and the component-wise genome scan calculates P-values for each PC. Correct In which cases one is preferred over the other? I prefer the default option because you get a single list of P-values. Interpreting K list of P-values might be tedious.

5) What numerical value generated by pcadapt is equivalent to cos2 in a Correspondence Analysis (CA)? I do not know.

6) Input data for pool analysis I just wanted to confirm if the relative allele frequency required as input data for pool analysis is equals to REF allele counts/(REF + ALT counts) for each pool and SNP.

Correct

apfuentes commented 6 years ago

Thanks very much for your response and clear explanations, much appreciate it! Best wishes,

apfuentes commented 6 years ago

Hi, I run pcadapt on 3 pools (each pool obtained following the Pool-seq approach) for which two PCs were retained. Then I checked the singular.values of those PCs to calculate the percentage of the variance explained by each PC using the formula round(((res$singular.values)^2) / length(res$maf),digits = 2). To my surprise the sum of the two PCs is greater than 100% (PC1: 97.39%, PC2: 93.47%). Maybe this is a bug?

Code used:

> summary(res)
                Length   Class  Mode   
scores                 6 -none- numeric
singular.values        2 -none- numeric
loadings        14157356 -none- numeric
zscores         14157356 -none- numeric
af               7078678 -none- numeric
maf              7078678 -none- numeric
chi2.stat        7078678 -none- numeric
stat             7078678 -none- numeric
gif                    1 -none- numeric
pvalues          7078678 -none- numeric
pass             7078678 -none- logical
> res$scores
           [,1]       [,2]
[1,]  0.6481888  0.4965057
[2,]  0.1058922 -0.8096008
[3,] -0.7540810  0.3130951
> res$singular.values
[1] 26256.92 25722.40
> pct_pc <- round(((res$singular.values)^2) / length(res$maf),digits = 2)
> pct_pc
[1] 97.39 93.47

Thanks.

mblumuga commented 6 years ago

The formula to get the percentage of variance explained is more complex because allele frequencies are not scaled for Pool-seq data. This being said, percentage of variance explained is not useful here because the number of data points (pool) is too small w.r.t. to the number of variables to have a good estimate of these quantities.

apfuentes commented 6 years ago

OK, thanks very much for the clarification. Would still be valid to use pcadapt for the identification of SNPs differentiating these 3 populations?

Also, I would be interested on using the formula to estimate the percentage of variance explained per PC in Pool-seq data. Would it be possible to share it, please? Thanks

mblumuga commented 6 years ago

Would still be valid to use pcadapt for the identification of SNPs differentiating these 3 populations?

Sure

Also, I would be interested on using the formula to estimate the percentage of variance explained per >PC in Pool-seq data. Would it be possible to share it, please? Thanks

When I find it, I will give it to you.

bcm-uga / pcadapt

Error using read.pcadapt(pool.data, type = "pool") - long vectors not supported yet #11