Long Vector Error when compiling results

AndrewSkelton commented 7 years ago

Hi,

I've run MatrixEQTL (SNPs = 561,963, Expression = 422,070, No of Samples = 87). At the last stage, after the 100% completion message, I get the following error:

Error in pmin.int(x, val) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138

I've tried both Rmd and plain R scripts to see whether that made a difference.

Any suggestions?

R Version: 3.3.2 (Sincere Pumpkin Patch)
RAM on machine: 256GB
OS: linux-gnu (x86_64) - Ubuntu 16.04.1 LTS

sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rcpp_0.12.9      ggplot2_2.2.1    dplyr_0.5.0      readr_1.0.0      MatrixEQTL_2.1.1

loaded via a namespace (and not attached):
 [1] assertthat_0.1   grid_3.3.2       R6_2.2.0         plyr_1.8.4       gtable_0.2.0     DBI_0.5-1        magrittr_1.5     scales_0.4.1     lazyeval_0.2.0   tools_3.3.2      munsell_0.4.3    colorspace_1.2-6 tibble_1.2

andreyshabalin commented 7 years ago

Hi Andrew,

Would you mind lowering the p-value threshold from 0.01 in the sample code to a lower one more appropriate for the large number of gene-SNP pairs you have?

AndrewSkelton commented 7 years ago

Sure - I did mean to change that based on the warning it gave. Mechanistically, is it the large number of results passing the p < 0.01 threshold that's causing the long vector error? Any ballpark suggestions for an appropriate cutoff - 1e-4, or maybe something more stringent like 1e-6?

andreyshabalin commented 7 years ago

Yes, my best guess is that the number of significant results was huge. I suggest setting the threshold to 1e-7 and checking the QQ-plot. You might also have inflation of the test statistics, which would further increase the number of significant results. The QQ-plot will tell.
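As a rough illustration of the QQ-plot check suggested above, the following base-R sketch compares observed p-values with their expected quantiles under the null and computes a genomic inflation factor. The vector `pvals` is a hypothetical stand-in (simulated uniform p-values here); this is not Matrix eQTL's own plotting code, and it assumes the full set of p-values is available rather than only those passing the output threshold. Matrix eQTL's documentation describes a `pvalue.hist = "qqplot"` option for collecting this information during the run, which is likely the more practical route for a data set of this size.

```r
# Hypothetical stand-in for the full vector of test p-values;
# simulated under the null so no inflation is expected.
set.seed(42)
pvals <- runif(1e5)

n <- length(pvals)
observed <- -log10(sort(pvals))   # observed p-values, smallest first
expected <- -log10(ppoints(n))    # matching quantiles under the null

# Genomic inflation factor: median observed 1-df chi-square statistic
# divided by the median expected under the null. Values well above 1
# suggest inflated test statistics.
lambda <- median(qchisq(pvals, df = 1, lower.tail = FALSE)) / qchisq(0.5, df = 1)
cat("Inflation factor (lambda):", round(lambda, 3), "\n")

# Draw the QQ-plot only in an interactive session.
if (interactive()) {
  plot(expected, observed,
       xlab = "Expected -log10(p)", ylab = "Observed -log10(p)",
       main = "QQ-plot of p-values")
  abline(0, 1, col = "red")       # points on this line indicate no inflation
}
```

Points that rise above the diagonal across the whole range, rather than only in the extreme tail, would point to inflation rather than to genuine signal.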

AndrewSkelton commented 7 years ago

Thanks - I'll report back soon

andreyshabalin commented 7 years ago

Excuse the brevity of my previous responses; I was commenting from a cell phone.

When we use a 1% p-value threshold for your data set, we should expect at least 561,963 x 422,070 x 0.01 tests to pass the threshold just by chance, which is 2,371,877,234. Vectors with 2^31 or more elements (about 2.1 billion) are called long vectors in R, and some operations are still not supported for them.

Matrix eQTL was designed expecting much smaller sets of top results. I would hope you do not actually need 2 billion top tests (let me know if I'm mistaken).
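The arithmetic above can be reproduced with a short base-R sketch. It uses the data set dimensions and the candidate thresholds quoted in this thread, and compares the number of tests expected to pass by chance against `.Machine$integer.max` (2^31 - 1), the largest length an ordinary (non-long) vector can have. The variable names are illustrative only.

```r
# Data set dimensions quoted in the thread (stored as doubles:
# their product would overflow R's 32-bit integer type).
n_snps  <- 561963
n_genes <- 422070
n_tests <- n_snps * n_genes

# P-value thresholds mentioned in the discussion.
thresholds <- c(1e-2, 1e-4, 1e-6, 1e-7)

# Number of tests expected to pass each threshold under the null.
expected_by_chance <- n_tests * thresholds

# Largest length of a non-long vector in R: 2^31 - 1.
max_short_length <- .Machine$integer.max

summary_table <- data.frame(
  threshold          = thresholds,
  expected_by_chance = expected_by_chance,
  needs_long_vector  = expected_by_chance > max_short_length
)
print(summary_table)
```

Only the 0.01 threshold pushes the expected number of results past the long-vector limit; at 1e-7 the chance expectation drops to roughly 24 thousand tests.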

AndrewSkelton commented 7 years ago

Not a problem - I appreciate the swift response.

It's currently at ~26% completion after ~2hrs, and the results thus far seem much more sensible. I agree that >2E9 top tests does not make sense at all, and is certainly not expected in the datasets I've got. Out of curiosity, will increasing the size of the slices reduce the computation time? (I realise it's a lot of tests to do in the first place!)

andreyshabalin commented 7 years ago

Matrix eQTL performs analysis via large matrix operations, processing one gene expression slice against one genotype slice at a time. While the performance clearly depends on the slice size, it severely deteriorates only for slice sizes below 100. The difference in performance between slice sizes of 1,000 and 5,000 is already very minor, I'd guess below 10%. Bigger slice sizes also cause larger memory requirements (to store and process the matrix product).

The major performance boost can be achieved by using a fast matrix multiplication library in R. See http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html#large
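To make the slice-by-slice idea concrete, here is a minimal base-R sketch of how all correlations between one gene expression slice and one genotype slice can be obtained from a single matrix product. It is an illustration under stated assumptions, not Matrix eQTL's actual implementation: the data are simulated, the slice sizes (200 and 300 rows) are arbitrary, and `standardize_rows` is a hypothetical helper. Because nearly all of the time goes into the matrix product, this is also the step a faster BLAS library would speed up.

```r
set.seed(1)
n_samples  <- 87                                   # sample count from the thread
gene_slice <- matrix(rnorm(200 * n_samples), nrow = 200)  # simulated expression slice
snp_slice  <- matrix(rnorm(300 * n_samples), nrow = 300)  # simulated genotype slice

# Hypothetical helper: centre each row and scale it to unit length,
# so that a dot product of two rows equals their Pearson correlation.
standardize_rows <- function(m) {
  m <- m - rowMeans(m)
  m / sqrt(rowSums(m^2))
}

g <- standardize_rows(gene_slice)
s <- standardize_rows(snp_slice)

# One matrix product yields all 200 x 300 gene-SNP correlations.
timing  <- system.time(r_slice <- tcrossprod(g, s))
elapsed <- timing[["elapsed"]]
cat("Matrix product took", elapsed, "seconds\n")

# Sanity check against R's built-in correlation.
print(all.equal(r_slice, cor(t(gene_slice), t(snp_slice))))
```

Timing the `tcrossprod` call with larger slices before and after switching BLAS libraries gives a quick indication of how much a full run might benefit.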

andreyshabalin commented 7 years ago

Microsoft R (formerly Revolution R) is available for Ubuntu 16.04: https://mran.microsoft.com/documents/rro/installation/

AndrewSkelton commented 7 years ago

All ran well. The p-value threshold was definitely the root of the problem, and definitely something I should have picked up on myself! Thanks for the assistance; it's a fantastic package, and looking at your code, it's really interesting how you've implemented it!

Thanks for the tip with BLAS; switching to OpenBLAS took the runtime from ~7.5hrs to ~4.25hrs, which is a huge performance increase!
