AndrewSkelton closed this issue 7 years ago
Hi Andrew,
Would you mind lowering the p-value threshold from the 0.01 used in the sample code to one more appropriate for the large number of gene-SNP pairs you have?
Sure - I did mean to change that based on the warning it gave. Mechanistically, is it the large number of results passing the p < 0.01 threshold that's causing the long vector error? Any ballpark suggestions on an appropriate cutoff? Something like 1e-4, or maybe a bit more stringent at 1e-6?
Yes, my best guess is that the number of significant results was huge. I suggest setting the threshold to 1e-7 and checking the QQ-plot. You might also have inflation of the test statistics, which would further increase the number of significant results. The QQ-plot will tell.
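For readers who want to see what "checking the QQ-plot" involves, here is a minimal base-R sketch. It assumes you have a plain numeric vector of p-values; the `pvals` vector below is simulated under the null and is not Matrix eQTL output, and the helper names `qq_points` and `inflation_factor` are made up for illustration. The inflation factor is the common median-based one (median 1-df chi-square statistic divided by its null median), which may or may not be the diagnostic the package author has in mind.

```r
# Simulated null p-values stand in for real results (assumption).
set.seed(1)
pvals <- runif(1e5)

# Expected vs. observed -log10(p) pairs, i.e. the points of a QQ-plot.
qq_points <- function(p) {
  p <- sort(p)
  data.frame(
    expected = -log10(ppoints(length(p))),
    observed = -log10(p)
  )
}

# Median-based inflation factor: values well above 1 suggest inflation.
inflation_factor <- function(p) {
  median(qchisq(p, df = 1, lower.tail = FALSE)) / qchisq(0.5, df = 1)
}

qq <- qq_points(pvals)
lambda <- inflation_factor(pvals)

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(qq$expected, qq$observed,
       xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
  abline(0, 1)
}
```

Points that lift off the diagonal across most of the range, rather than only in the tail, are the usual sign of inflated test statistics.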
Thanks - I'll report back soon
Excuse the brevity of my previous responses, I was commenting from a cell phone.
When we use a 1% p-value threshold for your data set, we should expect at least 561,963 x 422,070 x 0.01 tests passing the threshold just by chance, which is 2,371,877,234. Vectors with 2^31 or more elements (about 2.1 billion) are called long vectors in R, and some operations are still not supported for them.
Matrix eQTL was designed expecting much smaller sets of top results. I would hope you do not actually need 2 billion top tests (let me know if I'm mistaken).
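The arithmetic above is easy to reproduce for other cutoffs. The sketch below only uses the SNP and gene counts quoted in this thread; the variable and function names are illustrative, and the counts are the number of tests expected to pass purely by chance under the null, not a prediction of what Matrix eQTL will report.

```r
n_snps  <- 561963
n_genes <- 422070

# Doubles, not integers, so the product does not overflow.
n_tests <- n_snps * n_genes

# Number of tests expected to pass a p-value threshold by chance alone.
expected_hits <- function(threshold) n_tests * threshold

# Largest length of an ordinary (non-long) vector in R.
long_vector_limit <- .Machine$integer.max  # 2^31 - 1

thresholds <- c(1e-2, 1e-4, 1e-6, 1e-7)
summary_table <- data.frame(
  threshold          = thresholds,
  expected_by_chance = expected_hits(thresholds),
  exceeds_limit      = expected_hits(thresholds) > long_vector_limit
)
print(summary_table)
```

Of the four thresholds listed, only 0.01 pushes the chance-only count past the ordinary vector limit.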
Not a problem - I appreciate the swift response.
It's currently running at ~26% completion after ~2hrs, and the results thus far seem much more sensible. I agree that >2E9 top tests does not make sense at all, and certainly is not expected in the datasets I've got. Out of curiosity, will increasing the size of the slices reduce the computation time? (I realise it's a lot of tests to do in the first place!)
Matrix eQTL performs analysis via large matrix operations, by considering sequentially one gene expression slice vs. one genotype slice. While the performance clearly depends on the slice size, it severely deteriorates only for slice sizes below 100. The difference in performance between slice sizes of 1,000 and 5,000 is already very minor, I'd guess below 10%. Bigger slice sizes also cause larger memory requirements (to store and process the matrix product).
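To put rough numbers on the slice-size trade-off, here is a back-of-the-envelope sketch. It assumes the same slice size is used for both SNPs and genes and that the dominant cost is a single slice-by-slice matrix of 8-byte doubles; the real memory footprint inside Matrix eQTL is likely larger, and `slice_cost` is a hypothetical helper, not part of the package.

```r
n_snps  <- 561963
n_genes <- 422070

# Size of one slice-by-slice product matrix and the number of slice pairs
# that have to be processed for a given slice size.
slice_cost <- function(slice_size, bytes_per_value = 8) {
  data.frame(
    slice_size  = slice_size,
    product_mb  = slice_size^2 * bytes_per_value / 2^20,
    slice_pairs = ceiling(n_snps / slice_size) * ceiling(n_genes / slice_size)
  )
}

costs <- slice_cost(c(100, 1000, 5000))
print(costs)
```

Memory per product grows with the square of the slice size while the number of slice pairs shrinks at the same rate, so the total arithmetic stays the same; only the per-slice overhead changes.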
The major performance boost can be achieved by using a fast matrix multiplication library in R. See http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html#large
Microsoft R (formerly Revolution R) is available for Ubuntu 16.04: https://mran.microsoft.com/documents/rro/installation/
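A quick way to check whether a different BLAS helps on your machine is to time a matrix product of roughly the shape involved here. The sketch below is an assumption-laden stand-in: it uses random data, a slice size of 1,000, the 87 samples mentioned in this thread, and an arbitrary 20 repetitions. It is not the computation Matrix eQTL actually runs, only a product with similar dimensions.

```r
set.seed(1)
n_samples  <- 87
slice_size <- 1000

# Random stand-ins for one gene slice and one SNP slice (rows = features).
gene_slice <- matrix(rnorm(slice_size * n_samples), nrow = slice_size)
snp_slice  <- matrix(rnorm(slice_size * n_samples), nrow = slice_size)

# Time repeated slice-by-slice products: (1000 x 87) %*% (87 x 1000).
timing <- system.time(
  for (i in 1:20) product <- tcrossprod(gene_slice, snp_slice)
)
elapsed <- timing[["elapsed"]]
print(elapsed)
```

Run it once with the reference BLAS and once after switching libraries; the ratio of the two elapsed times gives a rough idea of the speed-up to expect.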
All ran well. The p-value threshold was definitely the root of the problem, and definitely something I should have picked up on myself! Thanks for the assistance. It's a fantastic package, and looking at your code, it's really interesting how you've implemented it!
Thanks for the tip with BLAS. Switching to openblas made the runtime go from ~7.5hrs to ~4.25hrs, which is a huge performance increase!
Hi,
I've run MatrixEQTL (SNPs = 561,963, Expression = 422,070, No of Samples = 87). At the last stage, after the 100% completion message, I get the following error:
I've tried in Rmd and R scripts, to see if that made a difference.
Any suggestions?
R Version: 3.3.2 (Sincere Pumpkin Patch)
RAM on machine: 256GB
OS: linux-gnu (x86_64) - Ubuntu 16.04.1 LTS
sessionInfo()