andreyshabalin / MatrixEQTL

Matrix eQTL: Ultra fast eQTL analysis via large matrix operations
58 stars 17 forks source link

'long vectors not supported yet' on larger dataset #6

Closed jmal0403 closed 6 years ago

jmal0403 commented 6 years ago

Hello,

I've had a lot of success running MatrixEQTL on fairly large datasets in the past. Currently, I'm attempting to run "Matrix_eQTL_engine" for ~600k SNPs and ~30k genes with 500 subjects (CentoOS with 256GB RAM). I did read your thread with @AndrewSkelton, and my first reaction was to lower the p-value. I tried lowering to 1e-4,1e-6, and 1e-9; however, I still get the following error: long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138. I also reduced my dataset down to 100k SNPs and 20k genes and that ran with no issues. So, I'm really not sure how to proceed.

Please let me know if I can provide any additional information. Any help is greatly appreciated.

Best, John

Here is the sessionInfo:

Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS: /usr/local/lib64/R/lib/libRblas.so LAPACK: /usr/local/lib64/R/lib/libRlapack.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] vegan_2.4-6 lattice_0.20-35
[3] permute_0.9-4 snpStats_1.26.0
[5] Matrix_1.2-9 survival_2.41-3
[7] plyr_1.8.4 VariantAnnotation_1.22.0
[9] Rsamtools_1.28.0 Biostrings_2.44.0
[11] XVector_0.16.0 SummarizedExperiment_1.6.1 [13] DelayedArray_0.2.0 matrixStats_0.53.1
[15] Biobase_2.36.1 GenomicRanges_1.28.1
[17] GenomeInfoDb_1.12.0 IRanges_2.10.0
[19] S4Vectors_0.14.0 BiocGenerics_0.22.0
[21] MatrixEQTL_2.2

loaded via a namespace (and not attached): [1] Rcpp_0.12.17 compiler_3.4.0 GenomicFeatures_1.28.0
[4] prettyunits_1.0.2 bitops_1.0-6 tools_3.4.0
[7] zlibbioc_1.22.0 progress_1.1.2 biomaRt_2.35.11
[10] digest_0.6.15 nlme_3.1-131 BSgenome_1.44.0
[13] RSQLite_1.1-2 memoise_1.1.0 mgcv_1.8-17
[16] DBI_0.6-1 GenomeInfoDbData_0.99.0 cluster_2.0.6
[19] rtracklayer_1.36.6 stringr_1.3.1 httr_1.3.1
[22] grid_3.4.0 R6_2.2.2 AnnotationDbi_1.38.2
[25] XML_3.98-1.10 BiocParallel_1.10.1 magrittr_1.5
[28] MASS_7.3-47 splines_3.4.0 GenomicAlignments_1.12.0 [31] assertthat_0.2.0 stringi_1.2.3 RCurl_1.95-4.10

andreyshabalin commented 6 years ago

Hi John,

I've simulated a data set of the same size as yours and ran Matrix eQTL on it.

It ran without a problem.

library("MatrixEQTL");

nsamp =    500;
ngene =  30000;
nsnps = 600000;

### Load gene expression info
genemat = matrix(runif(ngene*nsamp), ncol = nsamp)
gene = SlicedData$new(genemat);
gene$ResliceCombined(sliceSize = 5000)

### Load covariates
cvrtmat = matrix(runif(20*nsamp), ncol = nsamp)
cvrt = SlicedData$new(cvrtmat);

### Load genotype info
snpsmat = matrix(runif(nsnps*nsamp), ncol = nsamp)
snps = SlicedData$new(snpsmat);
snps$ResliceCombined(sliceSize = 5000)

### Run Matrix eQTL
me = Matrix_eQTL_main(
    snps = snps,
    gene = gene,
    cvrt = cvrt,
    output_file_name = NULL,
    pvOutputThreshold = 1e-6,
    useModel = modelLINEAR,
    errorCovariance = numeric(),
    verbose = TRUE,
    pvalue.hist = FALSE);

The output:

# Processing covariates
# Task finished in 0.02 seconds
# Processing gene expression data (imputation, residualization)
# Task finished in 0.76 seconds
# Creating output file(s)
# Task finished in 0 seconds
# Performing eQTL analysis
#  0.13% done, 25 eQTLs
# ...
# 100.00% done, 18,817 eQTLs
# Task finished in 492.86 seconds
andreyshabalin commented 6 years ago

John,

Can you show me the code and full output with the error message? How many significant eQTLs do you get in the process?

Andrey

jmal0403 commented 6 years ago

Andrey,

First, thanks for getting back to me and I apologize for the delay. This is a busy time of year for me. Great new! It's working now. Let me show you what I was doing and what changed:

Genotype Matrix

snps = SlicedData$new(SNPs); snps$fileSliceSize = 1000; snps$fileOmitCharacters = "NA";

Expression Matrix

genes = SlicedData$new(datExpr.mat); genes$fileOmitCharacters = "NA"; genes$fileSliceSize = 1000;

So, I just changed the following and reran: genes$ResliceCombined(sliceSize = 5000) snps$ResliceCombined(sliceSize = 5000)

So, was I using the wrong fileSlice() attribute?

Thanks so much! John

andreyshabalin commented 6 years ago

Writing the line

genes$fileSliceSize = 1000;

does not 'reslice' the matrix. You had to call $ResliceCombined() explicitly.

jmal0403 commented 6 years ago

I see. Thanks so much for your help. Be well, John