andreyshabalin / MatrixEQTL

Matrix eQTL: Ultra fast eQTL analysis via large matrix operations
`findInterval` error in certain workflows #3

andrewejaffe closed 6 years ago

andrewejaffe commented 6 years ago

We've only seen this error when running MatrixEQTL on the exon-exon splice junction matrix. All of the junctions are sorted.

2018-01-25 14:16:09 running MatrixEQTL
Matching data files and location files
4360453of4360453 genes matched
58109566of58109566 SNPs matched

Task finished in 36.62 seconds
Reordering SNPs

Task finished in 286.049 seconds
Reordering genes

Task finished in 33.687 seconds
Processing covariates
Task finished in 0.00199999999995271 seconds
Processing gene expression data (imputation, residualization)
Task finished in 7.25300000000004 seconds
Creating output file(s)
Error in findInterval(sn.l, ge.r + cisDist + 1) : 
  'vec' must be sorted non-decreasingly and not contain NAs
Calls: Matrix_eQTL_main -> findInterval
Execution halted

We've seen this a few times with different datasets. Removing the more lowly expressed junctions usually lets the code run, but we'd like to troubleshoot the underlying issue.
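The error text comes from base R: `findInterval()` requires its `vec` argument to be sorted non-decreasingly and free of NAs. A minimal sketch of a pre-flight check on a location table is below. `check_positions()`, the column names, and the toy data are all hypothetical and not part of MatrixEQTL. The thread does not establish that unsorted or missing positions in the input are what triggered the error here, so this only rules an input problem in or out.

```r
# Hypothetical helper (not part of MatrixEQTL): for each chromosome, report
# whether a position column violates what findInterval() requires of `vec`.
check_positions <- function(pos_df, chr_col, pos_col) {
  by_chr <- split(pos_df[[pos_col]], pos_df[[chr_col]])
  data.frame(
    chr       = names(by_chr),
    has_na    = vapply(by_chr, anyNA, logical(1)),
    unsorted  = vapply(by_chr, function(x) is.unsorted(x, na.rm = TRUE), logical(1)),
    row.names = NULL
  )
}

# Toy location table, purely illustrative: features ordered by left end,
# while the right ends are not monotone.
toy <- data.frame(
  geneid = c("jx1", "jx2", "jx3"),
  chr    = c("chr21", "chr21", "chr21"),
  left   = c(100, 200, 300),
  right  = c(900, 250, 350)
)

left_report  <- check_positions(toy, "chr", "left")
right_report <- check_positions(toy, "chr", "right")
print(left_report)
print(right_report)

# Passing an unsorted `vec` to findInterval() raises an error; capture its message.
err <- tryCatch(
  findInterval(c(150, 400), toy$right + 1000 + 1),
  error = function(e) conditionMessage(e)
)
print(err)
```

Running the same check on the real gene and SNP location data frames (with their actual column names) before calling `Matrix_eQTL_main` would show whether any chromosome has NA or out-of-order positions.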

andreyshabalin commented 6 years ago

I'd like to investigate. Can you share the gene/SNP location files and gene expression and genotype data sets (with data zeroed out, to avoid breaking any data sharing rules)?
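One way to zero out values while keeping the shape and identifiers intact is sketched below. It assumes the data are held as a plain numeric matrix with row and column names; the object names are hypothetical, and the toy matrix stands in for real data.

```r
# Toy stand-in for an expression matrix (features x samples).
expr <- matrix(
  c(5.1, 0.2, 3.3, 4.8, 0.0, 2.9),
  nrow = 3,
  dimnames = list(c("jx1", "jx2", "jx3"), c("s1", "s2"))
)

# Assigning into expr_zeroed[] replaces every value but keeps
# the dimensions and the row/column names.
expr_zeroed <- expr
expr_zeroed[] <- 0

print(expr_zeroed)
```

The location tables carry positions rather than measurements, so they would be shared unchanged alongside the zeroed matrices.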

andrewejaffe commented 6 years ago

Here is an example dataset from chr21 that fails with the same error message:

> message(paste(Sys.time(), 'running MatrixEQTL'))
2018-01-26 14:03:59 running MatrixEQTL
> me <- Matrix_eQTL_main(snps = meth, gene = exprinfo,
+     output_file_name.cis = paste0('.', cpg, '_', opt$feature,
+         '.txt'), # invis file, temporary
+     pvOutputThreshold = 0, pvOutputThreshold.cis = 5e-4,
+     useModel = modelLINEAR,
+     snpspos = methpos, genepos = exprpos, cisDist = 1000)
Matching data files and location files
43436of43436 genes matched
271233of271233 SNPs matched

Task finished in 0.355999999999767 seconds
Reordering genes

Task finished in 9.52599999999984 seconds
Processing covariates
Task finished in 0.00300000000015643 seconds
Processing gene expression data (imputation, residualization)
Task finished in 0.219000000000051 seconds
Creating output file(s)
Error in findInterval(sn.l, ge.r + cisDist + 1) :
  'vec' must be sorted non-decreasingly and not contain NAs
In addition: Warning message:
In .Internal(gc(verbose, reset)) :
  closing unused connection 3 (.CpG_jx.txt)
> save(meth, exprinfo, methpos, exprpos, cpg, opt, file =

> options(width = 120)
> session_info()
Session info
 setting  value
 version  R version 3.4.3 Patched (2018-01-20 r74142)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
 date     2018-01-26

andreyshabalin commented 6 years ago

I see the error message, but not the data. Can you send me the data, perhaps directly?

andreyshabalin commented 6 years ago

I believe I've fixed the problem (commit). Please try it now. Thank you for your help.

lcolladotor commented 6 years ago

Hi Andrey,

I can report that your latest commit indeed fixed the problem.

Best, Leonardo

(PS I work with Andrew)