andreyshabalin / MatrixEQTL

Matrix eQTL: Ultra fast eQTL analysis via large matrix operations
53 stars 16 forks source link

Thousands of results are NA #4

Closed lcolladotor closed 6 years ago

lcolladotor commented 6 years ago

Hi,

Using MatrixEQTL version 2.2 I'm running into a weird result that might be resolved by lowering the p-value threshold (as in https://github.com/andreyshabalin/MatrixEQTL/issues/1). Still, maybe you've seen this before and have some idea on what could be done.

The issue is that many of the results in me$cis$eqtls (I saved the output of Matrix_eQTL_main() to the object me) are NA, including the snp and gene ids. So I can't check if either the snp or the gene id has some weird property.

Weird results

> table(!is.na(me$cis$eqtls$snps))

  FALSE    TRUE
9431253   23828
> head(me$cis$eqtls[is.na(me$cis$eqtls$snps), ])
      snps gene statistic pvalue FDR beta
23829 <NA> <NA>        NA     NA  NA   NA
23830 <NA> <NA>        NA     NA  NA   NA
23831 <NA> <NA>        NA     NA  NA   NA
23832 <NA> <NA>        NA     NA  NA   NA
23833 <NA> <NA>        NA     NA  NA   NA
23834 <NA> <NA>        NA     NA  NA   NA

Call used:

## Call used:
me <- Matrix_eQTL_main(snps = meth, gene = exprinfo, 
    output_file_name.cis = paste0('.', cpg, '_', opt$feature,
        '.txt'), # invis file, temporary
    pvOutputThreshold = 0, pvOutputThreshold.cis = 5e-4, 
    useModel = modelLINEAR,
    snpspos = methpos, genepos = exprpos, cisDist = 1000)

File checks

> table(is.na(exprpos$spliceid))

FALSE
60252
> table(is.na(exprpos$chr))

FALSE
60252
> sapply(exprpos, class)
   spliceid         chr       start         end
"character" "character"   "integer"   "integer"

> table(is.na(methpos$cname))

   FALSE
18664892
> table(is.na(methpos$chr))

   FALSE
18664892
> sapply(methpos, class)
      cname         chr         pos
"character" "character"   "integer"

> identical(methpos$cname, rownames(meth))
[1] TRUE
> identical(exprpos$spliceid, rownames(exprinfo))
[1] TRUE

The code works as expected with one methylation data set but not in a second one: both use the same gene information and have the same number of samples. The second one has more non-zero entries in the meth object and we had to use a lower file slice size (meth$fileSliceSize <- 300 instead of 2000) to keep the memory from blowing up. That's why I'm guessing that lowering the p-value threshold could help. If needed, I can see if this happens with the data from a small chromosome and send you the files via email (like we did for https://github.com/andreyshabalin/MatrixEQTL/issues/3).

Best, Leonardo

andreyshabalin commented 6 years ago

Leonardo,

Having any reproducible example which I could use to investigate would be great.

Andrey

lcolladotor commented 6 years ago

Hi Andrey,

Here's a quick update. I tried reproducing the above error using the data for:

and it worked without problems in all these scenarios. I'm re-running the full jobs using all the data (they've been running for the past 36 hours) to see if I can reproduce the error in that case. However, I now have a suspicion to what might have been the real issue. While I was running the jobs that lead to the reported error, I updated my MatrixEQTL installation (to check this other issue https://github.com/andreyshabalin/MatrixEQTL/issues/3). So my guess right now is that re-installing MatrixEQTL lead to this situation.

In any case, out of the current 4 jobs that I'm running one of them reports that it's 60% done, so hopefully we'll know the answer in 2 days or less.

Best, Leonardo

lcolladotor commented 6 years ago

Hi again,

Sadly, the job ended with the same result despite using the latest version from GitHub. This time I did include a call to warnings() after running Matrix_eQTL_main().

99.50% done, 9,455,047 cis-eQTLs
100.00% done, 9,455,081 cis-eQTLs
Task finished in 262481.375 seconds

There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In (sn.l[x]:(sn.r[x] - 1)) * nrow(statistic) :
  NAs produced by integer overflow
2: In (sn.l[x]:(sn.r[x] - 1)) * nrow(statistic) + x :
  NAs produced by integer overflow
## Plus more of the same

[1] "neqtls and ntests"
[1] 9455081
[1] 14851692

plus another check I added:

[1] "Number of meQTLs without row ids"

  FALSE    TRUE
9431253   23828

from running

print('neqtls and ntests')
me$cis$neqtls
me$cis$ntests

print('Number of meQTLs without row ids')
table(!is.na(me$cis$eqtls$snps))

So it seems like this issue can only be reproduced with the full data and that has to do with integer overflows.

I can report that if I subset all chrs to certain genes (about 26% of the original 60,252) and certain positions of the genome (about 37% of the original 18,664,892), it all works well even despite using a higher pvOutputThreshold.cis (0.01 instead of 5e-4):

[1] "neqtls and ntests"
[1] 237525
[1] 7842494
[1] "Number of meQTLs without row ids"
  TRUE
237525

So maybe there's a limit on the number of neqtls or ntests.

Best, Leonardo

Session info

Main part:

 MatrixEQTL * 2.2     2018-01-26 Github (andreyshabalin/MatrixEQTL@454aacd)

Full list:

> library(MatrixEQTL)
> library(devtools)
> options(width = 120)
> session_info()
Session info ----------------------------------------------------------------------------------------------------------
 setting  value
 version  R version 3.4.3 Patched (2018-01-20 r74142)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 tz       US/Eastern
 date     2018-02-02

Packages --------------------------------------------------------------------------------------------------------------
 package    * version date       source
 base       * 3.4.3   2018-01-20 local
 colorout   * 1.1-2   2017-08-10 Github (jalvesaq/colorout@020a14d)
 compiler     3.4.3   2018-01-20 local
 datasets   * 3.4.3   2018-01-20 local
 devtools   * 1.13.4  2017-11-09 CRAN (R 3.4.2)
 digest       0.6.14  2018-01-14 CRAN (R 3.4.2)
 graphics   * 3.4.3   2018-01-20 local
 grDevices  * 3.4.3   2018-01-20 local
 MatrixEQTL * 2.2     2018-01-26 Github (andreyshabalin/MatrixEQTL@454aacd)
 memoise      1.1.0   2017-04-21 CRAN (R 3.4.1)
 methods    * 3.4.3   2018-01-20 local
 stats      * 3.4.3   2018-01-20 local
 tools        3.4.3   2018-01-20 local
 utils      * 3.4.3   2018-01-20 local
 withr        2.1.1   2017-12-19 CRAN (R 3.4.2)
andreyshabalin commented 6 years ago

What is the slicesize you are using loading 'gene' and 'snps'?

andreyshabalin commented 6 years ago

I've committed a small update fixing this integer overflow. I hope it fixed the whole problem.

lcolladotor commented 6 years ago

Hi Andrey,

The file slice size was 300 for both the 'gene' and the 'snps'. Also, thanks for https://github.com/andreyshabalin/MatrixEQTL/commit/b9a9f0197bce3ee39bbfd6ebb59b47f19f6a4509. I re-submitted my 4 jobs and will report back once at least 1 of them is done.

Best, Leonardo

andreyshabalin commented 6 years ago

Oh, 300 is way too small. I'd suggest at least 3000. It should not affect the memory consumption too much.

In any case, I'll be waiting for your updates.

andreyshabalin commented 6 years ago

With the narrow local-band (cisDist = 1000) you use, I'd say slice size of 300 is reasonable.

lcolladotor commented 6 years ago

Hi Andrey,

https://github.com/andreyshabalin/MatrixEQTL/commit/b9a9f0197bce3ee39bbfd6ebb59b47f19f6a4509 fixed this issue, thanks!

Best, Leonardo

lcolladotor commented 5 years ago

Thank you again @andreyshabalin!

https://twitter.com/fellgernon/status/1135944877554798592