cozygene / TCA


Problem with parallel = TRUE and large datasets when running TCA #1

Closed · m-nabais closed this issue 5 years ago

m-nabais commented 5 years ago

I am having some difficulties making use of your implementation of the parallel R package when running TCA on a multi-core machine with a large dataset (N > 1400). The code seems to get stuck in the iteration process for this large dataset, but only when setting parallel = TRUE, not when setting parallel = FALSE. Could this be an error in the code, or am I perhaps missing something? I am able to run it smoothly on smaller datasets with either parallel = TRUE or parallel = FALSE.

Thanks in advance!

E-R commented 5 years ago

Can you please attach the exact command you're running before it gets stuck, and the output log file after setting debug = TRUE?

m-nabais commented 5 years ago

Hi E-R,

Yes, here's a screenshot of the output log file:

[Screenshot of the TCA log file output: Screen Shot 2019-08-06 at 10.46.57 AM]

The command I'm running is:

tca.mdl <- tca(X = betas, W = ctp, C1 = c1, C2 = c2, parallel = TRUE, log_file = "RA_GSE42861.tca.log", debug = TRUE)

Hope this helps?

Thanks!

E-R commented 5 years ago

The fact that you can run the same code on smaller data suggests that the memory available to your execution is insufficient (although the error in the log does not reflect that).

Note that the TCA model assumes independence between features (sites). You can therefore always break your data matrix into chunks, each containing a different set of features (e.g., in the case of methylation you can break it up by chromosome), and run TCA on each chunk - this will require much less memory for a given chunk.

One thing to keep in mind - breaking your data into chunks as described above only makes sense if you are not learning W (i.e., if you set refit_W = FALSE). If refit_W = TRUE, the learning process creates dependencies between the different features, so we cannot break the data into chunks. We can overcome this by first learning W from a small subset of features that are informative with respect to W (e.g., in the case of methylation, a couple of hundred features that are highly informative for the cell-type composition W is typically enough) - see the parameters refit_W.features and refit_W.sparsity for more details. Given the new estimate of W, we can then break the data into chunks and use the new W with refit_W = FALSE for each chunk.
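For concreteness, here is a minimal sketch of that workflow, reusing the variable names from your command (betas, ctp, c1, c2); informative_sites and chr_annotation are placeholders for whatever feature selection and chromosome annotation you use:

```r
library(TCA)

# Step 1 (only if W needs refitting): estimate W once, using a small set of
# sites that are informative for the cell-type composition (see also the
# refit_W.features and refit_W.sparsity arguments).
mdl_w <- tca(X = betas[informative_sites, ], W = ctp, refit_W = TRUE)
W_hat <- mdl_w$W   # the refitted cell-type proportions

# Step 2: split the sites into chunks (e.g., by chromosome) and fit each
# chunk separately with refit_W = FALSE, so each fit needs far less memory.
chunks <- split(rownames(betas), chr_annotation[rownames(betas)])
fits <- lapply(chunks, function(sites) {
  tca(X = betas[sites, , drop = FALSE], W = W_hat,
      C1 = c1, C2 = c2, refit_W = FALSE)
})
```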

hhp94 commented 1 year ago

Dear Professor Rahmani,

Following your comment above, I'm attempting to write a wrapper that runs tca() over chunks of the data matrix X (CpG sites). I noticed that in tca.fit_means_vars (./R/model_fit.R) the estimation of tau_hat uses information from X, so each fit of tca() to a chunk of the data matrix X returns its own estimate of tau_hat. Does this mean there are dependencies between the sites induced by tau_hat?

I generated data with test_data(30, 200000, 6, 1, 1, 0.01) and compared a sequential run against a parallel run in which X was split into 21 chunks. The parallel run returned 21 values of tau_hat, all very close to the true value of 0.01. Apart from tau_hat, the estimates are correlated > 0.999 between the two methods. However, I'm not sure whether chunking is equally safe with real data.
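For reference, a simplified sketch of the kind of comparison I ran (not my exact script, and I'm assuming here that test_data() returns the simulated X, W, C1, and C2 under those names):

```r
library(TCA)

sim <- test_data(30, 200000, 6, 1, 1, 0.01)

# Sequential run: all 200,000 sites at once.
fit_full <- tca(X = sim$X, W = sim$W, C1 = sim$C1, C2 = sim$C2)

# Chunked run: split the sites into 21 chunks and fit each one separately.
chunk_ids  <- cut(seq_len(nrow(sim$X)), breaks = 21, labels = FALSE)
idx_chunks <- split(seq_len(nrow(sim$X)), chunk_ids)
fits <- lapply(idx_chunks, function(idx) {
  tca(X = sim$X[idx, , drop = FALSE], W = sim$W, C1 = sim$C1, C2 = sim$C2)
})

# Each chunk fit returns its own tau_hat; all 21 were very close to 0.01.
sapply(fits, `[[`, "tau_hat")
```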

Thank you so much, and looking forward to hearing back from you.

Sincerely,
Hung

E-R commented 1 year ago

tau models a component of iid variation that affects all CpGs in the data, so the dependency between all features comes from using all CpGs to estimate tau. However, we probably don't need a very large number of CpGs to estimate just a single variable, so in cases where splitting the data is needed for practical reasons I think it's fine to neglect this dependency. If you want to be extra cautious, I would randomly split the CpGs into chunks (rather than by chromosome number or any other arbitrary criterion). You can also use the tau argument to provide the models of all chunks with the same tau if you want to guarantee consistency between the models (setting it, for example, to the tau_hat estimate from the first model), but I don't think this is necessary or better.
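For example, something along these lines (a sketch only; betas, W_hat, c1, and c2 carry over from the earlier sketch, and I'm assuming the estimate is returned as tau_hat):

```r
# Randomly assign CpGs to chunks rather than splitting by chromosome.
set.seed(1)
site_ids <- sample(rownames(betas))
chunks   <- split(site_ids, cut(seq_along(site_ids), breaks = 20, labels = FALSE))

# Fit the first chunk, then pass its tau estimate to the remaining chunks via
# the tau argument (optional; only if you want the same tau in every model).
first_fit <- tca(X = betas[chunks[[1]], , drop = FALSE], W = W_hat,
                 C1 = c1, C2 = c2)
rest_fits <- lapply(chunks[-1], function(sites) {
  tca(X = betas[sites, , drop = FALSE], W = W_hat, C1 = c1, C2 = c2,
      tau = first_fit$tau_hat)
})
```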

hhp94 commented 1 year ago

Dear Professor Rahmani,

I've written an initial wrapper for running tca() over chunks of X in my fork of TCA. In combination with some optimizations, I achieved a 13.3x speed-up with tca(), and the returned estimates have cor > 0.99 with those of a sequential run.

The optimizations are as follows. More details are in this report

  1. replacing lm() and anova() calls with RcppEigen::fastLm() and a minimal implementation of the partial F-test (a rough sketch follows this list). lm.fit() can be used instead of RcppEigen::fastLm() if minimal dependencies are desired.
  2. replacing pracma::repmat() calls with MESS::repmat(), which is significantly faster and more resource efficient. This change adds up because the fitting procedure calls repmat() repeatedly.
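As a rough illustration of item 1 (not the actual code in my fork, and assuming fastLm() returns residuals and df.residual in its output, as I believe it does), the partial F-test can be computed directly from two nested fits:

```r
library(RcppEigen)

# Minimal partial F-test from two nested fastLm fits.
partial_f_test <- function(fit_full, fit_reduced) {
  rss_full    <- sum(fit_full$residuals^2)
  rss_reduced <- sum(fit_reduced$residuals^2)
  df_full     <- fit_full$df.residual
  df_diff     <- fit_reduced$df.residual - df_full
  f_stat <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
  pf(f_stat, df_diff, df_full, lower.tail = FALSE)   # p-value
}

# Toy example: does x2 explain variance beyond x1?
set.seed(1)
d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
fit_full    <- fastLm(y ~ x1 + x2, data = d)
fit_reduced <- fastLm(y ~ x1, data = d)
partial_f_test(fit_full, fit_reduced)
```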

The wrappers are as follows. Details about the behavior of tca_split() are in this report

  1. split_input() takes the X matrix, shuffles the rows as you suggested, and splits it into roughly equal chunks. I also added a print method (to prevent RStudio from hanging when printing a list of huge matrices) and a dim method.
  2. tca_split() is a wrapper around tca() that accepts the output of split_input() in place of X. The parallel back end comes from the {future} package, to facilitate parallelization over multiple nodes on HPC with {future.batchtools}; it can also be switched to {parallel}. A generic sketch of this pattern follows the list.
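The parallel pattern underneath tca_split() is roughly the following (illustrative only, not my fork's actual API; chunks, betas, W_hat, c1, and c2 are as in the earlier sketches):

```r
library(future)
library(future.apply)
library(TCA)

# Pick a backend: local multisession here; on an HPC cluster this could be
# swapped for a {future.batchtools} plan (e.g., batchtools_slurm).
plan(multisession, workers = 4)

fits <- future_lapply(chunks, function(sites) {
  tca(X = betas[sites, , drop = FALSE], W = W_hat,
      C1 = c1, C2 = c2, refit_W = FALSE)
}, future.seed = TRUE)
```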

Thank you so much for your guidance. Reading your code has been a learning experience for me. I'm still relatively new to R and open-source software, so please excuse any messiness. I can open PRs if needed, since I also have most of the unit tests written up.

Sincerely,

E-R commented 1 year ago

This sounds great! I'm traveling this week. Let me take a look into your fork when I'm back (next week).