Redundancy in output - Githubissues

sthawinke commented 4 years ago

@AlemuTA Here I am again. I have a memory problem in my simulations when using result.format = "list". I think a lot of the returned elements are redundant, as they are basically the same for every MC run, e.g. the column data with batch and group information. More importantly in terms of memory, the density information in SPsim.est.densities is largely redundant, as it is the same every time for each gene, right? I can make these changes if you agree. I would basically split the result into a general part and a list with MC run-specific information.
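The proposed split can be illustrated with a minimal sketch. This is not SPsimSeq code: every name below (`col_data`, `densities`, `simulate_counts`, `general`, `runs`) is hypothetical, and the Poisson counts are only a stand-in for simulated data. It compares a layout in which each MC run carries its own copy of the shared parts against one that stores them once.

```r
# Hypothetical illustration; none of these names come from SPsimSeq.
set.seed(1)
n_genes <- 50; n_samples <- 10; n_runs <- 5

# Parts that are identical for every MC run
col_data  <- data.frame(Batch = rep(1:2, each = 5), Group = rep(1:2, times = 5))
densities <- replicate(n_genes, rnorm(100), simplify = FALSE)  # stand-in for per-gene density estimates

# Stand-in for the run-specific simulated count matrix
simulate_counts <- function() {
  matrix(rpois(n_genes * n_samples, lambda = 10), nrow = n_genes)
}

# Redundant layout: every run carries its own copy of the shared parts
redundant <- lapply(seq_len(n_runs), function(i) {
  list(counts = simulate_counts(), colData = col_data, densities = densities)
})

# Split layout: shared parts stored once, plus a list of run-specific parts
split_out <- list(
  general = list(colData = col_data, densities = densities),
  runs    = lapply(seq_len(n_runs), function(i) list(counts = simulate_counts()))
)

# object.size() counts each list element separately, even when R shares
# the underlying memory between copies
size_redundant <- object.size(redundant)
size_split     <- object.size(split_out)
```

In this toy setup the density list dominates the size, so storing it once removes most of the reported footprint; how much is saved in practice depends on the number of genes and MC runs.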

AlemuTA commented 4 years ago

Yes, the estimated densities remain the same for each gene across simulations. I was already thinking about minimizing such outputs or at least avoiding the redundancies. Since the current version (v2.0.1) is mainly for your question, I tried to make the function return all available results. If you have the solution already, please send me a pull request.

sthawinke commented 4 years ago

Hi, just to make sure: is it true that batch effect estimation and density estimation currently occur within the MC loop? Is there a specific reason for this? Otherwise, estimating these things only once could provide a major speed-up.

AlemuTA commented 4 years ago

Hi Stijn, I'm not sure I understood your question correctly. First of all, there is no batch effect estimation. SPsimSeq estimates the densities separately for each batch when the source data has samples from multiple batches. Note that this applies to all features (null and predictor/DE). This procedure occurs only once, not for every MC run.

sthawinke commented 4 years ago

But in the following statement in SPsimSeq.R on lines 277-283:

```r
# estimate batch specific parameters
est.list <- lapply(sel.genes, function(ii){
  #print(i)
  gene.parm.est(cpm.data.i = cpm.data[ii, ], batch = batch, group = group,
                null.group = null.group, sub.batchs = sub.batchs, de.ind = DE.ind[ii],
                model.zero.prob = model.zero.prob, min.val = min.val, w=w, ...)
})
```

gene-wise parameters are being estimated, right?

It is inside the loop starting on line 267, so I wonder if it could be moved outside, as it concerns parameter estimation.

AlemuTA commented 4 years ago

Whoops! I see what you mean. I think you can move it out of that loop. However, that may require estimating the parameters for every feature in the source data.
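The trade-off mentioned here can be sketched as follows. This is a hedged illustration, not the package's implementation: `estimate_gene_params()` is a hypothetical stand-in for `gene.parm.est()`, and the data, the gene selection, and all sizes are made up. It counts how often the estimator is called when it sits inside the MC loop versus when it is run once for every gene and the results are subset per run.

```r
# Hypothetical stand-ins; estimate_gene_params() is not the package's gene.parm.est().
set.seed(1)
n_genes <- 200; n_samples <- 20; n_runs <- 10; n_sel <- 50
cpm_data <- matrix(rexp(n_genes * n_samples), nrow = n_genes)

n_calls <- 0
estimate_gene_params <- function(x) {
  n_calls <<- n_calls + 1          # count how often the estimator runs
  list(mean = mean(x), sd = sd(x))
}

# Draw the gene selection for each run up front so both variants see the same genes
sel_per_run <- replicate(n_runs, sample(n_genes, n_sel), simplify = FALSE)

# Variant A: estimation inside the MC loop, only for the selected genes
n_calls <- 0
res_inside <- lapply(sel_per_run, function(sel) {
  lapply(sel, function(ii) estimate_gene_params(cpm_data[ii, ]))
})
calls_inside <- n_calls            # n_runs * n_sel calls

# Variant B: estimate once for every gene, then subset in each run
n_calls <- 0
est_all <- lapply(seq_len(n_genes), function(ii) estimate_gene_params(cpm_data[ii, ]))
res_hoisted <- lapply(sel_per_run, function(sel) est_all[sel])
calls_hoisted <- n_calls           # n_genes calls
```

Both variants yield the same estimates. Hoisting pays off only when the total number of genes is smaller than the number of runs times the number of genes selected per run, which is the cost of estimating parameters for every feature in the source data.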

sthawinke commented 4 years ago

Ok I am working on it, hope to finish today. Another question: is it correct that library sizes are only generated once now in the prepareSPsimOutputs() function, and then used in all MC runs ("LL")?

AlemuTA commented 4 years ago

Yes, they are generated once in the prepareSourceData() function. Do you suggest simulating them in every loop iteration?

sthawinke commented 4 years ago

Well, that is more of a scientific question, but yes, I would suggest doing so, as it better explores the whole parameter space. But that is minor; I will open a separate issue and leave it for later.
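The difference between the two options can be sketched as below. This is an assumption-laden illustration: resampling observed library sizes with replacement is not necessarily how SPsimSeq generates them, and `obs_lib_sizes` and `draw_lib_sizes()` are hypothetical names.

```r
# Hypothetical sketch; resampling observed library sizes is an assumption,
# not necessarily how SPsimSeq generates them.
set.seed(1)
obs_lib_sizes <- round(rlnorm(30, meanlog = 14, sdlog = 0.4))  # stand-in for source-data library sizes
n_samples <- 12; n_runs <- 4

draw_lib_sizes <- function(n) sample(obs_lib_sizes, n, replace = TRUE)

# Behaviour as described in the thread: one draw, reused by all MC runs
ll_once <- draw_lib_sizes(n_samples)
runs_shared <- lapply(seq_len(n_runs), function(i) ll_once)

# Suggested alternative: a fresh draw in every MC run
runs_fresh <- lapply(seq_len(n_runs), function(i) draw_lib_sizes(n_samples))
```

With a single draw, all runs are conditional on the same library sizes; with fresh draws, the variability in library size is also reflected across runs.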

AlemuTA commented 4 years ago

The new version (v2.0.2) includes optimized outputs (based on your proposal). However, to keep it consistent with the previous releases and with the output format of other simulation methods (such as Splatter), I made a small change to your proposed code. This way, we can still return the detailed results (only once), depending on the argument SPsimSeq(...,return.details=TRUE,...). Please see the dSPsimSeq() function to review the output format in practice. Better solutions are still welcome.

sthawinke commented 4 years ago

I think this has worked out fine, both in computational load and memory. I am closing the issue now.

CenterForStatistics-UGent / SPsimSeq

Redundancy in output #5