ImmuneDynamics / Spectre

A computational toolkit in R for the integration, exploration, and analysis of high-dimensional single-cell cytometry and imaging data.
https://immunedynamics.github.io/spectre/
MIT License

Memory issue caused by data size during fine alignment #50

Closed: gjrodrigo closed this issue 2 years ago

gjrodrigo commented 3 years ago

Good morning,

As it is my first time posting, I am not sure what kind of information you need from me. I am having a problem saving data after the alignment step in the general discovery script.

My data is much larger than the examples, and during the last stages of the alignment the process uses more RAM than the machine can handle and shuts down. As I am still learning this script, I am not sure which variables I can clear to reduce RAM use before saving the aligned data. Could I, for example, already remove the coarse-aligned and asinh-transformed data? Can anyone give me a little insight into this?

Thank you very much

best regards Rodrigo Gutierrez

SamGG commented 3 years ago

I think it would be helpful to give the number of cells and markers, and the amount of RAM on the other side. Apart from that, I have no idea.

tomashhurst commented 3 years ago

Hey @gjrodrigo, I'm definitely sympathetic to this particular problem, and well done for spotting it. I have a couple of suggestions, in order from fastest to slowest.

  1. One option is to only read a certain number of cells per file. In the read.files function, you can specify how many rows (i.e. cells) to read from each file using the nrows argument (Spectre v0.5 or above).
        data.list <- Spectre::read.files(file.loc = InputDirectory,
                                         file.type = ".csv",
                                         nrows = 10000,
                                         do.embed.file.names = TRUE)

This is probably the quickest option because it involves minimal changes to your workflow script. Depending on what your cellular composition is like, you could probably get away with this with minimal fuss.

  2. After asinh transformation, for example, you could write the dataset to disk and then delete the 'raw' data columns from the data.table before proceeding to coarse alignment (see the first sketch after this list). It's great to have all the data available in the one table, but this obviously puts some pressure on memory capacity. If that's not enough, you could do the same for the asinh-transformed data once you've completed the coarse alignment, and so on. If you do this, make sure to run gc(), which will free up any suspended memory (just deleting the data doesn't always do it straight away).

  3. If you use the updated alignment workflow, we skip the coarse alignment and go straight to CytoNorm, which will save some space in RAM.

  4. As an alternative, we have a version of this workflow that processes the data in 'chunks': data can be read from the disk, analysed, and then written back to the disk (see the second sketch below). For alignment using CytoNorm this is fairly straightforward, as the same model can be applied to multiple chunks of data independently. We can then utilise a classifier to allow this to work for clustering etc. I didn't put that workflow online when we developed it, as we were preoccupied with wrapping up the study we were using it for. However, it's on the list for our next set of workflows to go online (i.e. this one is explicitly designed for larger-than-RAM datasets). This would be the longest lead time, as we would probably have that ready to go in a week or two. We also have designs to do this with HDF5 files (i.e. the data is stored on disk, but mapped virtually in memory), but that's a much longer lead time.
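
To illustrate option 2, here is a minimal sketch of dropping columns from a data.table and reclaiming the memory. The object and column names (cell.dat, raw.cols) are placeholders for your own workflow, not Spectre defaults:

        library(data.table)

        # Hypothetical example: cell.dat is your main data.table, and
        # raw.cols are the untransformed marker columns you no longer need.
        raw.cols <- c("CD3", "CD4", "CD8")               # replace with your raw columns

        fwrite(cell.dat[, ..raw.cols], "raw_backup.csv") # back up the raw data to disk first
        cell.dat[, (raw.cols) := NULL]                   # delete the columns by reference
        gc()                                             # prompt R to release the freed memory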
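
For option 4, the general chunked pattern looks something like the following hand-rolled sketch (this is not the actual Spectre workflow; process.chunk is a stand-in for whatever per-chunk step you'd run, e.g. applying a trained CytoNorm model):

        library(data.table)

        files <- list.files(InputDirectory, pattern = "\\.csv$", full.names = TRUE)

        for (f in files) {
          chunk <- fread(f)                  # read one file's worth of cells
          chunk <- process.chunk(chunk)      # stand-in for the per-chunk analysis step
          fwrite(chunk, file.path(OutputDirectory, basename(f)))
          rm(chunk); gc()                    # free memory before the next chunk
        }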

Tom

gjrodrigo commented 3 years ago

Thank you very much Tom,

It helped a lot. I will be on the lookout for the new batch of workflows, because most of my data is much bigger than RAM.

Rodrigo

tomashhurst commented 3 years ago

@ghar1821 (and @SamGG ),

I'm testing out a workflow we plan to implement in Spectre v2 that uses HDF5 files to keep the data on disk and pull in chunks of it for processing. It works a little differently to the conventional approaches, but could be quite helpful. Would you be interested in trying it out? It's not up and running yet, but we probably aren't too far from having an alpha version you could try.
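
For anyone following along, something similar is already possible with the rhdf5 package from Bioconductor. The sketch below only illustrates on-disk storage with chunked reads; it is not the Spectre v2 implementation:

        library(rhdf5)

        # Create an on-disk dataset of 1 million cells x 30 markers,
        # stored in 10,000-row chunks.
        h5createFile("cells.h5")
        h5createDataset("cells.h5", "expr", dims = c(1e6, 30),
                        storage.mode = "double", chunk = c(10000, 30))

        # Write one chunk of (dummy) data, then read back only that slice;
        # the full matrix never has to sit in RAM at once.
        h5write(matrix(rnorm(10000 * 30), ncol = 30), "cells.h5", "expr",
                index = list(1:10000, 1:30))
        slab <- h5read("cells.h5", "expr", index = list(1:10000, 1:30))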

Tom

gjrodrigo commented 3 years ago

Hi Tom,

I would love to try it when it is available.

Thank you

Rodrigo