ebecht / infinityFlow


Questions/improvements #4

Open jacobpwagner opened 3 years ago

jacobpwagner commented 3 years ago

Hi @ebecht! I hope all is well with you. I am looking at integrating infinityFlow as part of larger modular workflows and had some questions about possibly adding flexibility to certain steps.

1) If no isotype arg is passed to the top-level infinity_flow function, the source code suggests that this is meant to be tolerated. I would expect the behavior in that case to be to simply skip isotype-specific background correction. However, that doesn't look possible, as background correction does not appear to be optional. Furthermore, an arg that is missing from the top-level infinity_flow call is evaluated to its NULL default when it is passed on to initialize, so the conditional of that check will never be FALSE. So the main question I have here is whether there is any way to bypass isotype-specific background correction as the package is currently written. If not, could it be added?

2) The scaling transformation for all channels is locked at a logicle transformation with specific parameter values. Could this be made more flexible via taking a list of transform objects for the different channels?
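To make the idea concrete, one possible interface would be a named list of flowCore transform objects keyed by channel, with the package falling back to its current logicle defaults for any channel not listed. A minimal sketch, where the `transforms` argument name and the channel names are hypothetical:

```r
library(flowCore)

# Hypothetical `transforms` argument: per-channel transform objects.
# Channels not listed would fall back to the package's current logicle defaults.
my_transforms <- list(
  "FSC-A" = linearTransform(a = 1/1000),
  "CD4"   = logicleTransform(w = 0.6, t = 262144, m = 4.5, a = 0),
  "CD8"   = arcsinhTransform(a = 0, b = 1/150)  # e.g. arcsinh for mass cytometry
)
# infinity_flow(..., transforms = my_transforms)  # argument does not exist yet
```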

3) Similarly, could UMAP embedding be optional? I would already be performing this as a separate step in a pipeline, so it results in unnecessary/redundant computation.

Lastly, the memory usage seems pretty high, to the point that I cannot run it for some large datasets without heavy downsampling. I can see what the best solution would be for the model training and prediction portions, as that will depend a bit on the different methods. But at least for the transformation portion, did you look at using flowWorkspace::transform on cytoframe/cytoset objects instead of converting to a data.frame, splitting, and transforming each column manually? Letting flowWorkspace manage operations on the frame data as an H5 file may reduce R's memory usage.
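As a rough sketch of what I mean (not the package's current code, and exact argument names vary across flowWorkspace versions):

```r
library(flowWorkspace)
library(flowCore)

# Sketch: keep the expression data on disk as HDF5-backed cytoframes and let
# flowWorkspace transform them, instead of converting to a data.frame and
# transforming each column manually. Cytosets have reference semantics, so
# the transform updates the underlying (on-disk) data.
cs <- load_cytoset_from_fcs(files, backend = "h5")        # `files`: FCS paths
translist <- transformList(channels, logicleTransform(w = 0.5))
transform(cs, translist)
```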

I realize that's a lot and may be out of the scope/intended use. Let me know if it would help for me to submit a pull request implementing any changes.

ebecht commented 3 years ago

Hi @jacobpwagner, good to hear from you. Thank you for taking the time to look at the source code in detail.

  1. That makes sense; I was also thinking about making this optional since, depending on the type of sample assayed, background correction may not be necessary. Thanks for pointing out the issue with the call to missing(). I will update the code to make background correction optional.
  2. This part is actually a small modification of flowCore::estimateLogicle, which has been working well for me. The goal was to make the package easy to use for users who have little R experience by taking care of the transformation automatically. I think it is not entirely straightforward to support a user-provided transformList: one needs transformation parameters for every backbone channel, and then for every PE channel of every input file. Happy to review a pull request for this if you are keen, but I would like to keep the current behavior as the default and not complicate the input arguments too much if possible.
  3. Yes that makes sense, I think it is quite doable to make UMAP and the plotting optional. I'll do that.
  4. For flowWorkspace, are you sure that the high memory usage you are observing is actually due to that step? About how many cells were you trying to process? I think it gets much worse afterwards, when a dense matrix with n_output_events x n_input_files elements is generated. I would like to keep the number of dependencies low, so I am a bit reluctant to add flowWorkspace.

Let me know what you think about 2. and 4.! Best, Etienne

jacobpwagner commented 3 years ago

Regarding point 2, I agree that keeping the current behavior as the default makes sense, so it would have to be an optional list-structure arg. But it's definitely a bit complex to add that support. Let me think about how to implement it in a way that plugs into your current framework as seamlessly as possible. If it seems doable, I'll make it a PR.

Regarding point 4, no I'm not really sure that it's coming from that step. I was just looking for anywhere that memory usage could potentially be trimmed. If it remains an issue, I can benchmark it to find the bottleneck and get back to you.

Thanks in advance for taking care of the other stuff and for all of your work on this.

jacobpwagner commented 3 years ago

The memory usage is still pretty limiting for reasonably large CyTOF sets. This probably arises in part from the data being combined into a single large expression matrix for all files, which in later steps is repeatedly read back into memory in its entirety, just to be split back out for operations on file-specific subsets.

For example, here the entire concatenated matrix is read into memory, split and scaled by file subset (without needing information from the other subsets), then re-concatenated and re-saved. If the RHS of each of those operations induces a temporary copy of the data, that is a hefty amount of extra in-memory storage. Couldn't the subsets just be kept as separate files, then loaded, scaled, and re-written separately? Even for regression/model fitting, only the events from a single file (for fitting the model for a single Infinity marker) are needed at a time, correct? I guess I'm not seeing where you absolutely need the full matrix of all files in memory at the same time.
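The per-file alternative I have in mind would look roughly like this (the file paths and the scaling helper are hypothetical stand-ins for whatever the package uses internally):

```r
# Sketch: keep each input file's matrix in its own RDS file and transform them
# one at a time, so peak memory is roughly one file's matrix rather than the
# full concatenated one. `per_file_paths` and `scale_fun` are hypothetical.
for (f in per_file_paths) {
  mat <- readRDS(f)       # load a single file's events
  mat <- scale_fun(mat)   # scale this subset only; no other file is needed
  saveRDS(mat, f)         # write back and release before the next file
}
```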

Relatedly, currently the input_events_downsampling restricts the data available for both training and prediction, correct? That is, num predicted events = min(input_events_downsampling, output_events_downsampling). Would it be possible to train models on a subset, then just use the trained models to predict events for all events for all files? So you could have something like training_events_downsampling and prediction_events_downsampling, without training_events_downsampling restricting the number of events to which the models are applied?
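Schematically, the decoupling I'm suggesting would look something like this, where `fit_marker_model` is a hypothetical stand-in for whichever regression method is selected:

```r
# Hypothetical sketch: train each marker's model on a downsampled subset,
# then apply it to every event. `expr`, `backbone_cols`, `pe_col`, and
# `training_events_downsampling` are illustrative names, not package API.
train_idx <- sample(nrow(expr), min(training_events_downsampling, nrow(expr)))
model <- fit_marker_model(expr[train_idx, backbone_cols], expr[train_idx, pe_col])
pred  <- predict(model, expr[, backbone_cols])  # all events, not just the subset
```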