alex-l-kong opened 2 weeks ago
@jranek @camisowers this is a very rough draft, but it provides an overview of how an AnnData conversion could look. Depending on how the FlowSOM_Python package works for us, it could encapsulate away a lot of the AnnData representation. Let me know your thoughts, thanks!
@srivarra
@cliu72
Relevant background
The Pixie pipeline computes and saves out a lot of extraneous files, most of which correspond to a common set of dimensions (num_cells and num_channels). This has made the pipeline extremely cumbersome to work with.
In light of a gradual conversion of the spatial analysis portion of the pipeline to AnnData, it would be best to think about how to utilize AnnData for a Pixie implementation.
Design overview
The implementation for Pixie will be different for pixel- and cell-level clustering.
Pixel-level clustering
Because Pixie is trained on only a subset of the pixels, pixel-level clustering will require some additional thought.
The full subsetted data can be represented as an AnnData object:
- `X`: the pixels x channels dataset. `X` may include additional layers for different levels of normalization. Keep in mind that even subsetted datasets may be very large, so we might only be able to keep the main training data in `X` for this component.
- `var`: channel-level information (ex. names, norm_coeffs, etc.)
- `obs`: various pixel-level assignments. This could contain columns like `fov`, `segmentation_label`, `x`, `y`, etc.
- `uns`: `pixel_thresh_val`, SOM weights, color mappings, etc.

The main challenge with the pixel clustering component is the need to train on a subset, then assign clusters on the full dataset. No matter how efficiently AnnData loads data, it would be cumbersome to load a full pixel-level dataset into memory.
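One way to handle this is to keep only the trained SOM weights and normalization coefficients in memory and stream each FOV's full pixel matrix through them one FOV at a time. A minimal numpy sketch; the array names, shapes, and the nearest-node assignment rule are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed to come from the pixel-level AnnData object, e.g.
# adata.uns["som_weights"] and adata.var["norm_coeff"]
som_weights = rng.random((100, 3))       # 100 SOM nodes x 3 channels
norm_coeffs = np.array([0.5, 1.2, 0.9])  # per-channel normalization

def assign_fov_pixels(fov_pixels: np.ndarray) -> np.ndarray:
    """Normalize one FOV's full pixel matrix and assign every pixel to
    its nearest SOM node, so peak memory is bounded by a single FOV."""
    normed = fov_pixels / norm_coeffs
    # distance from every pixel to every SOM node: shape (n_pixels, n_nodes)
    dists = np.linalg.norm(normed[:, None, :] - som_weights[None, :, :], axis=2)
    return dists.argmin(axis=1)

# stand-in for one FOV's .feather file read into an array
fov_pixels = rng.random((5000, 3))
cluster_ids = assign_fov_pixels(fov_pixels)
```

Looping this over each FOV's file keeps the full dataset out of memory while still using the weights trained on the subset.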
The SOM weights can still be accessed from `uns`, and the norm_coeffs from `var`, then used to normalize and assign each FOV's full pixel dataset individually.

The average channel expression per cluster tables will require some thought; there are two possible places they could go:
- `varm`: this is my first thought. Since these tables average each pixel cluster across channels, there's a correlation with the channel dimension.
- `uns`: since this is a high-level summary file, it may make more sense to put it here. Additionally, these average expression files are computed across the full dataset, whereas `X` in the pixel-level clustering AnnData object contains just the subsetted data.

Instead of saving a hacky `dict` with params to use in cell clustering, we can instead save this Pixie AnnData object and load it into the cell clustering notebook.

Cell-level clustering
Because we train and label the same dataset, AnnData representation is much easier for cell-level data.
- `X`: the cells x cell_expression_columns (ex. normalized pixel-level metacluster expression) dataset. `X` may include additional layers for different levels of normalization. We can create this matrix by leveraging the AnnData object from the pixel-level Pixie notebook.
- `var`: cell expression column information (ex. names)
- `obs`: various cell-level assignments. This could contain columns like `fov`, `segmentation_label`, `x`, `y`, etc. For cell clustering, this can also contain the actual SOM and meta-cluster assignments, which lets us more easily leverage `groupby` functionality on `obs`.
- `uns`: SOM weights, color mappings, etc.

As with pixel-level clustering, we need to decide whether the average cell_expression_col per cluster and weighted channel average tables should be stored in `varm` or `uns`.

NOTE: if the `FlowSOM_Python` repo works out for us (https://github.com/saeyslab/FlowSOM_Python), we may be able to delegate a lot of the work there.

Integration with metacluster remapping step/visualization
The visualization component has caused us no shortage of issues in the past, in large part because of inconsistent coloring schemes. AnnData can be used to store information related to the metacluster remapping step to save a lot of hassle.
For now, we can simply access the data we need to pass into the remapping GUI from the AnnData objects. Implementing this will likely be a separate PR in itself.
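For instance, the inputs the GUI needs (average expression per cluster plus a stable color per metacluster) can be derived on the fly from `obs`/`X` and a mapping kept in `uns`. A pandas sketch with made-up column names, cluster counts, and colors:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# stand-ins for a cell-level object's X and obs
X = pd.DataFrame(rng.random((200, 3)), columns=["col_a", "col_b", "col_c"])
obs = pd.DataFrame({"cell_som_cluster": rng.integers(0, 6, size=200)})

# hypothetical tables kept in adata.uns for the remapping GUI
uns = {
    "metacluster_mapping": pd.Series([0, 0, 1, 1, 2, 2], name="metacluster"),
    "metacluster_colors": {0: "#1f77b4", 1: "#ff7f0e", 2: "#2ca02c"},
}

# average expression per SOM cluster: one groupby, no intermediate CSVs
avg_per_cluster = X.groupby(obs["cell_som_cluster"]).mean()

# a consistent color for each row the GUI displays
row_colors = [
    uns["metacluster_colors"][uns["metacluster_mapping"][som_id]]
    for som_id in avg_per_cluster.index
]
```

Because the mapping and colors live in one object, every notebook rerun sees the same scheme instead of regenerating it.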
Code mockup
An AnnData object can be created like:
For pixel-level training, we'd have to concatenate the data from all the `.feather` files together; we already do this very efficiently in the existing Pixie workflow. Here's how the preprocessing function could look with AnnData:
For integration with the rest of Pixie, we'll need to rethink the `cluster_helpers.py` classes. We can extrapolate the approach to the existing workflows to leverage SOM and meta-cluster fitting and prediction; the bulk of the functionality derives from the base `PixieSOMCluster` class, so only that class is shown for demonstration:
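As a rough sketch of what that base class might look like when it reads from an AnnData-backed dataset — every name and signature below is an assumption, not the current `cluster_helpers.py` API:

```python
import numpy as np

class PixieSOMCluster:
    """Rough sketch of a base clustering class. In practice the weights
    would be fit by FlowSOM_Python and persisted in adata.uns."""

    def __init__(self, num_nodes: int, num_features: int, seed: int = 42):
        rng = np.random.default_rng(seed)
        # placeholder SOM node weights (num_nodes x num_features)
        self.weights = rng.random((num_nodes, num_features))

    def normalize(self, X: np.ndarray, norm_coeffs: np.ndarray) -> np.ndarray:
        """Apply the per-channel coefficients stored in adata.var."""
        return X / norm_coeffs

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Assign each row of X to its nearest SOM node."""
        dists = np.linalg.norm(X[:, None, :] - self.weights[None, :, :], axis=2)
        return dists.argmin(axis=1)

cluster = PixieSOMCluster(num_nodes=100, num_features=3)
labels = cluster.predict(np.random.default_rng(0).random((50, 3)))
```

The pixel- and cell-level subclasses would then differ mainly in which AnnData object (and which `uns`/`var` entries) they read from.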
TODO: include meta cluster remapping stage as well as post-norm visualizations
Required inputs
Same as before; these will be programmatically combined into an AnnData object.
Output files
Instead of several fragmented files, the goal is to unify all of this into a single AnnData object that gets saved to a single .h5ad file.
Timeline
Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.
Estimated date when a fully implemented version will be ready for review:
Early next year.
Estimated date when the finalized project will be merged in:
Early next year.