Irrationone / cellassign

Automated, probabilistic assignment of cell types in scRNA-seq data
Other
192 stars 81 forks source link

Implementation in scvi-tools #82

Closed adamgayoso closed 3 years ago

adamgayoso commented 3 years ago

Hi cellassign team,

We might be interested in implementing a Python version of cellassign in scvi-tools.

I'm taking a look at the EM optimization for cell assign, and I see that the M step is performed over all cells in the dataset? Do you have any experience with mini-batching (I believe in the EM literature, it's called incremental EM)? It would be easier to implement it this way in scvi-tools and I was hoping could lead to faster convergence?

Please let me know if you have any thoughts!

kieranrcampbell commented 3 years ago

Hi @adamgayoso

A python version of cellassign would be awesome, we've had endless issues with the tensorflow <-> R compatibility for installs.

We discussed mini-batching for EM at some point and don't think we went through with it (though maybe Allen can correct). Don't see any fundamental problems with it other than possibly needing better convergence monitoring since you'd then be taking noisy gradient steps.

We're also working on a model called Astir ( https://github.com/camlab-bioml/astir ) based on CellAssign with a modified likelihood and covariance structure for highly multiplexed imaging. This uses recognition nets for amortized inference rather than explicitly instantiating the variables, which could also work.

Happy to discuss more!

Kieran

adamgayoso commented 3 years ago

Hi @kieranrcampbell

We will likely go ahead and try implementing CellAssign -- we are in the middle of writing a developer-centric API to easily write new models without the overhead of defining datasets, data loaders, trainers etc. So basically for CellAssign we will just need to define the model parameters, the E step, and the likelihood objective. To get an idea if you're curious, we have implemented Stereoscope for deconvolution of spatial data using scRNA-seq data.

It seems like Astir could also benefit from our infrastructure. We are also currently trying to figure out our "contributing guidelines", for example, which models should go in scvi.external versus having developers just require our package and release their own packages, etc. I'm happy to continue this discussion off of github as well.

Irrationone commented 3 years ago

Hi @adamgayoso

With regards to mini-batching, definitely doable; had briefly tried this in the past as Kieran said but decided to not go ahead with it without more comprehensive convergence testing. Happy to hear that you guys are interested in implementing a Python version -- feel free to reach out with any questions.

adamgayoso commented 3 years ago

@wukathy has implemented something that looks to give similar predictions as your implementation!! It takes about 30 seconds to run on a GPU. We still have some things we'd like to cleanup to mirror your implementation more.

Also, is "celltype" the correct metadata key to use to evaluate consistency with this implementation? And do you know how long would these datasets take to run with full batch EM?

image

image

adamgayoso commented 3 years ago

And no issue with convergence:

image

kieranrcampbell commented 3 years ago

This looks awesome! The way we usually evaluate is by post-hoc inspection of the expression of the markers per cell type, but given the consistency I wouldn't even bother

Not sure I understand the metadata question - if it's a field you're adding, celltype makes sense. If it's a field you're expecting from the user, it could be any of celltype, cell_type, Cell type, etc etc

adamgayoso commented 3 years ago

Oh sorry, we had downloaded the data from the paper, which is being displayed above. The objects have a lot of cell-specific metadata, and appear to have the output of your implementation of CellAssign. I just wanted to confirm which one we should compare to for reproducibility.

Index(['Sample', 'dataset', 'patient', 'timepoint', 'progression_status',
       'patient_progression', 'sample_barcode', 'is_cell_control',
       'total_features_by_counts', 'log10_total_features_by_counts',
       'total_counts', 'log10_total_counts', 'pct_counts_in_top_50_features',
       'pct_counts_in_top_100_features', 'pct_counts_in_top_200_features',
       'pct_counts_in_top_500_features', 'total_features_by_counts_endogenous',
       'log10_total_features_by_counts_endogenous', 'total_counts_endogenous',
       'log10_total_counts_endogenous', 'pct_counts_endogenous',
       'pct_counts_in_top_50_features_endogenous',
       'pct_counts_in_top_100_features_endogenous',
       'pct_counts_in_top_200_features_endogenous',
       'pct_counts_in_top_500_features_endogenous',
       'total_features_by_counts_feature_control',
       'log10_total_features_by_counts_feature_control',
       'total_counts_feature_control', 'log10_total_counts_feature_control',
       'pct_counts_feature_control',
       'pct_counts_in_top_50_features_feature_control',
       'pct_counts_in_top_100_features_feature_control',
       'pct_counts_in_top_200_features_feature_control',
       'pct_counts_in_top_500_features_feature_control',
       'total_features_by_counts_mitochondrial',
       'log10_total_features_by_counts_mitochondrial',
       'total_counts_mitochondrial', 'log10_total_counts_mitochondrial',
       'pct_counts_mitochondrial',
       'pct_counts_in_top_50_features_mitochondrial',
       'pct_counts_in_top_100_features_mitochondrial',
       'pct_counts_in_top_200_features_mitochondrial',
       'pct_counts_in_top_500_features_mitochondrial',
       'total_features_by_counts_ribosomal',
       'log10_total_features_by_counts_ribosomal', 'total_counts_ribosomal',
       'log10_total_counts_ribosomal', 'pct_counts_ribosomal',
       'pct_counts_in_top_50_features_ribosomal',
       'pct_counts_in_top_100_features_ribosomal',
       'pct_counts_in_top_200_features_ribosomal',
       'pct_counts_in_top_500_features_ribosomal', 'size_factor',
       'cellassign_cluster_broad', 'cellassign_cluster_specific',
       'B.cells..broad.', 'T.cells..broad.', 'other..broad.', 'B.cells',
       'Cytotoxic.T.cells', 'CD4.T.cells', 'Tfh', 'other', 'celltype',
       'malignant_status_manual', 'celltype_full', 'G1', 'S', 'G2M',
       'Cell_Cycle', 't_seurat_cluster', 't_seurat_0.8_cluster',
       't_phenograph_cluster', 't_cluster', 'malignant_seurat_cluster',
       'malignant_seurat_0.8_cluster', 'malignant_phenograph_cluster',
       'malignant_cluster', 'b_seurat_cluster', 'b_seurat_0.8_cluster',
       'b_phenograph_cluster', 'b_cluster', 'all_seurat_cluster',
       'all_seurat_0.8_cluster', 'all_seurat_1.2_cluster', 'all_sc3_cluster',
       'all_SC3_cluster', 'all_cluster', 'all_subset_seurat_cluster',
       'all_subset_seurat_0.8_cluster', 'all_subset_seurat_1.2_cluster',
       'all_subset_cluster'],
      dtype='object')
kieranrcampbell commented 3 years ago

Gotcha! Yes celltype should be the correct one here

adamgayoso commented 3 years ago

We completed the implementation, it will be available in our next release this week. It's very consistent with the two main datasets in your manuscript, but we will add a warning that it's experimental and will continue to promote the original implementation.

We would also appreciate if there could be a link to the python implementation on this repo!