BayraktarLab / cell2location

Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics (cell2location model)
https://cell2location.readthedocs.io/en/latest/
Apache License 2.0

Conversion of NMF-specific factor-based scores to discrete groups #335

Open archanabhardwaj opened 9 months ago

archanabhardwaj commented 9 months ago

Very interesting tool!!

I have been using cell2location to deconvolve 10x Visium spatial transcriptomics data together with our in-house single-cell datasets, and the results have consistently been very promising. As a next step in my analysis, I would like to transform the factor-based scores into discrete groups that can be used to classify spots unambiguously.

This is what the factor results look like:

| spot_id | mean_nUMI_factorsfact_0 | mean_nUMI_factorsfact_1 | mean_nUMI_factorsfact_2 |
| -- | -- | -- | -- |
| A1_AAACAAGTATCTCCCA-1 | 0.105838572195757 | 0 | 0 |
| A1_AAACAGAGCGACTCCT-1 | 0.324799348790577 | 0 | 0 |
| A1_AAACAGCTTTCAGAAG-1 | 1.15356177790355 | 0 | 0 |
| A1_AAACAGGGTCTATATT-1 | 0.616822748525877 | 0 | 0 |
| A1_AAACATTTCCCGGATT-1 | 0.579633732252442 | 7.07129102030276 | 17.9217709562886 |
| A1_AAACCGGGTAGGTACC-1 | 1.24913446970342 | 3.54084434787199 | 0.668613378416122 |
| A1_AAACCGTTCGTCCAGG-1 | 0 | 0 | 0 |
| A1_AAACCTAAGCAGCCGG-1 | 0.374345846808592 | 0.007036758781487 | 0 |
| A1_AAACCTCATGAAGTTG-1 | 0.814214381154637 | 14.2101979654052 | 18.7039938972178 |
| A1_AAACGAAGAACATACC-1 | 0.043124813915801 | 0 | 0 |
| A1_AAACGAGACGGTTGAT-1 | 0.468739714378215 | 0.02594835560077 | 1.43807505427844 |
| A1_AAACGGGCGTACGGGT-1 | 0.324312880713552 | 10.7817583299243 | 19.7107231867464 |

I would like to convert the NMF scores into discrete groups:

| spot_id | mean_nUMI_factorsfact_0 | mean_nUMI_factorsfact_1 | mean_nUMI_factorsfact_2 |
| -- | -- | -- | -- |
| A1_AAACAAGTATCTCCCA-1 | 1 | 0 | 0 |
| A1_AAACAGAGCGACTCCT-1 | 1 | 0 | 0 |
| A1_AAACAGCTTTCAGAAG-1 | 1 | 0 | 0 |
| A1_AAACAGGGTCTATATT-1 | 1 | 0 | 0 |
| A1_AAACATTTCCCGGATT-1 | 0 | 2 | 0 |
| A1_AAACCGGGTAGGTACC-1 | 0 | 2 | 0 |
| A1_AAACCGTTCGTCCAGG-1 | 0 | 2 | 0 |
| A1_AAACCTAAGCAGCCGG-1 | 0 | 2 | 0 |
| A1_AAACCTCATGAAGTTG-1 | 0 | 0 | 3 |
| A1_AAACGAAGAACATACC-1 | 0 | 0 | 3 |
| A1_AAACGAGACGGTTGAT-1 | 0 | 0 | 3 |
| A1_AAACGGGCGTACGGGT-1 | 0 | 0 | 3 |

I would appreciate any suggestions.

kuang-da commented 9 months ago

I am exploring a similar direction. I believe we can extract the NMF sample loadings from a specific model (e.g. n_fact7) as follows:

import pandas as pd

# NMF loadings per location for the co-location model with 7 factors
mod = adata_vis.uns['mod_coloc_n_fact7']
nmf_df = pd.DataFrame(mod['post_sample_means']['location_factors'])
nmf_df.index = mod['obs_names']      # one row per Visium location
nmf_df.columns = mod['fact_names']   # one column per NMF factor
nmf_df.head()

[Screenshot: CleanShot 2023-12-03 at 09 35 29]

In some of the literature (e.g. LIGER), each observation is assigned to a cluster based on its highest association (i.e., the factor with the largest loading). Do you think this method of cluster assignment would be a reasonable addition to cell2location's NMF workflow?
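A minimal sketch of the highest-loading assignment described above, assuming the loadings are in a pandas DataFrame shaped like `nmf_df` (locations as rows, factors as columns). The helper name `assign_dominant_factor` and the toy values are hypothetical and not part of cell2location; locations whose loadings are all zero are left unassigned rather than forced into a group.

```python
import pandas as pd


def assign_dominant_factor(loadings: pd.DataFrame) -> pd.Series:
    """Label each location with the factor that has the highest loading.

    Hypothetical helper: locations with no positive loading get NaN
    instead of an arbitrary factor.
    """
    labels = loadings.idxmax(axis=1)
    # Mask rows where even the largest loading is zero.
    labels = labels.where(loadings.max(axis=1) > 0)
    return labels.rename("dominant_factor")


# Toy loadings (hypothetical values), standing in for nmf_df.
toy_loadings = pd.DataFrame(
    {
        "fact_0": [0.11, 0.58, 0.0],
        "fact_1": [0.0, 7.07, 0.0],
        "fact_2": [0.0, 17.92, 0.0],
    },
    index=["spot_a", "spot_b", "spot_c"],
)

dominant = assign_dominant_factor(toy_loadings)
print(dominant)
```

Factors can differ a lot in scale, so a raw maximum may favour factors with larger values. Rescaling each column before taking the maximum is one possible workaround, but that is an assumption on my side and not something cell2location prescribes.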

vitkl commented 9 months ago

The idea behind using NMF factorisation of cell abundance in Visium data is that the continuous nature of the factors makes it possible to separate spatially interlaced tissue zones: one location can capture data from several adjacent tissue zones (such as the dark and light zones of a Germinal Center). This mixing is, of course, a property of the measurement rather than of the tissue: given high enough resolution in the data, you could draw the boundary between the Germinal Center zones. Still, at Visium resolution you cannot say that a location contains only the dark zone when it is split 70%-30% between the two zones. In some tissues, discrete assignment of each location to one area does make sense, especially for broad rather than fine-grained anatomy (brain cortex vs thalamus vs hippocampus). In addition, there is a difference between anatomical clusters and cell relations (see Adler et al. 2023, https://pubmed.ncbi.nlm.nih.gov/37607470/): continuous models represent mutual enrichment of cell types near each other regardless of whether the cell types in question are dominant in Visium locations.

It can be reasonable to define thresholds on each factor to identify locations that contain a given tissue zone (one location - many factors). In most cases, it is less reasonable to require 1:1 mapping between locations and factors (one location - one factor).
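The per-factor thresholding can be sketched as follows, assuming a DataFrame of loadings shaped like `nmf_df` (locations as rows, factors as columns). The helper `threshold_factors`, the toy values and the quantile-based cutoff are all illustrative assumptions; suitable thresholds depend on the tissue and are best chosen by looking at the spatial plots of each factor.

```python
import pandas as pd


def threshold_factors(loadings: pd.DataFrame, quantile: float = 0.5) -> pd.DataFrame:
    """Return a boolean locations x factors table of zone membership.

    Hypothetical helper: each factor gets its own cutoff, taken as the
    given quantile of its non-zero loadings. A location may pass the
    cutoff for several factors, or for none.
    """
    nonzero = loadings.where(loadings > 0)   # zeros become NaN
    cutoffs = nonzero.quantile(quantile)     # one cutoff per factor, NaNs skipped
    return loadings.ge(cutoffs, axis=1)      # compare each column to its cutoff


# Toy loadings (hypothetical values), standing in for nmf_df.
toy_loadings = pd.DataFrame(
    {
        "fact_0": [0.1, 0.6, 1.2, 0.3],
        "fact_1": [0.0, 7.0, 3.5, 0.0],
        "fact_2": [0.0, 18.0, 0.7, 0.0],
    },
    index=["spot_a", "spot_b", "spot_c", "spot_d"],
)

membership = threshold_factors(toy_loadings)
print(membership)
print(membership.sum(axis=1))  # number of zones detected per location
```

The result keeps the "one location - many factors" structure: a location can belong to several zones at once, and locations below every cutoff stay unlabelled.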

The code @kuang-da provided is indeed the way to get these loadings.