BayraktarLab / cell2location

Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics (cell2location model)
https://cell2location.readthedocs.io/en/latest/
Apache License 2.0

Cell2location on Slide-seq data + over 100 cell types in my single-cell reference #207

Closed frankligy closed 1 year ago

frankligy commented 2 years ago

Hello, thanks so much for developing this awesome package. I used cell2location before on my 10x Visium data with a reference containing around 10 major cell types, and it worked well. I now need to deconvolve a Slide-seq dataset (the resolution is much higher), whereas the scRNA reference has over 100 cell types (with relatively clear marker gene expression separating them). I followed a similar workflow with some minor tweaks based on a few papers (https://www.nature.com/articles/s44161-022-00138-1, and the cell2location paper when analyzing Slide-seq): N_cells_per_location=1, detection_alpha=20 because I did see inter-spot variability in the slide, and the other parameters as shown in the official tutorial. However, the reconstruction QC plot (deconvolution step) looks off, and the proportions predicted by cell2location are all around 0.01 (I think that is 1 divided by 100 cell types).

Screen Shot 2022-10-19 at 1 19 34 PM

I am wondering what the issue might be. First, the read counts in the spatial data seem low and probably take only a few discrete values (count = 0, 1, 2, 3, ...). Second, the number of cell types in the scRNA reference may be too large (over 100 cell types). Is my interpretation correct, and would you have any recommendations for tweaking the program to better suit this task?
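One way to put numbers on the first point is to summarise the depth of the spatial count matrix. The following is a minimal sketch, not part of cell2location: the function name `count_depth_summary` is hypothetical, and it assumes a dense locations x genes array of raw counts (a sparse `adata.X` would need converting first, for example on a subset of locations).

```python
import numpy as np


def count_depth_summary(counts):
    """Summarise how shallow a locations x genes count matrix is.

    `counts` is assumed to be a dense array of raw (integer) counts.
    """
    counts = np.asarray(counts)
    total_per_location = counts.sum(axis=1)         # RNA counts per location
    genes_per_location = (counts > 0).sum(axis=1)   # detected genes per location
    return {
        "median_total_counts": float(np.median(total_per_location)),
        "median_genes_detected": float(np.median(genes_per_location)),
        "fraction_zero_entries": float((counts == 0).mean()),
        "max_count": int(counts.max()),
    }
```

Comparing these numbers between the Slide-seq data and a 10x Visium dataset that worked gives a rough sense of how much less information each location carries.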

vitkl commented 2 years ago

Hi @frankligy

Sorry for the slow response. Our reconstruction accuracy plot on Slide-seq v2 looks similar: https://github.com/vitkl/cell2location_paper/blob/master/notebooks/mouse_brain_slide_seq_v2/cell2location_pyro_slideseq_v3.ipynb. However, I do see that your data has less detected RNA (lower detection efficiency, or not sequenced deeply enough).

image

Cell2location doesn't predict cell proportions but cell abundance, adata.obsm['q05_cell_abundance_w_sf'], informed by N_cells_per_location. Did you use these values directly, or did you compute proportions by normalising by the sum per location?

Do you see reasonable spatial distribution of the cell types? Do all cell types have values of 0.01 everywhere or are some cell types "mapped"?

What did you provide as batch_key?

How did you train the model? All data on one GPU and model.train(..., batch_size=None)? If the answer is yes, then it could mean that, as we suspected, when data quality per location decreases, you need to pool information across neighbouring locations - not just across cell types as done in the current cell2location model.

I can suggest:

  1. Sequencing deeper if sequencing rather than RNA detection is an issue.
  2. Aggregate data for proximal locations (e.g. compute spatial KNN, take one in ten locations, aggregate RNA counts for all adjacent 9 locations).
  3. Try using 30-40 cell types just to see what happens. In general, 10 cell types is rather low - but it could be a good number for very homogeneous tissues. Which tissue are you looking at?
  4. Was the 10x Visium data from the same tissue? If so, try the 100 cell types on the 10x Visium data.
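Suggestion 2 above can be sketched roughly as follows. This is one possible reading of it, not code from cell2location: the function name `aggregate_knn` and its parameters are hypothetical, every tenth location is taken as an anchor, and the counts of each anchor are summed with those of its 9 nearest neighbours. Neighbourhoods of different anchors may overlap in this simple version, and the brute-force distance matrix is only practical for small examples; a KD-tree or similar would be needed for a full Slide-seq puck.

```python
import numpy as np


def aggregate_knn(counts, coords, step=10, k=9):
    """Pool counts over each anchor location and its k nearest neighbours.

    counts: dense locations x genes array of raw counts.
    coords: locations x 2 array of spatial coordinates.
    Returns pooled counts, anchor coordinates and the neighbour indices used.
    """
    counts = np.asarray(counts)
    coords = np.asarray(coords, dtype=float)
    anchors = np.arange(0, coords.shape[0], step)  # one in `step` locations

    # Squared Euclidean distance from every anchor to every location
    diff = coords[anchors][:, None, :] - coords[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)

    # The k + 1 closest locations include the anchor itself (distance 0)
    nn = np.argsort(d2, axis=1, kind="stable")[:, : k + 1]

    pooled = counts[nn].sum(axis=1)  # anchors x genes
    return pooled, coords[anchors], nn
```

The pooled matrix has ten times fewer rows, each with roughly ten times the counts, so the spatial resolution is traded for more RNA per modelled location.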

frankligy commented 2 years ago

Hi @vitkl,

Thanks so much for getting back to me!

Regarding the cell type abundance: yes, I took the absolute abundance from adata.obsm['q05_cell_abundance_w_sf'], and that actually raises another question. If these values are informed by N_cells_per_location, which I set to 1 per spot, why are the values I got around 1.5? Aren't they supposed to be within the range of 0-1?

Screen Shot 2022-10-28 at 2 00 57 PM

Going back to the original question: yes, after getting the absolute abundance, I normalized it into proportions that sum to 1 within each barcode. When I say "all around 0.01", I mean that the relative proportions of all cell types are around 0.01, just as if a simple average had been taken, which is why I think the results are off and uninformative.
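The normalisation described here, and a quick check for the "all around 1/K" symptom, could look like the sketch below. The function names are hypothetical, and the input is assumed to be the locations x cell types abundance table (for example the contents of adata.obsm['q05_cell_abundance_w_sf'] passed through `np.asarray`).

```python
import numpy as np


def abundance_to_proportions(abundance):
    """Normalise abundance so that each location (row) sums to 1."""
    abundance = np.asarray(abundance, dtype=float)
    totals = abundance.sum(axis=1, keepdims=True)
    # Rows with zero total abundance are left as all zeros
    return np.divide(
        abundance, totals, out=np.zeros_like(abundance), where=totals > 0
    )


def max_proportion_ratio(proportions):
    """Largest proportion per location divided by the uniform value 1/K.

    Values close to 1 mean no cell type stands out at that location.
    """
    n_types = proportions.shape[1]
    return proportions.max(axis=1) * n_types
```

With 100 cell types, a ratio near 1 at almost every location corresponds to the flat 0.01 pattern, whereas a location dominated by one cell type gives a ratio approaching 100.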

As for solutions I tried on my end, I confirmed that reducing the number of cell types to 19 actually yields decent results, so this less complicated reference is what we are going to move forward with for now.

I also want to mention that, as a test before asking this question, I used hard-coded signatures from our 100-cell-type reference that we know have distinctive marker genes for each cell type (as opposed to using the default NB model), but the deconvolution results were still not good. Given the successful run with the 19-cell-type signatures, this prompted me to think about why the model generates less optimal results as the number of cell types goes up. Is it because the increasing number of parameters results in convergence issues in the model when solving that posterior?

vitkl commented 1 year ago

> If these values are informed by N_cells_per_location, which I set as 1 per spot, why are the values I got around 1.5?

N_cells_per_location is not a hard constraint but a guide that encourages the model to estimate cell abundance similar to this number. The absolute cell abundance estimation can be disrupted by strong technical variation within a slide - an issue which we saw with Slide-seq data before.

In your example, it looks like the estimation of cell abundance generally failed - all cell types seem to have exactly the same value for all locations. Just to be safe, I would check if there are any gene swap/randomisation issues with 100 cell type signatures.
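A check for gene swap issues could be sketched as below. This is an illustration only: the function names and the `markers` dictionary are hypothetical, and the signatures are assumed to be a pandas DataFrame with genes as rows and cell types as columns. The idea is to align signatures to the spatial data by gene name rather than by position, and to confirm that known marker genes peak in the expected cell type.

```python
import pandas as pd


def align_signatures(signatures, spatial_genes):
    """Subset and reorder signatures (genes x cell types) by gene name."""
    spatial_genes = pd.Index(spatial_genes)
    if not signatures.index.is_unique or not spatial_genes.is_unique:
        raise ValueError("Gene names must be unique before aligning.")
    # Keep shared genes in the order used by the spatial data
    shared = spatial_genes[spatial_genes.isin(signatures.index)]
    return signatures.loc[shared], shared


def check_markers(signatures, markers):
    """Return marker genes whose highest signature is not the expected type.

    markers: dict mapping gene name -> expected cell type (assumed known).
    """
    mismatched = {}
    for gene, expected in markers.items():
        top = signatures.loc[gene].idxmax()
        if top != expected:
            mismatched[gene] = top
    return mismatched
```

An empty result from `check_markers` on a handful of trusted markers makes a row shuffle unlikely; many mismatches would suggest the gene labels and signature values have come apart.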

> Is it because the increasing number of parameters results in convergence issues in the model when solving that posterior?

Yes, this could mean that there is not enough information in your Slide-seq sample to distinguish all 100 subtypes.

One potential issue could be that N_cells_per_location=1 provides strong regularisation (expecting a very imbalanced distribution across cell types), so you could try N_cells_per_location=5 or N_cells_per_location=10 and see what happens.