Closed frankligy closed 1 year ago
Hi @frankligy
Sorry for the slow response. Our reconstruction accuracy plot on Slide-seq v2 is similar: https://github.com/vitkl/cell2location_paper/blob/master/notebooks/mouse_brain_slide_seq_v2/cell2location_pyro_slideseq_v3.ipynb - however, I do see that your data has less detected RNA (lower detection efficiency or not sequenced deeply enough).
Cell2location doesn't predict cell proportions but cell abundance (adata.obsm['q05_cell_abundance_w_sf']), informed by N_cells_per_location. Did you use these values directly, or did you compute proportions by normalising by the sum per location?
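For anyone following along, here is a minimal sketch of the "normalise by sum per location" step mentioned above. It uses a toy pandas DataFrame as a stand-in for adata.obsm['q05_cell_abundance_w_sf']; the helper name abundance_to_proportions and the toy values are assumptions for illustration, not part of cell2location.

```python
import numpy as np
import pandas as pd


def abundance_to_proportions(abundance: pd.DataFrame) -> pd.DataFrame:
    """Divide each location's abundances by their sum across cell types."""
    totals = abundance.sum(axis=1)
    # Locations with zero total abundance become NaN instead of dividing by zero.
    totals = totals.replace(0, np.nan)
    return abundance.div(totals, axis=0)


# Toy stand-in for adata.obsm['q05_cell_abundance_w_sf'] (values are made up).
abundance = pd.DataFrame(
    [[1.2, 0.2, 0.1], [0.0, 0.9, 0.6]],
    index=["spot_a", "spot_b"],
    columns=["type_1", "type_2", "type_3"],
)
proportions = abundance_to_proportions(abundance)
```

Note that the abundances themselves are not bounded by 1; only the derived proportions sum to 1 per location.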
Do you see reasonable spatial distribution of the cell types? Do all cell types have values of 0.01 everywhere or are some cell types "mapped"?
What did you provide as batch_key?
How did you train the model? All data on one GPU with model.train(..., batch_size=None)? If the answer is yes, then it could mean that, as we suspected, when data quality per location decreases you need to pool information across neighboring locations - not just across cell types as done in the current cell2location model.
I can suggest:
Hi @vitkl,
Thanks so much for getting back to me!
Regarding the cell type abundance: yes, I got the absolute abundance from adata.obsm['q05_cell_abundance_w_sf'], and it actually raises another question. If these values are informed by N_cells_per_location, which I set to 1 per spot, why are the values I got around 1.5? Aren't they supposed to be within the range 0-1?
Going back to the question: yes, after getting the absolute abundance, I normalized it into proportions so that they sum to 1 within each barcode. When I say all around 0.01, I mean that all cell types' relative proportions are around 0.01, the same as taking a simple average, which is why I think the results are off and uninformative.
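A quick way to quantify "all around 0.01" is to compare the proportions against the uniform value 1/K for K cell types. The sketch below is only an illustration on made-up matrices; uniformity_report is a hypothetical helper name, not a cell2location function.

```python
import numpy as np


def uniformity_report(proportions: np.ndarray) -> dict:
    """Summarise how far a locations x cell-types proportion matrix is from uniform."""
    n_types = proportions.shape[1]
    uniform = 1.0 / n_types
    return {
        "uniform_value": uniform,
        # Largest distance of any entry from 1/K; close to 0 means nothing was mapped.
        "max_abs_deviation": float(np.abs(proportions - uniform).max()),
        # Average proportion of the dominant cell type per location.
        "mean_top_proportion": float(proportions.max(axis=1).mean()),
    }


# Made-up example 1: every cell type gets 1/100 at every location.
flat = np.full((4, 100), 0.01)

# Made-up example 2: one dominant cell type per location, rows still sum to 1.
peaked = np.full((4, 100), 0.001)
for i in range(4):
    peaked[i, i] = 1.0 - 99 * 0.001

flat_report = uniformity_report(flat)
peaked_report = uniformity_report(peaked)
```

With 100 cell types, a max_abs_deviation near zero corresponds to the uninformative case described above.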
As for the solutions I tried on my end, I confirmed that reducing the number of cell types to 19 (a less complicated reference) actually yields decent results, so this is what we are going to move forward with for now.
I also want to mention that, as a test before asking this question, I used hard-coded signatures from our 100-cell-type reference, which we know has distinctive gene markers for each cell type (as opposed to using the default NB model), but the deconvolution results were still not good. Given the successful run with the 19-cell-type signatures, this prompted me to think about why the model generates less optimal results as the number of cell types goes up. Is it because the increasing number of parameters results in convergence issues in the model when solving that posterior?
> If these values are informed by N_cells_per_location, which I set to 1 per spot, why are the values I got around 1.5?
N_cells_per_location is not a hard constraint but a guide that encourages the model to estimate cell abundance similar to this number. The absolute cell abundance estimation can be disrupted by strong technical variation within a slide - an issue which we saw with Slide-seq data before.
In your example, it looks like the estimation of cell abundance generally failed - all cell types seem to have exactly the same value for all locations. Just to be safe, I would check if there are any gene swap/randomisation issues with 100 cell type signatures.
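One way to rule out a gene swap is to confirm that the signature matrix and the spatial data are matched by gene name rather than by position. This is a rough sketch on toy data; check_signature_genes and the example gene lists are assumptions for illustration.

```python
import pandas as pd


def check_signature_genes(signatures: pd.DataFrame, spatial_genes) -> dict:
    """Compare gene names of a genes x cell-types signature table with the spatial data."""
    spatial_genes = pd.Index(spatial_genes)
    shared = signatures.index.intersection(spatial_genes)
    return {
        "n_signature_genes": len(signatures.index),
        "n_shared": len(shared),
        # Duplicated gene names make name-based alignment ambiguous.
        "duplicated_in_signature": int(signatures.index.duplicated().sum()),
        # False means positional indexing would pair the wrong genes.
        "same_order": signatures.index.equals(spatial_genes),
    }


# Toy signature table (genes x cell types); values are made up.
signatures = pd.DataFrame(
    {"type_1": [5.0, 0.1, 0.2], "type_2": [0.1, 4.0, 0.3]},
    index=["Gad1", "Slc17a7", "Mbp"],
)
# Toy stand-in for the spatial data's gene names, in a different order.
spatial_genes = ["Mbp", "Gad1", "Slc17a7", "Actb"]

report = check_signature_genes(signatures, spatial_genes)
shared = signatures.index.intersection(pd.Index(spatial_genes))
aligned = signatures.loc[shared]  # subset by name, never by position
```

If same_order is False, subsetting both objects to the shared genes by name before fitting avoids pairing the wrong rows.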
> Is it because the increasing number of parameters results in convergence issues in the model when solving that posterior?
Yes, this could mean that there is not enough information in your Slide-Seq sample to distinguish all 100 subtypes.
One potential issue could be that N_cells_per_location=1 provides strong regularisation (expecting a very imbalanced distribution across cell types) - so you could try N_cells_per_location=5 or N_cells_per_location=10 and see what happens.
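A rough sketch of how such a comparison could be set up, assuming the calls from the public cell2location tutorial (setup_anndata, Cell2location, train, export_posterior). The wrapper fit_with_prior, the argument names adata_vis and inf_aver, and max_epochs=30000 are assumptions here, and signatures may differ between package versions, so please check against your installed release.

```python
# Prior values to compare: the original setting plus the two suggested above.
N_CELLS_GRID = (1, 5, 10)


def fit_with_prior(adata_vis, inf_aver, n_cells, batch_key=None, max_epochs=30000):
    """Fit cell2location once with a given N_cells_per_location (hypothetical wrapper)."""
    import cell2location  # imported lazily so this file loads without the package

    adata_run = adata_vis.copy()  # keep each run independent of the others
    cell2location.models.Cell2location.setup_anndata(adata=adata_run, batch_key=batch_key)
    mod = cell2location.models.Cell2location(
        adata_run,
        cell_state_df=inf_aver,
        N_cells_per_location=n_cells,
        detection_alpha=20,  # value used in this thread
    )
    # batch_size=None trains on all locations at once, as discussed above.
    mod.train(max_epochs=max_epochs, batch_size=None, train_size=1)
    adata_run = mod.export_posterior(
        adata_run,
        sample_kwargs={"num_samples": 1000, "batch_size": mod.adata.n_obs},
    )
    return adata_run.obsm["q05_cell_abundance_w_sf"].copy()


# Usage (not executed here; needs your own adata_vis and inf_aver):
# results = {n: fit_with_prior(adata_vis, inf_aver, n) for n in N_CELLS_GRID}
```

Comparing the resulting abundance tables side by side would show whether the uniform output depends on the prior.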
Hello, thanks so much for developing this awesome package. I used Cell2location before on my 10x Visium data with a reference containing around 10 major cell types, which worked pretty well. I now have a task where I need to deconvolve Slide-seq data (the resolution is much higher), whereas the scRNA reference has over 100 cell types (with relatively clear marker gene expression separating them). I followed a similar workflow with some minor tweaks (N_cells_per_location=1, other parameters the same as shown in the official tutorial; I used detection_alpha=20 as I did see inter-spot variability in the slide) based on a few papers (https://www.nature.com/articles/s44161-022-00138-1, and the cell2location paper when analyzing Slide-seq), but the reconstruction QC plot (deconvolution step) looks off and the cell2location predicted proportions are all around 0.01 (presumably 1 divided by 100 cell types).
I am thinking about what might be the issue here. First, it seems that the read counts for the spatial data are low, and probably only contain a few discrete values (count = 0, 1, 2, 3, ...). Second, the number of cell types in the scRNA reference may be too large (over 100 cell types). With that, I wonder if my interpretation is correct, and would you happen to have any recommendations for tweaking the program a bit to better suit this task?
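To put numbers on the "low counts" suspicion, a per-location depth summary along these lines may help. It is a minimal sketch on simulated Poisson counts and assumes a dense locations x genes NumPy array; a sparse matrix would need converting or different handling. count_depth_summary is a hypothetical helper name.

```python
import numpy as np


def count_depth_summary(counts: np.ndarray) -> dict:
    """Summarise sequencing depth for a dense locations x genes count matrix."""
    total_per_location = counts.sum(axis=1)
    genes_per_location = (counts > 0).sum(axis=1)
    return {
        "median_total_counts": float(np.median(total_per_location)),
        "median_genes_detected": float(np.median(genes_per_location)),
        "fraction_zero_entries": float((counts == 0).mean()),
        "max_count": int(counts.max()),
    }


# Simulated sparse counts standing in for a Slide-seq matrix (not real data).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(0.05, size=(200, 1000))
summary = count_depth_summary(toy_counts)
```

Running the same summary on the Visium data that worked well gives a baseline to compare against.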
N_cells_per_location and detection_alpha. batch_key for reference NB regression.