dcjones / proseg

Probabilistic cell segmentation for in situ spatial transcriptomics

High fraction of cells empty - alert from xeniumranger after import #25

Open imran-aifi opened 1 month ago

imran-aifi commented 1 month ago

When running proseg on Xenium data and following the Xenium Explorer tutorial, the xeniumranger QC report flags a high fraction of empty cells:

High fraction of cells empty | 18.2% | This alert is triggered when the value is >= 10.0%. An unusually high fraction of segmented cells were found to contain no transcripts. Check the cell segmentation.

The total number of cells (153000) and median transcripts per cell (23) are comparable to the original xeniumranger results using Xenium's older nuclear expansion segmentation.

I ran proseg with proseg --xenium --ncomponents 10 as the only parameters. Do you have recommended parameters, or guidance on whether to use --voxel-layers 1 to simplify the segmentation, since we do not have serial sections for 3D spatial transcriptomics?
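For anyone who wants to reproduce the alert's arithmetic outside of xeniumranger, here is a minimal sketch. It assumes you already have one transcript count per segmented cell; the `counts` list is made-up illustrative data, and `empty_cell_fraction`, `triggers_alert`, and `ALERT_THRESHOLD` are hypothetical names, not part of xeniumranger or proseg. Only the 10.0% threshold is taken from the alert text quoted above.

```python
# Threshold quoted in the alert: triggered when the value is >= 10.0%.
ALERT_THRESHOLD = 0.10


def empty_cell_fraction(counts):
    """Return the fraction of cells with zero assigned transcripts."""
    if not counts:
        raise ValueError("counts must contain at least one cell")
    return sum(1 for c in counts if c == 0) / len(counts)


def triggers_alert(counts, threshold=ALERT_THRESHOLD):
    """Return True when the empty fraction reaches the alert threshold."""
    return empty_cell_fraction(counts) >= threshold


if __name__ == "__main__":
    # Illustrative per-cell transcript counts (not real data).
    counts = [0, 23, 41, 0, 7, 19, 30, 12, 5, 26]
    frac = empty_cell_fraction(counts)
    print(f"{frac:.1%} of cells are empty; alert: {triggers_alert(counts)}")
```

Running the same function on the per-cell counts from each segmentation makes it easy to compare runs on equal terms.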

dcjones commented 1 month ago

I can't say for sure what's going on, but I suspect some subset of cells has extremely low coverage (the median is pretty low here). Proseg infers cell boundaries from transcripts, so where a cell has only a few transcripts it may simply lack the data to do a good job. It may consider those transcripts so sparse that they look like background noise.

Do you know what fraction was empty in Xenium's segmentation by comparison?
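One way to probe the low-coverage hypothesis above is to look at the whole distribution of transcripts per cell rather than the median alone. The sketch below is only an illustration: `coverage_summary` and the `low_cutoff` of 5 transcripts are hypothetical choices, not anything proseg or xeniumranger defines, and the input is assumed to be a plain list of per-cell transcript counts.

```python
from statistics import median


def coverage_summary(counts, low_cutoff=5):
    """Summarize per-cell transcript counts.

    low_cutoff is an arbitrary, hypothetical cutoff for "low coverage":
    cells with at least one transcript but fewer than low_cutoff.
    """
    if not counts:
        raise ValueError("counts must contain at least one cell")
    n = len(counts)
    empty = sum(1 for c in counts if c == 0)
    low = sum(1 for c in counts if 0 < c < low_cutoff)
    return {
        "median": median(counts),
        "empty_fraction": empty / n,
        "low_fraction": low / n,
    }


if __name__ == "__main__":
    # Illustrative per-cell transcript counts (not real data).
    counts = [0, 2, 3, 10, 20, 23, 25, 40]
    for name, value in coverage_summary(counts).items():
        print(f"{name}: {value}")
```

If a large share of cells sits just above zero in the original segmentation, that would be consistent with those cells losing their few transcripts to neighbors or to background.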

imran-aifi commented 1 month ago

Thanks for the point of reference on the low coverage! Xenium's segmentation resulted in 4.5% empty. Looking at the unassigned (presumably empty) cells in Xenium Explorer, they tend to be located in regions of the tissue that we know are low quality (high RBC count, etc). I got some improvement after running proseg with --voxel-layers 1: the empty fraction is now 10.3%, though those unassigned cells are more evenly distributed across the tissue. From what I've gleaned, the starting point for the simulation is the initial nuclei assignments from 10X. Perhaps some of these nuclei are misidentified, or have few or no transcripts assigned to them (with the transcripts for these nuclei assigned to neighboring cells or treated as background)?