Nanostring-Biostats / InSituType

An R package for performing cell typing in SMI and other single cell data

Severe overestimation of celltypes (Hypendymal) #209

Open roanvanscheppingen opened 1 month ago

roanvanscheppingen commented 1 month ago

Currently, we are experiencing extreme overrepresentation of Hypendymal cells after running InSituType. It seems to be the go-to cell type when the data is shallow. Although the data has passed the NanoString QC metric (minimum 100 transcripts / cell), the total transcript numbers per cell are quite low.

We have the suspicion that, with the highest expressing markers for Hypendymal cells being Mbp, Apoe and Malat1, this biases the celltyping significantly. Mbp, Apoe and Malat1 are amongst the highest detected genes in (most) brain datasets out there. Now we have 27393 out of 48597 cells assigned 'Hypendymal', while hypendymal cells are not so prevalent in the brain and should rather be considered a more 'rare' celltype.

Another dataset with a bit higher quality, but the same brain regions, still has 5870 out of 25183 cells assigned hypendymal. The ideal test would be a dataset which would be iteratively subsetted to the higher quality cells and then the % hypendymal could be plotted.
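The test proposed above could be sketched roughly as follows. This is a minimal illustration in base R on simulated data; `total_counts`, `cell_type`, the helper `pct_by_threshold`, and the threshold grid are hypothetical stand-ins, not objects from the analysis in this issue.

```r
# Percentage of cells carrying a given label among cells at or above each
# total-count threshold. All names here are hypothetical.
pct_by_threshold <- function(total_counts, cell_type, thresholds,
                             label = "Hypendymal") {
  vapply(thresholds, function(th) {
    keep <- total_counts >= th
    if (!any(keep)) return(NA_real_)  # no cells left at this threshold
    100 * mean(cell_type[keep] == label)
  }, numeric(1))
}

# Simulated data (an assumption for illustration only): shallow cells are
# more likely to be labelled "Hypendymal".
set.seed(1)
n_cells <- 1000
total_counts <- rpois(n_cells, lambda = 150)
cell_type <- ifelse(runif(n_cells) < 60 / total_counts, "Hypendymal", "Other")

thresholds <- seq(100, 170, by = 10)
pct_hypendymal <- pct_by_threshold(total_counts, cell_type, thresholds)

plot(thresholds, pct_hypendymal, type = "b",
     xlab = "Minimum transcripts per cell", ylab = "% Hypendymal")
```

With real data, `total_counts` would be the per-cell transcript totals and `cell_type` the InSituType assignments; a curve that falls as the threshold rises would support the link between shallow cells and Hypendymal calls.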

github-actions[bot] commented 1 month ago

Thank you for contacting us about our tools! To receive assistance, kindly email support@nanostring.com with detailed information about your issue. If applicable, attach a screenshot of any encountered errors and include a copy of the modified script in Notepad. Our customer support team will help facilitate a review and resolution of the issue.

Thank you for choosing NanoString,
NanoString Dev Team

roanvanscheppingen commented 1 month ago

Extra information. Numbers stated above are from the supervised clustering using the MouseBrain reference.

When performing semi-supervised clustering:

semi_sup_clust <- InSituType::insitutype(x = expression_matrix_tr, neg = MeanNegativeProbes, reference_profiles = MouseBrain_profiles, n_clusts = 5:15, update_reference_profiles = FALSE, max_iters = 5)

Dataset A goes from 5870 Hypendymal to 2031 hypendymal, but 18580 cells out of ~25K now fall within 'newly' created clusters and take up the majority of the data. Extensive filtering has been performed (>35 features per cell, MeanNeg <0.1, scrublet doublet removal).

Dataset B goes to 1549 Hypendymal instead of the 27393, but also here roughly 50% of data ends up in 'new' clusters. Here filtering is almost the same, but >20 features per cell.

Removing Malat1 from the matrix before running InSituTypeML only lowers the Hypendymal cells from 27393 to 26395.

This is done on a proseg resegmented dataset, but we know that proseg doesn't really change the distribution of transcripts in cells, and we have seen up to 50% allocation into new clusters in the original flatfiles provided by Nanostring before (data not here).

Currently, this means that InSituType is giving hard-to-interpret results in a setting that is supposed to be the easiest implementation: the same system, the same markers and even a well-profiled reference. Please let me know if I can provide you with files for investigation. See the attached flightpath plot below (dataset B).

image

roanvanscheppingen commented 1 month ago

I did some further investigation, since we suspect a strong link between data quality and celltyping abilities.

image

Here you see the different cell types assigned by InSituTypeML (cohorting included, proseg analysed; I'll also rerun it on non-proseg data). As discussed before, Hypendymal cells are severely overrepresented (20% of data), but upon further inspection we also think radial glia (903 out of 25K) and pericytes (808 out of 25K) are overrepresented.

This is further supported by a small heatmap showing a few marker genes, some of which should be more 'glial' or 'astrocyte' (Ptgds, Cryab and Apod). Yet they go predominantly into oligodendrocytes. In general, we see higher expression of (most) genes in oligodendrocytes anyway.

image

To further confirm this, I took a publicly available dataset from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM8199188 , which is of impressive quality. I ran InSituType ML (no cohorts), then I reduced the expression matrix by a factor of 10. It is now in the same range as our own dataset (which passed NanoString QC and is even filtered more stringently):

expression_matrix_tr <- expression_matrix / 10

I rounded the matrix because integer counts are needed:

expression_matrix_tr <- round(expression_matrix_tr)

I then reran InSituType with the same settings. Left is cell types by frequency at a mean of 1500 transcripts / cell; right is the downscaled matrix.
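For reference, the downscaling described above can be sketched in base R on a toy matrix. The second variant, binomial thinning with `rbinom`, is not what was run in this issue; it is shown only as one possible alternative that keeps each transcript with probability 0.1 rather than dividing and rounding. The toy matrix and the name `expression_matrix_thin` are assumptions.

```r
set.seed(42)

# Toy cells x genes count matrix standing in for the real data (assumption).
expression_matrix <- matrix(
  rpois(5 * 4, lambda = 30), nrow = 5,
  dimnames = list(paste0("cell", 1:5), paste0("gene", 1:4))
)

# Approach described in the comment: divide by 10, then round to integers.
# Note that every count below 5 becomes zero this way.
expression_matrix_tr <- round(expression_matrix / 10)

# Alternative (an assumption, not the method used above): binomial thinning,
# which keeps each individual transcript with probability 0.1.
expression_matrix_thin <- matrix(
  rbinom(length(expression_matrix), size = expression_matrix, prob = 0.1),
  nrow = nrow(expression_matrix),
  dimnames = dimnames(expression_matrix)
)
```

Dividing and rounding is deterministic and removes sampling noise, whereas thinning keeps the count noise expected at lower depth, so the two may affect cell typing differently.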

image

This dataset has 22K cells, yet running it at the 'lower quality' impacts cell typing. Hypendymal went from 75 (not shown) to 395 cells. Microglia halved. Myelin Forming Oligodendrocytes also doubled, from 1800 to approx. 3500. Although I do think that some cell type switches might be within the same 'group' of cells (e.g. CA1/CA2) and that the subtypes might be very similar, I am currently not sure how to proceed with InSituType and would like to see some clarification on its variability or robustness.

patrickjdanaher commented 1 month ago

Great investigation, and thanks for the thorough report. I don't have a tidy solution, but I do have some thoughts:

Specific to your results above:

roanvanscheppingen commented 1 month ago

Thanks for getting back to this! Let me answer point by point.

Currently, we are considering whether the quality is actually high enough to proceed... We do see some separation on UMAP of the 'supergroups' and even of cell types, but not the 'perfect' islands that UMAPs of other CosMx experiments can produce. However, if we can't reliably call 50% of the data, then what are we left with? I would say the hypendymals do not show clear separation based on nCountRNA.

image image

Here is the supervised flightpath (on non-proseg data, to rule out Proseg as a confounder): again 50% hypendymal.

image

In the downsampled analysis, neg was indeed also divided by 10.

patrickjdanaher commented 1 month ago

Thanks - very informative.

One more QC would be to look at the mean expression profile of the false hypendymal cells - I'll speculate that it will reveal a flat profile with no obvious markers, which is a hard case to overcome.
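A minimal sketch of that check in base R, assuming a cells x genes count matrix and a vector of assigned cell types. The toy data and the names `counts`, `cell_type`, `mean_profile`, and `other_profile` are hypothetical, and all cells assigned Hypendymal are used here as a stand-in for the "false" ones.

```r
set.seed(7)

# Toy cells x genes count matrix and cell type calls (assumptions).
counts <- matrix(
  rpois(6 * 5, lambda = 3), nrow = 6,
  dimnames = list(paste0("cell", 1:6), paste0("gene", 1:5))
)
cell_type <- c("Hypendymal", "Hypendymal", "Microglia",
               "Hypendymal", "Microglia", "Astrocyte")

is_hyp <- cell_type == "Hypendymal"

# Mean expression per gene among cells called Hypendymal, and among the rest.
mean_profile  <- colMeans(counts[is_hyp, , drop = FALSE])
other_profile <- colMeans(counts[!is_hyp, , drop = FALSE])

# Genes sorted by mean expression; a flat profile would show similar values
# throughout and no genes standing out against the other cells.
sort(mean_profile, decreasing = TRUE)
```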

One last thought: the UMAPs you show above are among the least differentiated I've seen in CosMx data. I recommend working with Support to improve the signal you're getting. With some troubleshooting, I think you should expect far better-differentiated single cell profiles from future datasets.

roanvanscheppingen commented 1 month ago

Thank you for the discussion.

A few more points of consideration.

Celltyping without the hypendymal column just shifts the problem to a different celltype (Olfactory ensheathing).
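Dropping one cell type from a reference could look like the sketch below, assuming the reference is a genes x cell types matrix with cell type names as column names. The toy `ref_profiles` matrix is hypothetical and only stands in for `MouseBrain_profiles`.

```r
set.seed(3)

# Toy genes x cell types reference matrix (assumption).
ref_profiles <- matrix(
  runif(4 * 3), nrow = 4,
  dimnames = list(paste0("gene", 1:4),
                  c("Hypendymal", "Microglia", "Astrocyte"))
)

# Remove the Hypendymal column; drop = FALSE keeps the result a matrix.
ref_no_hyp <- ref_profiles[, colnames(ref_profiles) != "Hypendymal",
                           drop = FALSE]
colnames(ref_no_hyp)
```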

For now, we will invest time in cell typing according to the Materials and Methods of this paper: https://www.cell.com/cell-reports/fulltext/S2211-1247(24)00544-8#secsectitle0075