Open roanvanscheppingen opened 1 month ago
Thank you for contacting us about our tools! To receive assistance, kindly email support@nanostring.com with detailed information about your issue. If applicable, attach a screenshot of any encountered errors and include a copy of the modified script in Notepad. Our customer support team will help facilitate a review and resolution of the issue.
Thank you for choosing NanoString, NanoString Dev Team
Extra information. Numbers stated above are from the supervised clustering using the MouseBrain reference.
When performing semi-supervised clustering:
semi_sup_clust <- InSituType::insitutype(x = expression_matrix_tr, neg = MeanNegativeProbes, reference_profiles = MouseBrain_profiles, n_clusts = 5:15, update_reference_profiles = FALSE, max_iters = 5)
Dataset A goes from 5870 Hypendymal to 2031 hypendymal, but 18580 cells out of ~25K now fall within 'newly' created clusters and take up the majority of the data. Extensive filtering has been performed (>35 features per cell, MeanNeg <0.1, scrublet doublet removal).
Dataset B goes to 1549 Hypendymal instead of the 27393, but also here roughly 50% of data ends up in 'new' clusters. Here filtering is almost the same, but >20 features per cell.
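For readers who want to reproduce the per-cell filtering described above, here is a minimal base-R sketch. It assumes the counts matrix is cells x genes (as passed to `insitutype`) and that the mean negative-probe count is stored as one value per cell. The toy values and the names `counts`, `mean_neg`, `min_features` and `max_mean_neg` are made up for illustration, and the scrublet doublet step is not covered.

```r
# Toy data (made up): counts is cells x genes; mean_neg is the mean
# negative-probe count per cell.
counts <- matrix(c(3, 0, 2, 1,
                   0, 0, 1, 0,
                   4, 2, 5, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("cell", 1:3), paste0("gene", 1:4)))
mean_neg <- c(cell1 = 0.05, cell2 = 0.02, cell3 = 0.30)

# Hypothetical thresholds for the toy data; the post uses >35 features
# (dataset A) or >20 features (dataset B) and MeanNeg < 0.1.
min_features <- 2
max_mean_neg <- 0.1

# Number of detected genes (non-zero counts) per cell.
n_features <- rowSums(counts > 0)

# Keep cells that pass both thresholds, and subset both objects the same way.
keep <- n_features > min_features & mean_neg < max_mean_neg
counts_filt <- counts[keep, , drop = FALSE]
mean_neg_filt <- mean_neg[keep]
```

Subsetting the counts and the negative-probe vector with the same `keep` index keeps the two inputs aligned cell by cell.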
Removing Malat1 from the matrix before running InSituTypeML only lowers the Hypendymal cells from 27393 to 26395.
This is done on a proseg resegmented dataset, but we know that proseg doesn't really change the distribution of transcripts in cells, and we have seen up to 50% allocation into new clusters in the original flatfiles provided by Nanostring before (data not here).
Currently, this means that InSituType is giving hard to interpret results, in a setting that is supposed to be the easiest implementation, with the same system, markers and even a well profiled reference. Please let me know if I can provide you with files for investigation. See attached Flightplot below (dataset B)
I did some further investigation, since we suspect a strong link between data quality and celltyping abilities.
Here you see the different cell types assigned by InSituTypeML (cohorting included, proseg analysed; I'll also rerun it on non-proseg data). As discussed before, Hypendymal cell types are severely overrepresented (20% of data), but upon further inspection we also think radial glia (903 out of 25K) and pericytes (808 out of 25K) are overrepresented.
This is further supported by a small heatmap showing a few marker genes, some of which should be more 'glial' or 'astrocyte' markers (Ptgds, Cryab and Apod). Yet they go predominantly into oligodendrocytes. In general, we see higher expression of most genes in oligodendrocytes anyway.
To further confirm this, I took a publicly available dataset from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM8199188 , which is of impressive quality. I ran InSituType ML (no cohorts), then I reduced the expression matrix by a factor of 10. It is now in the same range as our own dataset (which passed Nanostring QC and is even filtered more stringently).
expression_matrix_tr <- expression_matrix / 10
I rounded the matrix to get back to integer counts:
expression_matrix_tr <- round(expression_matrix_tr)
And reran InSituType with the same settings. Left is celltypes by frequency of 1500 transcripts / cell (mean), right is the downscaled matrix.
This dataset has 22K cells, yet running it on the 'lower quality' version impacts celltyping. Hypendymal went from 75 (not shown) to 395 cells. Microglia halved. Myelin Forming Oligodendrocytes also doubled, from 1800 to approx. 3500. Although I do think that some cell type switches might be within the same 'group' of cells (e.g. CA1/CA2) and that the subtypes might be very similar, I am currently not sure how to proceed with InSituType and would like some clarification on its variability or robustness.
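As a side note on the downscaling step (divide by 10, then round): rounding sets every count below 5 to zero, so lowly expressed genes disappear rather than becoming sparse. One alternative, not used in this thread and shown here only as a sketch, is binomial thinning, where each transcript is kept with probability 0.1. The toy matrix and the names `counts`, `scaled` and `thinned` are assumptions for illustration.

```r
# Toy counts matrix (cells x genes); values are simulated for illustration.
set.seed(1)
counts <- matrix(rpois(20, lambda = 8), nrow = 4,
                 dimnames = list(paste0("cell", 1:4), paste0("gene", 1:5)))

# Approach from the post: divide by 10, then round back to integers.
scaled <- round(counts / 10)

# Alternative sketch: keep each transcript independently with probability 0.1.
# rbinom() draws one value per matrix entry, so the result is reshaped
# back into the original cells x genes layout.
thinned <- matrix(rbinom(length(counts), size = as.vector(counts), prob = 0.1),
                  nrow = nrow(counts), dimnames = dimnames(counts))

# Compare the total counts retained by each approach.
totals <- c(original = sum(counts), scaled = sum(scaled), thinned = sum(thinned))
totals
```

Whichever approach is used, the negative-probe values would need to be reduced by the same factor so that the background estimate stays consistent with the counts.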
Great investigation, and thanks for the thorough report. I don't have a tidy solution, but I do have some thoughts:
Specific to your results above:
Thanks for getting back to this! Let me answer point by point.
The fine-grained subclassification might be hindered by the lower quality data, but if I took the higher hierarchies I would end up with ependymal cells. This is still a rare celltype and not what we expect in the dataset. (We expect the hypendymals to actually be oligodendrocytes.) This is also what we see when we remove the "Hypendymal" column from the reference profile (although that's advised against, I think). When we remove Hypendymal, the problem just shifts to a new cell population (radial glia).
Yes, our dataset could be described as quite "flat", and this might be partly due to senescence (it was an irradiation experiment), but we do not see senescence markers popping up massively.
With regard to overdigestion: sample prep was performed fresh and in Seattle and did pass Nanostring's QCs. However, we have little wiggle room left to subset even more stringently.
I'll try the updating of the ref profiles and rescaling (in 2.0), however the QC post cell typing is not possible. Actually the hypendymal cells are called with the highest confidence of them all...
Currently, we are considering whether the quality is actually high enough to proceed... We do see some separation on UMAP of the 'supergroups' and even of celltypes, but not the 'perfect' islands other UMAPs of different Cosmx experiments can produce. However, if we can't reliably call 50% of the data, then what are we left with? The hypendymals do not show clear separation based on nCountRNA I would say.
Here the supervised flightpath (on non-proseg data to rule out the confounding of Proseg being of influence here), again 50% hypendymal.
In the downsampled analysis, the neg values were indeed also divided by 10.
Thanks - very informative.
One more QC would be to look at the mean expression profile of the false hypendymal cells - I'll speculate that it will reveal a flat profile with no obvious markers, which is a hard case to overcome.
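The mean-profile check suggested above could be sketched in base R as follows. This assumes a cells x genes counts matrix and a vector of assigned labels (for InSituType output, the `clust` element of the result). The toy values, the gene names and the pseudocount of 0.1 are made up for illustration.

```r
# Toy data (made up): counts is cells x genes, clust holds one label per cell.
counts <- matrix(c(5, 4,  6, 5,
                   6, 5,  5, 4,
                   0, 1, 20, 2,
                   1, 0, 18, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("cell", 1:4),
                                 c("geneA", "geneB", "geneC", "geneD")))
clust <- c("Hypendymal", "Hypendymal", "Astrocyte", "Astrocyte")

# Mean expression per gene inside the cluster of interest and in all other cells.
in_clust <- clust == "Hypendymal"
mean_in  <- colMeans(counts[in_clust, , drop = FALSE])
mean_out <- colMeans(counts[!in_clust, , drop = FALSE])

# Ratio of inside to outside mean; the pseudocount avoids division by zero.
fold <- (mean_in + 0.1) / (mean_out + 0.1)

# Genes ranked by enrichment in the cluster.
sort(fold, decreasing = TRUE)
```

If the top-ranked genes show only modest ratios and no known markers, that would be consistent with the flat profile speculated about above.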
One last thought: the UMAPs you show above are among the least differentiated I've seen in CosMx data. I recommend working with Support to improve the signal you're getting. With some troubleshooting, I think you should expect far better-differentiated single cell profiles from future datasets.
Thank you for the discussion.
A few more points of consideration.
Celltyping without the hypendymal column just shifts the problem to a different celltype (Olfactory ensheathing).
Currently, we will invest time to celltype according to the M&M from this paper https://www.cell.com/cell-reports/fulltext/S2211-1247(24)00544-8#secsectitle0075
Currently, we are experiencing extreme overrepresentation of Hypendymal cells after running InSituType. It seems to be the go-to celltype when the data is shallow. Although the data has passed Nanostring's QC metric (minimum 100 transcripts / cell), the total numbers are quite low.
We have the suspicion that, with the highest expressing markers for Hypendymal cells being Mbp, Apoe and Malat1, this biases the celltyping significantly. Mbp, Apoe and Malat1 are amongst the highest detected genes in (most) brain datasets out there. Now we have 27393 out of 48597 cells assigned 'Hypendymal', while hypendymal cells are not so prevalent in the brain and should rather be considered a more 'rare' celltype.
Another dataset with a bit higher quality, but the same brain regions, still has 5870 out of 25183 cells assigned hypendymal. The ideal test would be a dataset which would be iteratively subsetted to the higher quality cells and then the % hypendymal could be plotted.
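The subsetting test proposed above could be sketched like this. It reuses existing cell type labels and only changes which cells are kept; it does not re-run InSituType on each subset, which would be the stricter version of the test. The toy values and the names `total_counts`, `clust` and `thresholds` are assumptions for illustration.

```r
# Toy inputs (made up): total transcripts per cell and the assigned cell type.
total_counts <- c(40, 60, 90, 120, 150, 200, 300, 450, 600, 900)
clust <- c("Hypendymal", "Hypendymal", "Hypendymal", "Oligo", "Hypendymal",
           "Oligo", "Astro", "Oligo", "Astro", "Oligo")

# Minimum-count thresholds to test; each keeps only cells at or above it.
thresholds <- c(0, 100, 200, 400)

# Percentage of retained cells labelled Hypendymal at each threshold.
pct_hyp <- sapply(thresholds, function(th) {
  keep <- total_counts >= th
  100 * mean(clust[keep] == "Hypendymal")
})
names(pct_hyp) <- thresholds
pct_hyp

# Plot the trend when running interactively.
if (interactive()) {
  plot(thresholds, pct_hyp, type = "b",
       xlab = "Minimum transcripts per cell", ylab = "% Hypendymal")
}
```

A percentage that falls steadily as the threshold rises would support the idea that the Hypendymal calls are driven by shallow cells.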