Teichlab / bin2cell

Join subcellular Visium HD bins into cells
MIT License

Adjusting for spatial connectivities for CellTypist cell type prediction? #1

Closed Rafael-Silva-Oliveira closed 3 days ago

Rafael-Silva-Oliveira commented 6 days ago

Hello!

First, I'd like to congratulate the team for the great paper - seems very promising! I will be giving it a test this week and I'd like to ask if adjusting the raw count matrix for Visium HD using the spatial connectivities (calculated from squidpy or liana, for example) would perhaps give better results on the CellTypist prediction? I ask this because although you are collecting some spatial information from the tissue morphology, it would be nice to see some sort of spatial information being passed onto the cell type prediction.

Thank you!

nadavyayon commented 4 days ago

Hi! Thank you very much for testing B2C!

Could you elaborate on what you are suggesting? Are you looking to solve potential homogeneity issues? Happy to discuss further.

Rafael-Silva-Oliveira commented 4 days ago

Hey, thank you for making such a neat tool!

I was just thinking of ways to add some spatial information to the count matrix, and I've tried adjusting the raw count matrix with the spatial connectivities (using 6 neighbors), just to "increase" some of the values for genes that might be markers for a given cell type, hoping that this will help the model (CellTypist) predict the cell type better

For example, the expression of some genes with the matrix adjusted with spatial context (spatial connectivities): notion_1

and without: notion_2

You can see that with the adjustment, the values become more "highlighted" and maybe giving the model this "spatial context" would help with the predictions

nadavyayon commented 4 days ago

Hey! So first of all you are welcome to try and see if this improves predictions (we are working on the other issues you raised ;) ).

Just to be clear, are you adding to each 2um spot the counts of its 6 neighboring spots?

If so, then this will act as a filter and will also introduce more noise from neighboring cells.

Rafael-Silva-Oliveira commented 4 days ago

I'm doing the dot product of the raw count matrix with the connectivities calculated as per this method: https://squidpy.readthedocs.io/en/stable/api/squidpy.gr.spatial_neighbors.html

I'm doing this with the assumption that spots (bins) around a given cell type will be more likely to be of the same cell type (e.g. because of cell adhesion processes, etc). So, adjusting the count matrix with these connectivities could help highlight some of the genes that can act as markers for a given cell type and possibly help with the predictions, but I'm also not entirely sure about this, just thinking out loud :)

Here's another example from the old visium technology:

Without

image

With adjustment

image

But you're right that it could also add more noise from neighboring cells
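The adjustment described in this comment can be sketched on toy data. This is only an illustration under stated assumptions: the counts and the one-dimensional neighbour layout are made up, and the connectivity matrix is built by hand rather than by squidpy (with squidpy, the analogous matrix is stored in `adata.obsp["spatial_connectivities"]` after `sq.gr.spatial_neighbors`).

```python
import numpy as np
from scipy import sparse

# Toy raw counts: 5 bins in a row, 2 genes. All values are made up.
counts = sparse.csr_matrix(np.array([
    [0, 1],
    [0, 0],
    [4, 0],
    [0, 0],
    [0, 2],
], dtype=np.float32))
n_bins = counts.shape[0]

# Binary connectivities: bin i is connected to bins i-1 and i+1 (no self-loops).
conn = sparse.diags([1, 1], offsets=[-1, 1], shape=(n_bins, n_bins),
                    format="csr", dtype=np.float32)

# "Dot product with the connectivities": each bin becomes the sum of its neighbours.
neighbour_sum = conn @ counts

# If the bin's own counts should be kept too, add the identity to the connectivities.
self_plus_neighbours = (conn + sparse.identity(n_bins, format="csr",
                                               dtype=np.float32)) @ counts

print(neighbour_sum.toarray())
print(self_plus_neighbours.toarray())
```

Note that without a diagonal in the connectivity matrix, the plain product drops each bin's own counts (the bin with 4 counts ends up with 0), so whether the result is the neighbour sum alone or the bin plus its neighbours depends on how the connectivities were built.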

nadavyayon commented 4 days ago

I see! So indeed, my intuition is that this would not be beneficial, for the reasons you plotted very nicely above: it acts as a spatial filter. In fact, it's very similar to the 10X binning strategy with 8um (4x4) bins, which was suboptimal in all the cases we tested. Your approach is better in that you don't lose the original resolution, but in the end it's a filter, and filters are good for some things and bad for others. But again, this is worth a shot! Maybe with KNN=4, as this is a square grid, as opposed to Visium SD, which is hexagonal.
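A minimal sketch of the "spatial filter" point, assuming a square grid where each bin has up to 4 neighbours (up, down, left, right). The helper name `grid_adjacency_4` and the toy marker values are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import sparse

def grid_adjacency_4(n_rows: int, n_cols: int) -> sparse.csr_matrix:
    """Binary adjacency of a square grid (row-major bin order, 4 neighbours)."""
    def path(n):
        # Path graph: node i is linked to i-1 and i+1.
        return sparse.diags([1, 1], offsets=[-1, 1], shape=(n, n), format="csr")
    # Left/right links within each row plus up/down links between rows.
    horizontal = sparse.kron(sparse.identity(n_rows), path(n_cols))
    vertical = sparse.kron(path(n_rows), sparse.identity(n_cols))
    return (horizontal + vertical).tocsr()

n_rows, n_cols = 5, 5
adj = grid_adjacency_4(n_rows, n_cols)
degree = np.asarray(adj.sum(axis=1)).ravel()  # 4 inside, 3 on edges, 2 in corners

# One isolated bin expressing a marker, e.g. a single low-count cell type.
marker = np.zeros((n_rows, n_cols))
marker[2, 2] = 10.0
x = marker.ravel()

# Mean over each bin and its neighbours: a simple smoothing filter.
smoothed = ((adj @ x) + x) / (degree + 1)
smoothed_grid = smoothed.reshape(n_rows, n_cols)
print(smoothed_grid)
```

In this toy case the isolated signal is diluted from 10 to 2 in its own bin and leaks into the four adjacent bins at the same level, which is the kind of blurring a neighbourhood filter produces.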

Rafael-Silva-Oliveira commented 4 days ago

Sounds good!

I was testing the predictions before and after:

With count matrix adjusted for the connectivities: image

Distribution:

predicted_labels
SCLC-N          157847
Basal           113718
Macrophage       63407
Neuroendocrine   46907
Fibroblast       45674
Mucinous         44145
SCLC-A           39837
Ionocyte         35562
T cell           16728
Endothelial       8961
Plasma cell       7525
B cell            7320
AEP               5125
Mast              3918
Club              3353
DC                2319
AE1               1704
Neutrophil         815
Ciliated           470
SCLC-P              85
Tuft                51

Without adjustment (just using the count matrix with the lognormalization steps required by celltypist): image

Distribution:

SCLC-N          274448
T cell           91830
SCLC-A           81964
Macrophage       77229
Fibroblast       21376
Basal            13342
Neuroendocrine   10548
Ionocyte         10187
Endothelial       6587
Mucinous          5691
B cell            2813
DC                2711
AEP               2118
Mast              1717
AE1               1298
Ciliated           616
SCLC-P             490
Plasma cell        199
Club               183
Neutrophil         117
Tuft                 7

I'm not entirely sure which is best, to be fair :) I'm also not sure the 91k T cell predictions from the raw count matrix would be correct.
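To make the two label distributions above easier to compare, here is a small sketch that turns the posted counts into per-label shares. The counts are copied from this comment; the variable names are just illustrative.

```python
# Predicted label counts copied from the two distributions in this comment.
adjusted = {
    "SCLC-N": 157847, "Basal": 113718, "Macrophage": 63407, "Neuroendocrine": 46907,
    "Fibroblast": 45674, "Mucinous": 44145, "SCLC-A": 39837, "Ionocyte": 35562,
    "T cell": 16728, "Endothelial": 8961, "Plasma cell": 7525, "B cell": 7320,
    "AEP": 5125, "Mast": 3918, "Club": 3353, "DC": 2319, "AE1": 1704,
    "Neutrophil": 815, "Ciliated": 470, "SCLC-P": 85, "Tuft": 51,
}
unadjusted = {
    "SCLC-N": 274448, "T cell": 91830, "SCLC-A": 81964, "Macrophage": 77229,
    "Fibroblast": 21376, "Basal": 13342, "Neuroendocrine": 10548, "Ionocyte": 10187,
    "Endothelial": 6587, "Mucinous": 5691, "B cell": 2813, "DC": 2711, "AEP": 2118,
    "Mast": 1717, "AE1": 1298, "Ciliated": 616, "SCLC-P": 490, "Plasma cell": 199,
    "Club": 183, "Neutrophil": 117, "Tuft": 7,
}

total_adjusted = sum(adjusted.values())
total_unadjusted = sum(unadjusted.values())

# Share of bins assigned to each label, sorted by the unadjusted share.
print(f"{'label':15s} {'unadjusted':>10s} {'adjusted':>10s}")
for label in sorted(unadjusted, key=unadjusted.get, reverse=True):
    share_raw = unadjusted[label] / total_unadjusted
    share_adj = adjusted[label] / total_adjusted
    print(f"{label:15s} {share_raw:10.2%} {share_adj:10.2%}")
```

Both distributions sum to the same number of bins, so the shares are directly comparable; shares alone do not say which set of predictions is more accurate.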

The adjusted predictions do seem to follow some of the expression of cell markers for cell types such as Fibroblasts. For reference, this would be the average of the lognorm for the cell markers for Fibroblasts in this dataset:

Fibroblasts_cluster Fibroblasts_Mean_LogNorm_Conn_Adj

You can see that the brown pattern in the connectivity-adjusted predictions matches the expression of the fibroblast cell markers more closely, which possibly indicates that this adjustment helps CellTypist predict the cell type better (using the marker expression as a "guide" for where we should see a given cell type). But I'm not sure which approach would be best to use!

In essence, what I'm trying to do here is to "spike up" the expression of spots/bins using the connectivities of surrounding cells, hoping that CellTypist will give better cell type predictions because the model can differentiate cell types better with this "spiked" expression matrix.

You can see this in certain areas of the plots:

Region of cell types before adjusting for connectivities: image

Same region of cell types after adjusting for connectivities:

image

This follows the assumption that cell types of a given lineage tend to co-localize/stay together, but then again, I'm just thinking out loud.

nadavyayon commented 4 days ago

So are you using CellTypist directly on the 2um resolution? I'm surprised that it works, honestly, as the data is very sparse. Your assumption is valid, but again, it can hide cells (especially T cells, which naturally have low counts) that sit inside a layer of other cells.

Rafael-Silva-Oliveira commented 4 days ago

This is using the 8 micron resolution, following your tutorials :)

Indeed, it can certainly hide cell types (just like it does for T cells, reducing from 91k to 16k, but this lower value also seems a bit more reasonable)

I used the 2 micron resolution to test the B2C tool, but I don't have an HE image yet, and the IF dataset isn't great for extracting the cells and making predictions on them. I'm hoping to test this properly once I have my actual HE data to work with (including testing with the spatial context from spatial connectivities).

nadavyayon commented 4 days ago

Oh I see. So no, I would not use this on 8um bins, as this is assuming too much. Btw, we are close to fixing the issue with the IF (will update in the relevant thread), so hopefully you will be able to get the segmentations and compare.

Rafael-Silva-Oliveira commented 4 days ago

The thing is, running predictions at 2 micron, even with CellTypist, is almost impossible due to memory issues:

adata_st, model=sc_model_sclc, majority_voting=False
)

🔬 Input data has 9649299 cells and 18085 genes
🔗 Matching reference genes in the model

🧬 3511 features used for prediction
⚖️ Scaling input data
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/work/RO_src/venv/stnav_venv/lib/python3.10/site-packages/celltypist/annotate.py", line 85, in annotate
    predictions = clf.celltype(mode = mode, p_thres = p_thres)
  File "/mnt/work/RO_src/venv/stnav_venv/lib/python3.10/site-packages/celltypist/classifier.py", line 372, in celltype
    self.indata = (self.indata[:, k_x_idx] - means_) / sds_
  File "/mnt/work/RO_src/venv/stnav_venv/lib/python3.10/site-packages/scipy/sparse/_base.py", line 489, in __sub__
    return self._sub_dense(other)
  File "/mnt/work/RO_src/venv/stnav_venv/lib/python3.10/site-packages/scipy/sparse/_base.py", line 451, in _sub_dense
    return self.todense() - other
  File "/mnt/work/RO_src/venv/stnav_venv/lib/python3.10/site-packages/scipy/sparse/_base.py", line 912, in todense
    return self._ascontainer(self.toarray(order=order, out=out))
  File "/mnt/work/RO_src/venv/stnav_venv/lib/python3.10/site-packages/scipy/sparse/_compressed.py", line 1050, in toarray
    out = self._process_toarray_args(order, out)
  File "/mnt/work/RO_src/venv/stnav_venv/lib/python3.10/site-packages/scipy/sparse/_base.py", line 1267, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 126. GiB for an array with shape (9649299, 3511) and data type float32
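The 126 GiB figure in the error follows directly from the array shape and dtype. A quick check, where the 2 GiB budget is an arbitrary assumption used only to show how many rows would fit in a smaller dense block:

```python
# Shape and dtype taken from the error message above.
n_cells, n_features = 9_649_299, 3_511
bytes_per_value = 4  # float32

dense_bytes = n_cells * n_features * bytes_per_value
dense_gib = dense_bytes / 2**30
print(f"Full dense matrix: {dense_gib:.1f} GiB")

# Hypothetical memory budget for one dense block of rows (assumption: 2 GiB).
budget_bytes = 2 * 2**30
rows_per_block = budget_bytes // (n_features * bytes_per_value)
print(f"Rows that fit in the budget: {rows_per_block}")
```

The scaling step in the traceback subtracts a dense vector of means from the sparse matrix, which forces the whole matrix to be densified at once.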

Is there any workaround on this? Right now only the 8 micron seems to be the most viable solution for any downstream analysis (as also suggested by 10X Visium website) or using Bin2Cell for this; I might wait for the IF fix and then re-test with the B2C + CellTypist approach as it seems to be the best way to go about this, but just curious to know if CellTypist offers a batch type of prediction mode for very large datasets :)
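One generic way around this kind of allocation is to process rows in blocks, so that only a small dense array exists at any time. The sketch below is not CellTypist's API: it assumes a linear classifier with per-feature scaling (mirroring the `(X - means_) / sds_` step in the traceback), and `predict_in_chunks`, `coef`, `intercept`, and `classes` are hypothetical placeholders on random toy data.

```python
import numpy as np
from scipy import sparse

def predict_in_chunks(X, means, sds, coef, intercept, classes, chunk_size=100_000):
    """Scale and score a sparse matrix one block of rows at a time.

    At most `chunk_size x n_features` values are densified at once.
    """
    labels = []
    for start in range(0, X.shape[0], chunk_size):
        block = X[start:start + chunk_size].toarray()  # small dense block
        block = (block - means) / sds                  # per-feature scaling
        scores = block @ coef.T + intercept            # linear decision scores
        labels.append(classes[np.argmax(scores, axis=1)])
    return np.concatenate(labels)

# Toy data standing in for a large sparse count matrix.
rng = np.random.default_rng(0)
dense = rng.poisson(0.2, size=(1000, 20)).astype(np.float32)
X = sparse.csr_matrix(dense)

means = dense.mean(axis=0)
sds = dense.std(axis=0) + 1e-8           # avoid division by zero
coef = rng.normal(size=(3, 20))          # hypothetical model weights
intercept = rng.normal(size=3)
classes = np.array(["A", "B", "C"])      # hypothetical labels

chunked = predict_in_chunks(X, means, sds, coef, intercept, classes, chunk_size=128)
full = classes[np.argmax(((dense - means) / sds) @ coef.T + intercept, axis=1)]
print("agreement with all-at-once prediction:", np.mean(chunked == full))
```

Because each row is scored independently here, the block results can simply be concatenated. With `majority_voting=False`, the same idea might be applied by calling `celltypist.annotate` on row subsets of `adata_st` and concatenating the predicted labels, though whether CellTypist has a built-in batch mode is a question for that project.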

Rafael-Silva-Oliveira commented 3 days ago

Closed, as the previous comment I added is more of a question for CellTypist and is out of scope for Bin2Cell.