Cellular-Semantics / CL_KG

Building a Cell Ontology Knowledge-Base from data, and LLMs
Apache License 2.0

Calculating transcriptomic variance as a measure of granularity for cell types #2

Open dosumis opened 4 months ago

dosumis commented 4 months ago

Status: Draft

Note: Putting this here as a place to park a potentially promising idea.

Background - we have no objective measures of granularity in the Cell Ontology, although we are sometimes asked for this, especially in the context of single cell genomics analysis.

One way to measure this would be to calculate transcriptomic variance across annotated data in CxG (CELLxGENE), using closure over the CL hierarchy (a term plus its subtypes) to generate a matrix per cell type from CxG Census. A potential way to do this:

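import numpy as np
from scipy.stats import entropy

# Assuming `expression_matrix` is your scRNA-seq count matrix as a dense NumPy array (genes x cells)
# Drop genes with zero total counts; normalizing them would divide by zero and yield NaN
expressed = expression_matrix[expression_matrix.sum(axis=1) > 0]

# Normalize each gene's counts across cells to get one probability distribution per gene
probability_matrix = expressed / np.sum(expressed, axis=1, keepdims=True)

# Calculate Shannon entropy for each gene (one value per row)
gene_entropies = entropy(probability_matrix, axis=1)

# `gene_entropies` now contains the Shannon entropy for each expressed gene

# We can then calculate the meta-entropy for the whole matrix.
# Note: `entropy` rescales its input to sum to 1, so the per-gene entropies
# are treated here as a distribution over genes.

meta_entropy = entropy(gene_entropies)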

Alternatively we could use the median of all entropies. Someone much more expert than me would need to comment on which might be more appropriate.
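
To make the two candidate summaries easier to compare, here is a minimal sketch that computes both from the same per-gene entropies. The function name `summarise_gene_entropies` and the toy matrix are illustrative assumptions, not part of any existing pipeline.

```python
import numpy as np
from scipy.stats import entropy


def summarise_gene_entropies(expression_matrix):
    """Return (meta_entropy, median_entropy) for a dense genes x cells array."""
    counts = np.asarray(expression_matrix, dtype=float)
    # Keep only genes with at least one count, so normalization is defined
    expressed = counts[counts.sum(axis=1) > 0]
    # scipy's entropy rescales each row to sum to 1 before computing entropy
    gene_entropies = entropy(expressed, axis=1)
    return entropy(gene_entropies), np.median(gene_entropies)


# Toy example (hypothetical data): 3 genes x 4 cells
toy = np.array([
    [5, 5, 5, 5],   # evenly expressed: entropy = ln(4)
    [10, 0, 0, 0],  # expressed in a single cell: entropy = 0
    [0, 0, 0, 0],   # never detected: dropped before scoring
])
meta, median = summarise_gene_entropies(toy)
```

One property that may matter for the choice: because `entropy` rescales its input, the meta-entropy is unchanged if every gene's entropy is multiplied by the same constant. It reflects how evenly entropy is spread across genes, whereas the median tracks the typical level of per-gene entropy.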

dosumis commented 4 months ago

@AvolaAmg - does this make sense to you? The background is that we can use CxG Census to generate matrices of, for example, everything annotated with T cell and its subtypes. As we go up the CL classification hierarchy, I would expect meta-entropy or median entropy to increase.
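
As a rough sketch of the matrix-generation step, the following uses a toy is_a hierarchy and synthetic counts in place of the real CL closure and a CxG Census query, neither of which is shown here. The names `IS_A`, `subclass_closure` and `matrix_for_term` are hypothetical.

```python
import numpy as np

# Hypothetical toy is_a hierarchy (child -> parents); real terms and
# relations would come from the Cell Ontology itself.
IS_A = {
    "CD4-positive T cell": ["T cell"],
    "CD8-positive T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
}


def subclass_closure(term, is_a):
    """Return `term` plus every term that reaches it via is_a links."""
    closure = {term}
    changed = True
    while changed:
        changed = False
        for child, parents in is_a.items():
            if child not in closure and closure.intersection(parents):
                closure.add(child)
                changed = True
    return closure


def matrix_for_term(term, expression_matrix, cell_labels, is_a):
    """Select the columns (cells) annotated with `term` or any of its subtypes."""
    closure = subclass_closure(term, is_a)
    mask = np.array([label in closure for label in cell_labels])
    return expression_matrix[:, mask]


# Synthetic genes x cells counts with one annotation per cell
labels = ["CD4-positive T cell", "CD8-positive T cell", "B cell", "T cell"]
counts = np.arange(12).reshape(3, 4)

t_cell_matrix = matrix_for_term("T cell", counts, labels, IS_A)
lymphocyte_matrix = matrix_for_term("lymphocyte", counts, labels, IS_A)
```

Each resulting matrix could then be scored with the entropy calculation above. One caveat to check: per-gene entropy over n cells is bounded by ln(n), so matrices with very different cell counts may need equal-sized subsampling or normalization before their scores are compared.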

AvolaAmg commented 4 months ago

Yes, it does. I will need to read up a bit more on meta-entropy versus median entropy and which would be better to use, and likewise on CxG Census, and then I will add some ideas here.