Marker genes in JSON file - Correlation analysis mode

mitsiaskonstanto commented 11 months ago

Hi @danielsf,

I'm creating a new thread here, so we can continue our conversation regarding the marker genes that are used during the mapping process.

I followed your instructions in #10 in order to access those marker genes.

(1) In hierarchical analysis mode I get multiple marker gene lists, each used to discriminate between the children of one parent in the taxonomy tree, as well as a 'None' element that indicates the root of the taxonomy tree. That is clear in general. (2) In correlation analysis mode, though, I only get a 'None' element. Is that expected?

Cheers, Dimitris

danielsf commented 11 months ago

Hi Dimitris,

Yes, that is an expected result.

When you run correlation mapping, the first step the code takes is to flatten the taxonomy tree, i.e. reduce it to a single level so that the cell type clusters are all direct children of the root node. The marker gene lookup table is similarly flattened. There is now only one parent node in the tree ('None'), so all of the markers are attached to that node.
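As an illustration of the flattening described above, here is a minimal Python sketch. The dictionary layout, the `flatten_markers` helper, and the gene names are hypothetical and are not taken from cell_type_mapper. The sketch also assumes that the flattened marker list is simply the union of the per-parent lists.

```python
def flatten_markers(marker_lookup):
    """Collapse a per-parent marker lookup into a single root entry.

    marker_lookup maps a parent node name ('None' for the root) to the
    list of marker genes used to tell that parent's children apart.
    """
    merged = set()
    for genes in marker_lookup.values():
        merged.update(genes)
    # After flattening, the root ('None') is the only parent left.
    return {"None": sorted(merged)}


# Hypothetical hierarchical lookup with made-up gene names.
hierarchical = {
    "None": ["geneA", "geneB"],
    "class/01": ["geneB", "geneC"],
    "subclass/022": ["geneC", "geneD"],
}

flat = flatten_markers(hierarchical)
print(list(flat.keys()))  # ['None']
print(flat["None"])       # ['geneA', 'geneB', 'geneC', 'geneD']
```

With only one key left in the lookup, a JSON output that lists markers per parent would show a single 'None' element, which matches what is observed in correlation mode.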

Cheers,

Scott

mitsiaskonstanto commented 11 months ago

Hi Scott,

Great, thank you for clarifying.

A follow-up question regarding the hierarchical mapping marker genes this time:

The JSON file includes entries like this:

hierarchical_mapping_json$marker_genes["CCN20230722_SUBC/CS20230722_SUBC_022"]
$`CCN20230722_SUBC/CS20230722_SUBC_022`
   [1] "ENSMUSG00000051951" "ENSMUSG00000002459" "ENSMUSG00000033774" "ENSMUSG00000033740" "ENSMUSG00000067879"
   [6] "ENSMUSG00000042501" "ENSMUSG00000048960" "ENSMUSG00000016918" "ENSMUSG00000025776" "ENSMUSG00000025931"
  [11] "ENSMUSG00000026141" "ENSMUSG00000026058" "ENSMUSG00000026077" "ENSMUSG00000050967" "ENSMUSG00000026065"
  [16] "ENSMUSG00000026062" "ENSMUSG00000045515" "ENSMUSG00000008136" "ENSMUSG00000026042" "ENSMUSG00000018417"

While examining the genes listed for each "SUBC" entry, I noticed that a large fraction of them appear in every predicted subclass. When I look for markers that are unique to a single subclass, only a few subclasses have any.
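The overlap check described above can be sketched in Python roughly as follows. The `marker_overlap` helper, the subclass names, and the gene names are made up for illustration. In practice the input dictionary would come from the `marker_genes` entry of the JSON output.

```python
from collections import Counter


def marker_overlap(marker_genes):
    """Split markers into those shared by every parent and those unique to one.

    marker_genes maps a parent node name to its list of marker genes.
    """
    # Count in how many parents each gene appears (once per parent).
    counts = Counter(g for genes in marker_genes.values() for g in set(genes))
    n_parents = len(marker_genes)
    shared = sorted(g for g, c in counts.items() if c == n_parents)
    unique = {
        parent: sorted(g for g in set(genes) if counts[g] == 1)
        for parent, genes in marker_genes.items()
    }
    return shared, unique


# Hypothetical marker lists for three subclasses.
marker_genes = {
    "SUBC_A": ["g1", "g2", "g3"],
    "SUBC_B": ["g1", "g2", "g4"],
    "SUBC_C": ["g1", "g2"],
}

shared, unique = marker_overlap(marker_genes)
print(shared)  # ['g1', 'g2']
print(unique)  # {'SUBC_A': ['g3'], 'SUBC_B': ['g4'], 'SUBC_C': []}
```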

Nevertheless, this does not seem to be a problem for the mapper, since the results I get make sense. Still, it would be great to know how these markers (and which of them) drive the distinction between subclasses, or even supertypes. Why is one subclass label preferred over another when both are defined by very similar marker gene signatures? Is there a level of "importance" for each marker in each "SUBC" that the algorithm takes into account when choosing which labels to assign?

Thank you in advance, Dimitris

danielsf commented 11 months ago

The marker genes used by the on-line MapMyCells app are the product of another research team, so I'm going to have to ask around to see if there is an answer to your question. With the onset of the end-of-year holidays, I probably won't be able to properly respond to this until early 2024. Sorry I can't give you anything more helpful now.

danielsf commented 10 months ago

@mitsiaskonstanto

I just read over your question again and realized I can answer it. There is no importance score that the algorithm uses when assigning classes, subclasses, etc. The data is simply subsampled to include only the marker genes and then correlated against the average gene expression profiles of the clusters in the reference data (again, using only the marker genes). The cluster with the highest correlation coefficient is chosen (i.e. all marker genes are considered equal).
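The unweighted selection described above can be sketched in plain Python as follows. This is a simplified illustration and not the cell_type_mapper implementation: the `pearson` and `assign_cluster` helpers, the data layout, and the gene and cluster names are hypothetical, and the choice of Pearson correlation is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def assign_cluster(cell, cluster_means, markers):
    """Return the cluster whose mean profile correlates best with the cell.

    cell and each cluster mean are dicts of gene -> expression.
    Only genes listed in markers are used, and each counts equally.
    """
    cell_vec = [cell[g] for g in markers]
    scores = {
        name: pearson(cell_vec, [profile[g] for g in markers])
        for name, profile in cluster_means.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up data: geneX is not a marker, so it is ignored.
markers = ["geneA", "geneB", "geneC"]
cluster_means = {
    "cluster_1": {"geneA": 5.0, "geneB": 1.0, "geneC": 0.5, "geneX": 9.0},
    "cluster_2": {"geneA": 0.5, "geneB": 4.0, "geneC": 3.0, "geneX": 0.0},
}
cell = {"geneA": 4.0, "geneB": 1.5, "geneC": 0.2, "geneX": 0.0}

best, scores = assign_cluster(cell, cluster_means, markers)
print(best)  # cluster_1
```

Because the correlation is computed over the whole marker vector, two clusters that share most of their markers can still be separated by how strongly each marker is expressed in their mean profiles.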

The documentation for the cell type assignment algorithm can now be found here.

mitsiaskonstanto commented 10 months ago

Hi Scott,

Happy new year and thank you very much for your response.

Ok, that's totally reasonable then.

I will go through the documentation you have created and let you know if everything is clear.

Cheers,

Dimitris

AllenInstitute / cell_type_mapper

Marker genes in JSON file - Correlation analysis mode #11