AllenInstitute / MOp_taxonomies_ontology

Central location for versioning and sharing of taxonomy files relevant for ontology development as part of cell type cards.

similarity scores only on leaf nodes #19

Open shawntanzk opened 2 years ago

shawntanzk commented 2 years ago

See https://github.com/obophenotype/brain_data_standards_ontologies/pull/281. It seems that the similarity scores are only on leaf nodes. We can represent these scores in the homology relation; however, there are also many homology relations that aren't on leaf nodes. Do we have scores for those too?

jeremymiller commented 2 years ago

I think (and maybe @raymond-sanchez can confirm?) that these scores came from a confusion matrix, which compares each cell type in one species to each cell type in another. If this is true, then we should be able to calculate scores for the one-few or few-few relationships by summing the scores. If someone can provide me with the raw data file that these numbers were generated from, I can see if I can figure it out.
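To illustrate the summing idea, here is a minimal sketch that assumes the scores are raw counts in a leaf-by-leaf confusion matrix. The matrix values, the parent labels (`P1`, `Q1`, and so on), and the function name `aggregate_confusion` are all hypothetical and not taken from the actual data files.

```python
from collections import defaultdict

def aggregate_confusion(counts, row_groups, col_groups):
    """Sum leaf-level confusion counts into parent-level blocks.

    counts[i][j] is the count for leaf i (species A) vs leaf j (species B).
    row_groups[i] and col_groups[j] name the parent node of each leaf.
    """
    agg = defaultdict(int)
    for i, row in enumerate(counts):
        for j, value in enumerate(row):
            agg[(row_groups[i], col_groups[j])] += value
    return dict(agg)

# Hypothetical 3x3 leaf-level confusion matrix (made-up counts).
leaf_counts = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
# Hypothetical parents: the first two leaves in each species share a parent.
species_a_parents = ["P1", "P1", "P2"]
species_b_parents = ["Q1", "Q1", "Q2"]

parent_counts = aggregate_confusion(leaf_counts, species_a_parents, species_b_parents)
for pair, total in sorted(parent_counts.items()):
    print(pair, total)
```

This only works as written for counts. If the matrix holds row-normalized proportions, summing across rows would need weighting by cluster size, which this sketch does not attempt.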

raymond-sanchez commented 2 years ago

That's correct - Nik calculated the scores only for leaf nodes within the same class between species (Glut-Glut, GABA-GABA, etc.), but not between subclasses, classes (I think both of these were assumed to essentially correlate 1-to-1) or intermediate nodes. @jeremymiller relevant files below, let me know how I can help!

Scores:
https://raw.githubusercontent.com/AllenInstitute/MOp_taxonomies_ontology/main/mouseMOp_CCN202002013/Mouse_CrossSpecies_Similarity.csv
https://raw.githubusercontent.com/AllenInstitute/MOp_taxonomies_ontology/main/humanM1_CCN201912131/Human_CrossSpecies_Similarity.csv
https://raw.githubusercontent.com/AllenInstitute/MOp_taxonomies_ontology/main/marmosetM1_CCN201912132/Marmoset_CrossSpecies_Similarity.csv

Code to generate scores (including directions to relevant raw data files): https://github.com/AllenInstitute/celltype_cards_contenthub/blob/main/all_code/cross_species_heatmaps/input%20files/nik_script.R

shawntanzk commented 2 years ago

I'm a bit confused: how is the homology for something like lamp5-like C2, which is an intermediate node but cross-species, calculated or determined if scores are only calculated for leaf nodes? (Screenshot attached, 2022-05-17.)

raymond-sanchez commented 2 years ago

I think those must have been determined in a separate analysis from the one I'm pointing to above, which Nik did mainly for cell type cards. Let me do some digging and get back to you.

jeremymiller commented 2 years ago

Okay, these are scores based on distance matrices and not a confusion matrix, which means we cannot directly sum values in the way I said above. I don't actually know how we'd calculate similarities this way for the other nodes, if it's even possible. I would suggest removing these values for now (or I suppose you could leave them incomplete as is). Ray: you are correct about it being a separate analysis. There were two strategies used to define cross-species homologies, which is a bit confusing. We might need to bring Trygve into this discussion if this is critical, but I'm again going to vote for removing this value for now.

raymond-sanchez commented 2 years ago

Ah sorry, yes I think the original analysis was done with confusion matrices, but this one for cell type cards was Euclidean distances. That sounds good, I'd be okay with removing or leaving incomplete the values that we cannot generate these same scores for.
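A small numeric sketch of why Euclidean distances cannot simply be summed to get a parent-node value. It assumes, purely for illustration, that each leaf is summarized by a 2-D centroid and that a parent node is the unweighted mean of its leaves. The coordinates are made up and this is not how the scores in the CSV files were produced.

```python
import math

# Hypothetical leaf centroids: two leaves in species A, one leaf in species B.
leaf_a1 = (0.0, 0.0)
leaf_a2 = (2.0, 0.0)
leaf_b = (1.0, 1.0)

# Naive approach: add the two leaf-level distances.
summed = math.dist(leaf_a1, leaf_b) + math.dist(leaf_a2, leaf_b)

# Recomputed approach: merge the species A leaves into one parent centroid
# (unweighted mean, an assumption) and measure the distance again.
parent_a = tuple((x + y) / 2 for x, y in zip(leaf_a1, leaf_a2))
recomputed = math.dist(parent_a, leaf_b)

print(f"summed leaf distances: {summed:.3f}")
print(f"parent-level distance: {recomputed:.3f}")
```

The two numbers differ, so a parent-level score of this kind would have to be recomputed from the underlying data rather than derived from the leaf scores.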

shawntanzk commented 2 years ago

Are the leaf node scores (the CSV files @raymond-sanchez linked in the comment above) safe to use? We would like to include examples of how we can annotate confidence in the ontology for the paper, but we definitely don't want to use anything that isn't accurate.

raymond-sanchez commented 2 years ago

I think they're fine to use for this purpose, but Jeremy let me know if you think otherwise. Nik calculated those scores and told me that they were "a clear and more accurate representation of the data" but we could also double check with Trygve if we want to get another look.

jeremymiller commented 2 years ago

They are accurate (e.g., higher is better in a quantitative way) and can be used. Moving forward (and maybe as a topic in workshop #4?) we'll want to think about a more general metric for cell type comparisons within and between taxonomies and how those can be used in an ontology.

shawntanzk commented 2 years ago

Thanks for all the information, that was super useful. We have decided not to add any homology scores for now. I think we do want to in the end, and that might involve a discussion with Trygve, but for now we will leave them out as we want to finalise the manuscript. We will instead add this as a discussion/challenges point (such as discussing how we represent confidence, and whether it should be specific to a dataset, etc.) and maybe say that it will be included in a future release. I will keep this ticket open till we figure it out just so we don't forget about it :) thanks!