Open cmungall opened 4 years ago
It would be great if we could sort out this data modeling issue. In SPOKE, "AnatomyCellType" nodes have heterogeneous IDs. For example, the node "peripheral nerve/ganglion in colon" has the SPOKE ID "UBERON:0001155/CL:0000111". So we are not sure how to formulate a CURIE to feed into the node normalization service to fetch the preferred Biolink ID.
In addition to the above query, it would be great if you could let us know whether there is a way to map the following 4 node types to Biolink entities. Currently I do not find an appropriate Biolink mapping for these node types. (1) SideEffect (2) Food (3) PharmacologicClass (4) Nutrient
Currently I do not find an appropriate biolink mapping for these node types.
We may want to make specific tickets
(1) SideEffect
Would PhenotypicFeature work here?
(2) Food
See #248
(3) PharmacologicClass (4) Nutrient
I'll get to these later; we have an ongoing discussion about roles.
I see. What you are describing is what is called post-composition or post-coordination.
This is often confusing because the ontology will frequently have pre-composed the term you need. For example, the original example was respiratory epithelial cell in the bronchus. This is already coordinated in CL as http://purl.obolibrary.org/obo/CL_0002328, and if you load the ontology KG you will see the part-of to bronchus and the is-a (transitive) to respiratory epithelial cell.
If you don't have the concept pre-composed then we can quickly make one for you. In fact we can easily set up a workflow whereby you give us a 2-column TSV and we give you a CL ID, either existing or newly created.
However, there may still be scenarios where you want to post-compose (CL->anatomy or otherwise). Here is what I recommend:
- this is still a cell type just like any other cell type in CL, so just classify as bl:cell
- create a bl:subClassOf link to the core CL concept (e.g. neuron)
- create a part-of link to the anatomical concept (e.g. colon)
- it's up to you what ID scheme you use. However, I might recommend that we do something like hash the OWL class expression (in this case 'peripheral nerve cell' AND part-of SOME colon).
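A minimal sketch of the recommendation above, assuming the class expression is serialized from the two CURIEs and hashed with SHA-256. The `EXAMPLE` prefix, the 16-character truncation, the dictionary layout, and the exact predicate spellings are illustrative assumptions, not an agreed scheme:

```python
import hashlib


def post_composed_id(cell_curie: str, anatomy_curie: str, prefix: str = "EXAMPLE") -> str:
    """Derive a stable ID by hashing the class expression (hypothetical scheme)."""
    # Serialized form of: <cell type> AND part-of SOME <anatomy>
    expression = f"{cell_curie} AND part-of SOME {anatomy_curie}"
    digest = hashlib.sha256(expression.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}:{digest}"


def post_composed_node(cell_curie: str, anatomy_curie: str):
    """Build the node plus the two links described in the bullets above."""
    node_id = post_composed_id(cell_curie, anatomy_curie)
    # Still a cell type like any other, so it is categorized as a cell.
    node = {"id": node_id, "category": "biolink:Cell"}
    edges = [
        # Link to the core CL concept.
        {"subject": node_id, "predicate": "biolink:subclass_of", "object": cell_curie},
        # Link to the anatomical concept.
        {"subject": node_id, "predicate": "biolink:part_of", "object": anatomy_curie},
    ]
    return node, edges


node, edges = post_composed_node("CL:0000111", "UBERON:0001155")
```

Because the ID depends only on the two CURIEs, the same pair always yields the same ID, so two providers composing the same expression independently would agree on the node.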
There is a lot of literature on this topic; I will summarize some of it here later. In general I would recommend pre-composition for your scenario.
Thank you Chris for the reply. This is indeed helpful. Pre-composition seems to be a pretty elegant solution. I have two queries about the pre-composition workflow that you mentioned.
(1) Firstly, regarding the two-column TSV that you mentioned, I presume each column is the ID of a particular class. For example, for "UBERON:0001155/CL:0000111", one column will be "UBERON:0001155" and the other column will be "CL:0000111". Please correct me if this is not the case.
(2) Where can I pursue the workflow that you mentioned? Is it some sort of pre-composition service or any python code?
It would be great if you can clarify these.
Thank you.
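For reference, the split described in (1) could be sketched as follows. This assumes every SPOKE composite ID is exactly one UBERON CURIE and one CL CURIE joined by "/"; the column names `anatomy_id` and `cell_type_id` are made up for illustration:

```python
import csv
import io


def split_composite_id(composite_id: str):
    """Split e.g. 'UBERON:0001155/CL:0000111' into (anatomy CURIE, cell type CURIE)."""
    parts = composite_id.split("/")
    if len(parts) != 2:
        raise ValueError(f"expected exactly two CURIEs in {composite_id!r}")
    # Pick each part by prefix so the order inside the composite ID does not matter.
    anatomy = next((p for p in parts if p.startswith("UBERON:")), None)
    cell = next((p for p in parts if p.startswith("CL:")), None)
    if anatomy is None or cell is None:
        raise ValueError(f"expected one UBERON and one CL CURIE in {composite_id!r}")
    return anatomy, cell


def to_tsv(composite_ids) -> str:
    """Write the two-column TSV, one row per composite ID."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["anatomy_id", "cell_type_id"])  # hypothetical header names
    for composite_id in composite_ids:
        writer.writerow(split_composite_id(composite_id))
    return buffer.getvalue()


print(to_tsv(["UBERON:0001155/CL:0000111"]))
```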
I can coordinate getting the cell types pre-composed in CL. Any format will do. If you have a list already I'll take a look to see if they are appropriate for pre-composition. There are no hard and fast rules here. But generally "kidney macrophage" is a reasonable term, whereas "epithelial cell of left pinky" is not.
Thanks Chris for the reply. I haven't made the list yet, but I can make it and am more than happy to send it to you. We were thinking of running this pre-composition process on the "AnatomyCellType" SPOKE nodes during the weekly update of SPOKE. In that case, can you please let us know if there is a way to automate the pre-composition process?
Does anyone know if there is a way to automate the pre-composition process?
As per the discussion we had during the data modelling meeting (held on Sept-09-2021), I am stating here the broader context in which we use the "AnatomyCellType" node type in our knowledge graph. We connect "AnatomyCellType" with "Gene" using the "AnatomyCellType-expresses-Gene" edge type. This edge represents gene expression in a specific cell type in a specific tissue, taken from the Human Protein Atlas. Apart from that, we also connect "AnatomyCellType" nodes to their respective "Anatomy" and "CellType" nodes (using "AnatomyCellType-isin-Anatomy" and "AnatomyCellType-isin-CellType").
We discussed this today during the Help Desk call.
@karthiksoman - any chance you have a file for this yet?
Hi Sierra. I am attaching the AnatomyCelltype file from SPOKE to post-compose to Cell Ontology, as we discussed. Please let me know if there is anything else that needs to be done from our side: SPOKE_AnatomyCelltype_file_to_postcompose_Nov_18_2021.csv.
@karthiksoman - Have we done the resolution necessary here? Or are you expecting more terms to be added/submitted?
@sierra-moxon Thanks for the reminder. The last update from my end was consolidating all the Anatomy-Cell type nodes in SPOKE and sharing them with you, so that they could be post-composed to Cell Ontology. From my end, there are no further additions. May I know if that is now post-composed in Cell Ontology?
Hi all. I want to revisit this decision to pre-compose anatomy-specific cell type classes in light of recent work toward capturing more statement semantics in qualifiers. One of the guiding principles we established for this work was to not create dependencies on external ontologies when a concept can be represented using post-composition with subject/object qualifiers. For example, we decided (with Chris's recommendation) that we would represent exposures to some entity using the pattern Entity (qualifier:'Exposure'), rather than use or submit the term to an ontology like ECTO. Same for things like 'severe bleeding' (post-compose, rather than request a term from HP), or 'late stage ebola' (post-compose, rather than submit a term request to a disease ontology). Similarly, the pattern that was proposed for anatomy-specific cell types was to use the existing CL term for the anatomy-agnostic cell type as the S/O node IRI, and use a subject_location qualifier to capture an Uberon term indicating the anatomical context of this cell type - e.g. Macrophage (qualifier:kidney).
@cmungall Your recommendation here to add pre-composed terms to CL and use these as subject node IRIs seems to go against this principle. Perhaps you see something different about the anatomy-specific cell type use case that makes pre-composing acceptable here (e.g. ease of getting into the relevant ontology, where there is precedent for these types of classes). But I just want to be careful that we are as principled and consistent as possible in how we represent a given type of semantic in our models. For example, we have already encountered other cases where we want to constrain the anatomical context of other types of entities (biological processes, molecular activities, medical procedures), and here we resorted to post-composing (as it was deemed not appropriate to create anatomy-specific classes in an existing ontology in these cases). What are the implications if we allow for anatomical context to be pre-composed for cell types, but not other types of entities/concepts? Having SPOKE do this means that pre-composition becomes the standard way to represent anatomy-specific cell types - so other data providers also need to go through the process of requesting terms from CL and waiting for IRIs to be minted.
I am happy to revisit our earlier principles/decisions - and consider if we might allow for an approach where we use pre-composition in some cases but not others - if we can find some principled rationale to guide such decisions. Maybe this is something we want to discuss on a DM call soon - as we are collecting more and more use cases where there is a potential for pre-composition - and we need to establish some clear rules for when to do this. And also consider how we might create/use tools to translate between pre- and post-composed representations - if this provides a solution to the problem.
Thoughts @cmungall @sierra-moxon @mikebada?
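To make the two representations concrete, here is a rough sketch of the qualifier pattern and of a translation step from post-composed to pre-composed form. The key name `subject_location_qualifier`, the `biolink:expresses` predicate choice, the lookup table, and the placeholder gene and CL IDs are all assumptions for illustration; no such translation tool is described in this thread:

```python
def to_qualified_edge(composite_id: str, gene_id: str) -> dict:
    """Post-composed form: the CL term is the subject, the Uberon term is a qualifier."""
    # Assumes the SPOKE convention 'UBERON:.../CL:...' (anatomy first, cell type second).
    anatomy_curie, cell_curie = composite_id.split("/")
    if not (anatomy_curie.startswith("UBERON:") and cell_curie.startswith("CL:")):
        raise ValueError(f"unexpected composite ID: {composite_id!r}")
    return {
        "subject": cell_curie,
        "predicate": "biolink:expresses",
        "object": gene_id,
        "subject_location_qualifier": anatomy_curie,
    }


def to_precomposed(edge: dict, lookup: dict) -> dict:
    """Swap in a pre-composed CL term when the lookup has one; otherwise keep the edge."""
    key = (edge["subject"], edge.get("subject_location_qualifier"))
    precomposed = lookup.get(key)
    if precomposed is None:
        return dict(edge)
    rewritten = {k: v for k, v in edge.items() if k != "subject_location_qualifier"}
    rewritten["subject"] = precomposed
    return rewritten


# Placeholder IDs only: "GENE:placeholder" and "CL:placeholder" are not real terms.
edge = to_qualified_edge("UBERON:0001155/CL:0000111", "GENE:placeholder")
lookup = {("CL:0000111", "UBERON:0001155"): "CL:placeholder"}
print(to_precomposed(edge, lookup))
```

With a table like `lookup` maintained alongside CL, a provider could emit post-composed edges and still be reconciled with pre-composed terms where they exist.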
@karthiksoman - I'm looking at some of the pairs
In many cases the anatomical qualifier is redundant; an OWL reasoner tells us the composed concept is equivalent to the CL type:
UBERON:0000473,testis,CL:0000178,Leydig cell
UBERON:0000473,testis,CL:0000216,Sertoli cell
Same is true for:
UBERON:0000970,eye,CL:0000575,corneal epithelial cell
UBERON:0000970,eye,CL:0000142,vitreous cell
UBERON:0000970,eye,CL:0002224,lens epithelial cell
UBERON:0000970,eye,CL:0011004,lens fiber cell
UBERON:0002370,thymus,CL:0000883,thymic cortical macrophage
UBERON:0000966,retina,CL:0000740,retinal ganglion cell
For this one:
UBERON:0000966,retina,CL:0000149,pigment cell
I would avoid using very general functional terms like "pigment cell".
I think this is the pre-composed concept you want:
In contrast, an OWL reasoner would tell us that this is unsatisfiable:
UBERON:0002370,thymus,CL:0000336,medullary chromaffin cell of adrenal gland
This is because, I believe, the thymus and adrenal gland are spatially disjoint (though functionally related).
There are a whole host of terms like this:
UBERON:0000002,uterine cervix,CL:1001586,mammary gland glandular cell
UBERON:0001155,colon,CL:1001586,mammary gland glandular cell
UBERON:0002114,duodenum,CL:1001586,mammary gland glandular cell
UBERON:0001295,endometrium,CL:1001586,mammary gland glandular cell
UBERON:0001301,epididymis,CL:1001586,mammary gland glandular cell
UBERON:0003889,fallopian tube,CL:1001586,mammary gland glandular cell
UBERON:0002110,gall bladder,CL:1001586,mammary gland glandular cell
UBERON:0001132,parathyroid gland,CL:1001586,mammary gland glandular cell
UBERON:0002367,prostate gland,CL:1001586,mammary gland glandular cell
UBERON:0001052,rectum,CL:1001586,mammary gland glandular cell
UBERON:0001829,major salivary gland,CL:1001586,mammary gland glandular cell
UBERON:0000998,seminal vesicle,CL:1001586,mammary gland glandular cell
UBERON:0002108,small intestine,CL:1001586,mammary gland glandular cell
UBERON:0000945,stomach,CL:1001586,mammary gland glandular cell
UBERON:0002046,thyroid gland,CL:1001586,mammary gland glandular cell
UBERON:0002369,adrenal gland,CL:1001586,mammary gland glandular cell
UBERON:0001154,vermiform appendix,CL:1001586,mammary gland glandular cell
@cmungall Thanks for these details. Yes, I understand there is redundancy in the Anatomy qualifier. So, are you suggesting that we could leave such redundant entities as they are now (e.g. UBERON:0000473,testis,CL:0000216,Sertoli cell) and pre-compose only the others (e.g. UBERON:0001154,vermiform appendix,CL:1001586,mammary gland glandular cell)?
What I am saying is that for this one:
UBERON:0000473,testis,CL:0000216,Sertoli cell
You can just use the node CL:0000216 in your graph. It is already part of the testis; you get this edge when you bring in CL.
This one:
UBERON:0001154,vermiform appendix,CL:1001586,mammary gland glandular cell
doesn't make sense unless we are talking about some kind of metastatic cell
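A rough triage of such pairs might look like the sketch below. It is not a substitute for an OWL reasoner: `KNOWN_LOCATION` is a hand-written, hypothetical stand-in for the part-of edges you would get by loading CL, and the comparison is an exact label match that ignores part-of transitivity and disjointness axioms, so "review" only means a human should look:

```python
import csv
import io

# Hypothetical stand-in for CL's part-of edges, keyed by CL ID.
KNOWN_LOCATION = {
    "CL:0000216": "testis",         # Sertoli cell
    "CL:0000178": "testis",         # Leydig cell
    "CL:1001586": "mammary gland",  # mammary gland glandular cell
}


def triage(rows_csv: str):
    """Label each (anatomy, cell type) row as redundant, review, or unknown."""
    results = []
    for row in csv.reader(io.StringIO(rows_csv)):
        if not row:
            continue  # skip blank lines
        anatomy_id, anatomy_label, cell_id, _cell_label = row
        known = KNOWN_LOCATION.get(cell_id)
        if known is None:
            verdict = "unknown"    # nothing recorded for this cell type
        elif known == anatomy_label:
            verdict = "redundant"  # the plain CL node already implies the anatomy
        else:
            verdict = "review"     # cell type is tied to a different anatomy
        results.append((anatomy_id, cell_id, verdict))
    return results


rows = """UBERON:0000473,testis,CL:0000216,Sertoli cell
UBERON:0001154,vermiform appendix,CL:1001586,mammary gland glandular cell
UBERON:0000966,retina,CL:0000149,pigment cell"""
for result in triage(rows):
    print(result)
```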
@cmungall Got it Chris. Regarding the second example (appendix and mammary gland glandular cell), when I checked SPOKE, it showed the name of that AnatomyCellType node as "glandular cells in appendix" with the identifier "UBERON:0001154/CL:1001586". However, CL:1001586 corresponds to "mammary gland glandular cell" in Cell Ontology. So in this example I think that, instead of providing the CL ID for "appendix glandular cell", SPOKE might have given the CL ID for "mammary gland glandular cell". (Then the question arises: if there is a CL ID for "appendix glandular cell", why do we need a separate AnatomyCellType node for that entity?) I can bring up this mismatch during our internal SPOKEtech discussion. I see that you have given a number of such examples above. Thank you for pointing this out. It would be great if you could let me know if you come across more such cases, as this will help us fine-tune things.
@karthiksoman - do you think you have enough info here to update the data in SPOKE accordingly, or are there anatomical entities that are still needed after Chris's comments?
On the Translator Relay call, SPOKE mentioned difficulties in mapping some types, e.g. food, cell types.
In fact we have cell types: https://biolink.github.io/biolink-model/docs/Cell
Copied from Zoom chat:
I can comment further but input requested from SPOKE team