Closed bcorrie closed 7 months ago
- [ ] Is the class an entity or a process? Is it an information entity about a material entity/process, or a pure information entity/process?
Participant is a material entity.
- [ ] Specify the ontology URI for class. Is there a hierarchy of ontology terms that are relevant?
@schristley @jamesaoverton before I go down a rabbit hole on this: since I believe the fields/slots for `Participant` that we are starting with came from IEDB, I suspect that IEDB has URIs that would cover these slots? It doesn't make a lot of sense for me to start from scratch on these. For example, I am sure there is an accepted ontology for `biological_sex`?
@jamesaoverton I am happy to copy and paste things in from an IEDB spec/definition, so I can save you that work if you can point me to such a thing...
@bcorrie Yes, we have a lot of these mappings from previous work, and we're happy to help. I'll make a public document.
For 'biological sex' we prefer PATO 'genotypic sex' http://purl.obolibrary.org/obo/PATO_0020000. For 'participant' we probably want OBI 'organism' http://purl.obolibrary.org/obo/OBI_0100026, or maybe something more specific.
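A minimal sketch of how these two suggestions might be recorded in LinkML, assuming the class is called `Participant` and the slot `biological_sex` as in the spreadsheet. The schema `id` and `name` are placeholders, and the `string` range is only a stand-in until a value set is agreed.

```yaml
# Sketch only: id and name are placeholders, not the real ak-schema values.
id: https://example.org/ak-schema-sketch
name: ak_schema_sketch
prefixes:
  linkml: https://w3id.org/linkml/
  OBI: http://purl.obolibrary.org/obo/OBI_
  PATO: http://purl.obolibrary.org/obo/PATO_
imports:
  - linkml:types
default_range: string

classes:
  Participant:
    description: A participant in a study
    class_uri: OBI:0100026      # OBI 'organism', as suggested above
    slots:
      - biological_sex

slots:
  biological_sex:
    description: Sex of the participant
    slot_uri: PATO:0020000      # PATO 'genotypic sex', as suggested above
    range: string               # placeholder until a value set is agreed
```

Here `class_uri` carries the meaning of the class and `slot_uri` the meaning of the field; what values the field may take is a separate question, answered by `range`.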
I have been looking around at both LinkML and projects that use LinkML. This seems similar to our `Participant` and presumably uses LinkML under the hood.
https://cancerdhc.github.io/ccdhmodel/v1.2/Subject/
I found the link to the Cancer Research Data Commons on the LinkML site:
https://linkml.io/linkml/examples.html#introductory-example
Turns out it isn't really helpful other than to confirm that our fields are not ridiculous - but we knew that.
So they have this nice Class/Entity model with relationships, but as far as I can tell they have no links to external definitions of any kind (no external URIs are used) and they define everything internally with enums. Am I missing something? I have to say LinkML thus far has confused me more than helped.
So far this has been the best link I have found that explains how LinkML works in a practical sense:
https://biolink.github.io/biolink-model/using-the-modeling-language/
@bcorrie How helpful would it be for you if you could enter LinkML and then view the corresponding JSON schema? If you think that will help, I can send some instructions for playing with ak-schema. It is relatively straightforward using Docker.
I should have instructions in the README but I've not gotten past just creating the base repo.
@schristley I think we should work through a couple of concrete examples like the `Participant` class and its fields (we being you, @jamesaoverton, and me) and create a "complete" definition, in the sense of coming up with and agreeing on what the columns in the spreadsheet should be and what the resultant LinkML object would look like. You are the project lead (you know what you want), James is the ontology expert, and I am the team member tasked with the job. I think if we have a "finished" class as a concrete example of what is being looked for, then the rest of the group will be able to move forward much more easily on the other classes.
I suggest `Participant` because I think it is about as simple a class as we have, with fields that should have clear definitions for what the field means (field definition) as well as well-defined ontologies for the values the field can take on (the range).
I think this is important, as when I look at a field like `Participant.biological_sex` I am still not clear what I should be doing 8-). If we are going to use this approach, it should be fairly trivial for me to deal with this field - and yet it isn't.
To be honest, LinkML is confusing me to no end. Why, oh why, does it have to use different terms for things (slots - really?).
And why do they overload the term "slot"?
Basically everything in LinkML is a slot as far as I can tell...
Coming at this fresh from the outside, this makes their terminology a mess and their documentation confusing, at least to my tired old brain...
Fundamentally this class is simple (from here): https://linkml.io/linkml/intro/overview.html
classes:
  Participant:
    is_a: MaterialEntity ## parent class
    description: >-
      A participant in a study
    class_uri: schema:Person ##???
    slots:
      - participant_id
      - biological_sex
      - race
      - race_specify
      - ethnicity
      - geolocation
Call me done...
But what we really want is:
prefixes:
  OBI: http://purl.obolibrary.org/obo/OBI_
  IAO: http://purl.obolibrary.org/obo/IAO_
  NCIT: http://purl.obolibrary.org/obo/NCIT_
  PATO: http://purl.obolibrary.org/obo/PATO_
classes:
  Participant:
    is_a: MaterialEntity ## parent class
    description: >-
      A participant in a study
    class_uri: OBI:0000026
    attributes:
      participant_id:
        identifier: true
        slot_uri: IAO:0020000
        range: *
      biological_sex:
        slot_uri: PATO:0000047
        range: NCIT:C28421
      race:
        slot_uri: ???
        range: ???
      race_specify:
        slot_uri: ???
        range: ???
      ethnicity:
        slot_uri: ???
        range: ???
      geolocation:
        slot_uri: ???
        range: ???
Where I think each of the question marks in the above should be replaced by a value from a column in the object spreadsheet...
I have matched up the spreadsheet with the above for `Participant.biological_sex`.
- [ ] Specify the ontology URI for class. Is there a hierarchy of ontology terms that are relevant?
This term is actually about the role of the participant versus the material entity. I would go with organism as suggested by James.
http://purl.obolibrary.org/obo/OBI_0100026
class_uri: OBI:0100026
So they have this nice Class/Entity model with relationships, but as far as I can tell they have no links to external definitions of any kind (no external URIs are used) and they define everything internally with enums. Am I missing something? I have to say LinkML thus far has confused me more than helped.
I was confused about that as well, as it seemed like all the enums were hard-coded controlled vocabularies. But then I found this Dynamic Enum. It is similar to how we coded this in the AIRR standards: specify 1) the ontology (`source_ontology`) and 2) the root term (`source_nodes`). It looks more advanced, though, in that you can specify multiple ontologies and exclude parts of the ontology tree.
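A rough sketch of what excluding part of the tree might look like. As far as I understand the LinkML documentation, the query is nested under `reachable_from`, and `minus` subtracts a second query from the first. The `EX` prefix, the ontology identifier, and both term IDs are made up for illustration.

```yaml
# Sketch only: EX is a made-up prefix and the term IDs are placeholders.
prefixes:
  EX: https://example.org/ontology/EX_
enums:
  ExampleDynamicEnum:
    description: All descendants of a root term, minus one excluded branch
    reachable_from:
      source_ontology: EX:ontology     # hypothetical ontology identifier
      source_nodes:
        - EX:0000001                   # hypothetical root term
      include_self: false
      relationship_types:
        - rdfs:subClassOf
    minus:
      - reachable_from:
          source_ontology: EX:ontology
          source_nodes:
            - EX:0000002               # hypothetical branch to exclude
          include_self: true
          relationship_types:
            - rdfs:subClassOf
```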
That is cool, but as far as I can tell Cancer Research Data Commons does not use that, they have encoded everything as Static Enums (https://cancerdhc.github.io/ccdhmodel/v1.2/#enums). Admittedly it looks like many of them are autogenerated from some other source (which I can't find). So maybe this predates the DynamicEnum?
enum_CRDCH_Subject_ethnicity:
  name: enum_CRDCH_Subject_ethnicity
  description: Autogenerated Enumeration for CRDC-H Subject ethnicity
  comments:
    - 'Name according to TCCM: "CRDC-H.Subject.ethnicity"'
  code_set: https://terminology.ccdh.io/enumerations/CRDC-H.Subject.ethnicity
  code_set_version: '2021-12-16T18:04:32.260053+00:00'
It would be nice if we could find this: https://terminology.ccdh.io/enumerations/CRDC-H.Subject.ethnicity
Fundamentally this class is simple (from here): https://linkml.io/linkml/intro/overview.html
classes:
  Participant:
    is_a: MaterialEntity ## parent class
    description: >-
      An organism that is a participant in a study
    class_uri: OBI:0100026
    slots:
      - participant_id
      - biological_sex
      - race
      - race_specify
      - ethnicity
      - geolocation
Call me done...
Yep, that is good enough. The slots can be defined in each class or globally in the slots object. I would initially go with global slots in case we want to re-use them on other classes, and initially give them simple ranges as Type.
slots:
  biological_sex:
    range: string
  race:
    range: string
  ethnicity:
    range: string
  geolocation:
    range: string
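To illustrate the re-use argument, here is a small sketch in which one global slot is shared by two classes and tightened in one of them with `slot_usage`. The `Specimen` class is hypothetical and exists only to show the sharing.

```yaml
# Sketch only: Specimen is a hypothetical second class.
classes:
  Participant:
    slots:
      - biological_sex
      - geolocation
  Specimen:
    slots:
      - geolocation
    slot_usage:
      geolocation:
        required: true     # refinement applies to Specimen only
slots:
  biological_sex:
    range: string
  geolocation:
    range: string
```

Had `geolocation` been declared under `attributes` in `Participant`, the second class would need its own copy of the definition.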
In the AIRR spec, `biological_sex` is a controlled vocabulary enum, and both `race` and `ethnicity` are just strings. In AIRR, `geolocation` is in an ontology, so let's add that in:
top_node:
  id: GAZ:00000448
  label: geographic location
Change the range to be an enum:
slots:
  geolocation:
    range: GeoLocationEnum
Now define the enumeration. Unfortunately the doc doesn't do a good job of explaining what the different properties mean, so I'm just copy/pasting:
enums:
  GeoLocationEnum:
    reachable_from: ## the dynamic enum query is nested under reachable_from
      source_ontology: OBO:GAZ
      source_nodes:
        - GAZ:00000448
      include_self: false
      relationship_types:
        - rdfs:subClassOf
However, note that we are being a bit imprecise. The `class_uri: OBI:0100026` is a more generic organism, while our `Participant` implies a human. So we'll likely want to create a hierarchy of classes, so we can define subclasses with specialized slots.
classes:
  Organism:
    is_a: MaterialEntity ## parent class
    description: >-
      An organism that is a participant in an investigation
    class_uri: OBI:0100026
    slots:
      - organism_id
      - biological_sex
      - geolocation
  HumanParticipant:
    is_a: Organism
    slots:
      - race
      - race_specify
      - ethnicity
  MouseParticipant:
    is_a: Organism
    slots:
      - strain_name
We'll need to discuss this a bit to make sure we are doing the right thing, though. I imagine that, as with OO programming, sometimes you want to create a subclass and other times you just want an `organism_type` to handle variations...
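For comparison, a sketch of the non-subclass alternative: a single class with an `organism_type` slot. The enum name and its two values are assumptions chosen to mirror the human and mouse subclasses above.

```yaml
# Sketch only: OrganismTypeEnum and its values are assumptions.
prefixes:
  OBI: http://purl.obolibrary.org/obo/OBI_
  NCBITaxon: http://purl.obolibrary.org/obo/NCBITaxon_
classes:
  Organism:
    class_uri: OBI:0100026
    slots:
      - organism_id
      - organism_type
      - biological_sex
      - geolocation
      - race            # only meaningful when organism_type is human
      - strain_name     # only meaningful when organism_type is mouse
slots:
  organism_id:
    identifier: true
    range: string
  organism_type:
    range: OrganismTypeEnum
  biological_sex:
    range: string
  geolocation:
    range: string
  race:
    range: string
  strain_name:
    range: string
enums:
  OrganismTypeEnum:
    permissible_values:
      human:
        meaning: NCBITaxon:9606     # Homo sapiens
      mouse:
        meaning: NCBITaxon:10090    # Mus musculus
```

The trade-off is visible in the comments: the flat class is simpler, but nothing in it stops a mouse record from carrying a `race` value, which the subclass hierarchy rules out by construction.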
That is cool, but as far as I can tell Cancer Research Data Commons does not use that, they have encoded everything as Static Enums (https://cancerdhc.github.io/ccdhmodel/v1.2/#enums).
Yes, I'd seen that and wondered about it too, before I learned they were using LinkML. Now I can guess why... It is autogenerated, as that's how LinkML works. Though, are they doing it for optimization, or because dynamic enums were not available, or maybe they didn't want dynamic lookup? I don't know. Dynamic enums imply adding some code to do the dynamic lookup, so if all the enum values will fit into memory then static enums are fast.
However, note that we are being a bit imprecise. The `class_uri: OBI:0100026` is a more generic organism, while our `Participant` implies a human. So we'll likely want to create a hierarchy of classes, so we can define subclasses with specialized slots.
A further imprecision is that "participant in an investigation" includes people who are performing the investigation, like researchers and grad students, as well as participants that are organisms being sampled, who might also be people. I believe this means that we will want a `has_role` slot.
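A minimal sketch of what such a `has_role` slot could look like, assuming the roles are modelled as an enum. The enum name and value names are hypothetical; OBI:0000097 is the 'participant under investigation role' term named elsewhere in this thread, and the investigator value is deliberately left unmapped.

```yaml
# Sketch only: ParticipantRoleEnum and its value names are hypothetical.
prefixes:
  OBI: http://purl.obolibrary.org/obo/OBI_
slots:
  has_role:
    description: Role the participant plays in the investigation
    range: ParticipantRoleEnum
enums:
  ParticipantRoleEnum:
    permissible_values:
      study_subject:
        description: An organism being sampled or observed
        meaning: OBI:0000097      # 'participant under investigation role'
      investigator:
        description: A person performing the investigation
```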
I will wait for @jamesaoverton to provide some info on IEDB's mappings before going much further. I don't think it makes sense for me to spend time trying to figure a field out when the IEDB team has years of experience and has gone through many trials and tribulations on choosing these 8-)
I am starting to get a handle on LinkML at least.
Here is a current draft of the ImmuneSpace terminology, which should be a good starting point.
- If the class is a process, there should be slots for the inputs/outputs of the process. They may be directly defined in the class or are relations.
Not a process, checking this off.
- Define the slots for the class. Every slot should have an URI either to an ontology term or schema.org term for its semantics. Is there a hierarchy of terms that are relevant for its semantics?
Done as draft - see Google Sheet
- [ ] Define the attributes of each slot. Identifier? Required? Data type? Controlled vocabulary or ontology for the values?
Also done as first draft
- Define the direct relations to other objects. What is the cardinality of the relation? Is there an ontology URI for the relation?
I am not sure we should have this as part of the checklist for the class itself. This seems like second step that may take a long time to resolve and shouldn't necessarily hold back the definition of the class.
- [ ] Specify the ontology URI for class. Is there a hierarchy of ontology terms that are relevant?
This term is actually about the role of the participant versus the material entity. I would go with organism as suggested by James.
I am not sure about using the organism field. I believe OBI:0000097 (participant under investigation role) is more appropriate, since I believe most of what we are capturing is the characteristics about the participant and their role in the study. If I think about this class as a concept that we want to compare across repositories, the concept is of a participant that plays a role in a study.
Note we noted before that we were missing species/organism field and I have added that, but that is a "slot" that contains this information rather than the foundation of the class.
I have changed this back to OBI:0000097 for now, but I kept OBI:0100026 in the comments.
If you do this then you place the class under a different hierarchy, and it is no longer a material entity. In essence a human (or mouse or virus) becomes just a "role" and doesn't have an independent material existence. Maybe the fact that we are calling it `Participant` instead of `Organism` is confusing because it suggests that interpretation.
James can likely explain better, but I think pretty much all of our classes are going to be either material entities, information content entities, or processes because we are representing things in the real spatial world (material entity), or things in databases like Genbank records (information content entity), or biological processes on material entities (like VDJ recombination), or processes on informatics entities (like sequence annotation).
- Define the direct relations to other objects. What is the cardinality of the relation? Is there an ontology URI for the relation?
I am not sure we should have this as part of the checklist for the class itself. This seems like second step that may take a long time to resolve and shouldn't necessarily hold back the definition of the class.
I partially agree with some classes, but for this specific class the relations are clearly defined in the Miro board diagram for study design. Those are at least the core relations, and we can discuss if others are needed.
If you do this then you place the class under a different hierarchy, and it is no longer a material entity. In essence a human (or mouse or virus) becomes just a "role" and doesn't have an independent material existence. Maybe the fact that we are calling it `Participant` instead of `Organism` is confusing because it suggests that interpretation.
Yes, I think that is the issue. The name of the class suggests that we are describing a participant's role in the study. The actual name in the spreadsheet is `StudyParticipant`, which implies to me that the class is more oriented towards documenting the relationship an instance of this class might have to the study and its methodology - where role is a key attribute.
If this is more about the "Organism" then maybe the class should be "ExperimentalOrganism" or "StudyOrganism".
I should note that if this is indeed the case, then many of the fields are very much focused on "Homo sapiens" characteristics, and don't make sense for other organisms.
In essence a human (or mouse or virus) becomes just a "role" and doesn't have an independent material existence.
In the context of the `StudyOrganism` concept, is it intended that this capture viruses as well? It may be true that this makes sense as a `StudyOrganism` since viruses are studied, but I don't think it makes sense in the context of relationships (as per the Miro board).
A `StudyOrganism` `participates-in` an `Investigation`, it `has-role` in a `StudyArm`, and `participates-in` `LifeEvents`.
If we use `StudyOrganism` to capture how a virus is involved in a study, I am pretty sure the relationships will be completely different, no? A virus certainly doesn't participate in an investigation in the same way a human subject does.
I don't see a representation of `Virus` as an `Organism` that is studied in a study in the Miro diagram. Presumably these might be related to InVivo ImmuneExposure, InVitro ImmuneExposure, Assay (for detection), but they are not represented in the diagram as far as I can tell.
- Define the direct relations to other objects. What is the cardinality of the relation? Is there an ontology URI for the relation?
Done - A `StudyOrganism` `participates-in` an `Investigation`, it `has-role` in a `StudyArm`, and `participates-in` `LifeEvents`.
Captured in Google sheet.
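A sketch of how these three relations might be expressed as LinkML slots whose ranges are other classes. The slot names and the cardinalities (`required`, `multivalued`) are assumptions for illustration, not values taken from the Google Sheet.

```yaml
# Sketch only: slot names and cardinalities are assumptions.
prefixes:
  RO: http://purl.obolibrary.org/obo/RO_
classes:
  Investigation:
    description: A study
  StudyArm:
    description: An arm of a study
  LifeEvent:
    description: An event in the life of an organism
  StudyOrganism:
    slots:
      - investigation
      - study_arm
      - life_events
slots:
  investigation:
    description: The Investigation this organism participates in
    slot_uri: RO:0000056          # RO 'participates in'
    range: Investigation
    required: true                # assumed: exactly one investigation
  study_arm:
    description: The StudyArm in which this organism has a role
    range: StudyArm
  life_events:
    description: LifeEvents this organism participates in
    range: LifeEvent
    multivalued: true             # assumed: zero or more life events
```

Cardinality in LinkML comes from the combination of `required` and `multivalued`, so each relation in the checklist maps onto those two flags plus a `range`.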
@jamesaoverton any chance you can take a quick peek at the `Participant` (now `ParticipantOrganism`) class in the Google Sheet before the meeting tomorrow? I am happy to walk through what I did, but you having a look at it before that (for discussion in the meeting if you like) might be helpful, rather than seeing it for the first time in the meeting 8-)
I have fleshed it out to the best of my abilities for now 8-)
If not, no worries, we can look at it at the meeting.
We discussed this a bit on the call today. Here are some of my thoughts:
I'd prefer to keep the name 'Participant' for this class.
I think that the instances of Participant are organisms, not roles, because (1) it makes sense to talk about the biological sex of an organism but not of a role, and (2) we want to talk about things that happened to these organisms before the study started, but the 'participant under investigation role' for this study didn't exist until the study existed.
Since we're studying the immune system, all the participants must have an immune system, so they must all be vertebrates. So the `class_uri` for Participant can be NCBITaxon:7776 for 'Gnathostomata'.
For IEDB work we use PATO 'genotypic sex'. Brian prefers PATO 'phenotypic sex'. I'm fine with whatever the group decides.
In my PR https://github.com/airr-knowledge/ak-schema/blob/jamesaoverton-1/src/ak_schema/schema/ak_schema.yaml I said that arms are a kind of population, so a participant is a 'member of' its arm. I said the arm 'participates in' the investigation, and it would also be fine to say that the participant 'participates in' the investigation. A participant also 'participates in' its life events. We discussed how I think the participant 'has type' (`rdf:type`) its species.
For IEDB work we use PATO 'genotypic sex'. Brian prefers PATO 'phenotypic sex'. I'm fine with whatever the group decides.
I think we might want both. In most cases in the AIRR world, we would not know genotypic sex, we would know self reported sex from subjects that are participating in studies. I believe it would be incorrect to report that as genotypic sex. I also understand that if you know the genotypic sex you probably want to report that.
I'm not sure that we need to distinguish between the two. None of the repositories are differentiating, so assigning them to the appropriate slot might be fraught or error prone. Probably better to just use biological sex.
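A sketch of the single-slot option, assuming one `biological_sex` slot with a small static enum. The enum name and the choice of just two values are assumptions; the PATO mappings shown are 'biological sex', 'male', and 'female'.

```yaml
# Sketch only: BiologicalSexEnum and its value list are assumptions.
prefixes:
  PATO: http://purl.obolibrary.org/obo/PATO_
slots:
  biological_sex:
    slot_uri: PATO:0000047        # PATO 'biological sex'
    range: BiologicalSexEnum
enums:
  BiologicalSexEnum:
    permissible_values:
      male:
        meaning: PATO:0000384     # PATO 'male'
      female:
        meaning: PATO:0000383     # PATO 'female'
```

If the group later decides to distinguish genotypic from phenotypic sex, two more specific slots could reuse the same enum as their range.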
- Define the slots for the class. Every slot should have an URI either to an ontology term or schema.org term for its semantics. Is there a hierarchy of terms that are relevant for its semantics?
Done as draft - see Google Sheet
If I heard the description earlier correctly, `race_specify` is how the race was described (specified) by the authors in the publication. That description might not match up well with, or might be more precise than, the simplified controlled vocabulary for `race`, e.g. the NIH categories. This allows that original description to be saved in free-text form. I'm thinking that we might not want this in the AKC, at least in the first pass.
My draft of the AKC schema (first in SQL then in LinkML) was based on our ImmuneSpace work, which was based on ImmPort. In ImmPort, 'race_specify' is a free-text field usually used when the 'race' is "Other". I agree that if AKC doesn't need this field, it should be removed from the schema. Likewise for other fields.
My goal has been a full draft that we can compile and test, and then improve.
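If the field is kept, the ImmPort convention described above (free text used when `race` is "Other") could be sketched roughly as follows. `RaceEnum` and its single value are placeholders, and the `rules` block reflects my understanding of LinkML class rules rather than anything in the current draft.

```yaml
# Sketch only: RaceEnum and its values are placeholders, not an agreed vocabulary.
classes:
  Participant:
    slots:
      - race
      - race_specify
    rules:
      - description: If race is "Other", a free-text description is expected
        preconditions:
          slot_conditions:
            race:
              equals_string: Other
        postconditions:
          slot_conditions:
            race_specify:
              required: true
slots:
  race:
    range: RaceEnum
  race_specify:
    description: Free-text description of race as given by the data provider
    range: string
enums:
  RaceEnum:
    permissible_values:
      Other:
        description: Not covered by the other categories; see race_specify
      # remaining categories to be agreed
```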
I think this is pretty well done as a first pass for this class - See PR https://github.com/airr-knowledge/ak-schema/pull/5
Entity
See Google Sheet
See Google Sheet
See Google Sheet
See Google Sheet
Not a process
I believe that the class should be called ParticipantOrganism and it should have two subclasses, ParticipantHuman and ParticipantNonHuman.
Slots for Class and Subclasses laid out in Google Sheet.