airr-knowledge / issues

Issues and project management for the AKC
Checklist for Participant Class #24

Closed bcorrie closed 3 months ago

bcorrie commented 4 months ago

Entity

See Google Sheet

See Google Sheet

See Google Sheet

See Google Sheet

Not a process

I believe that the class should be called ParticipantOrganism and it should have two subclasses, ParticipantHuman and ParticipantNonHuman.

Slots for Class and Subclasses laid out in Google Sheet.

schristley commented 4 months ago
  • [ ] Is the class an entity or a process? Is it an information entity about a material entity/process, or a pure information entity/process?

Participant is a material entity.

bcorrie commented 4 months ago
  • [ ] Specify the ontology URI for class. Is there a hierarchy of ontology terms that are relevant?

http://purl.obolibrary.org/obo/OBI_0000097

bcorrie commented 4 months ago

@schristley @jamesaoverton before I go down a rabbit hole on this: since I believe the fields/slots for Participant that we are starting with came from IEDB, I suspect that IEDB has URIs that would cover these slots? It doesn't make a lot of sense for me to start from scratch on these. For example, I am sure there is an accepted ontology for biological_sex?

@jamesaoverton I am happy to copy and paste things in from an IEDB spec/definition, so I can save you that work if you can point me to such a thing...

jamesaoverton commented 4 months ago

@bcorrie Yes, we have a lot of these mappings from previous work, and we're happy to help. I'll make a public document.

For 'biological sex' we prefer PATO 'genotypic sex' http://purl.obolibrary.org/obo/PATO_0020000. For 'participant' we probably want OBI 'organism' http://purl.obolibrary.org/obo/OBI_0100026, or maybe something more specific.
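
As an aside, the PURLs above follow the OBO pattern in which a compact identifier (CURIE) such as PATO:0020000 expands to a full IRI by prepending a per-ontology base. A minimal sketch of that expansion, assuming a hand-written prefix map; `PREFIXES` and `expand_curie` are illustrative names, not part of LinkML or any IEDB tooling:

```python
# Hypothetical prefix map. The base IRIs follow the OBO PURL pattern
# seen in the links quoted in this thread.
PREFIXES = {
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
    "PATO": "http://purl.obolibrary.org/obo/PATO_",
}


def expand_curie(curie: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Expand a CURIE such as 'PATO:0020000' into a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    # Reject strings without a colon and prefixes we do not know about.
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or malformed CURIE: {curie!r}")
    return prefixes[prefix] + local_id


print(expand_curie("PATO:0020000"))  # 'genotypic sex'
print(expand_curie("OBI:0100026"))   # 'organism'
```

This is the same mapping a schema-level prefix declaration would provide, which is why a short form such as OBI:0100026 can stand in for the full URL.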

bcorrie commented 4 months ago

I have been looking around at both LinkML and projects that use LinkML. This seems similar to our Participant and presumably uses LinkML under the hood.

https://cancerdhc.github.io/ccdhmodel/v1.2/Subject/

I found the link to the Cancer Research Data Commons on the LinkML site:

https://linkml.io/linkml/examples.html#introductory-example

Turns out it isn't really helpful other than to confirm that our fields are not ridiculous - but we knew that.

So they have this nice Class/Entity model with relationships, but as far as I can tell they have no links to external definitions of any kind (no external URIs are used) and they define everything internally with enums. Am I missing something? I have to say LinkML thus far has confused me more than helped.

bcorrie commented 4 months ago

So far this has been the best link I have found that explains how LinkML works in a practical sense:

https://biolink.github.io/biolink-model/using-the-modeling-language/

schristley commented 4 months ago

@bcorrie How helpful would it be for you if you could enter LinkML and then view the corresponding JSON schema? If you think that will help then I can send some instructions for playing with ak-schema. It is relatively straightforward using Docker.

I should have instructions in the README but I've not gotten past just creating the base repo.

bcorrie commented 4 months ago

@schristley I think we should work through a couple of concrete examples like the Participant class and its fields - we being you, @jamesaoverton and me - and create a "complete" definition in the sense of coming up with and agreeing on what the columns in the spreadsheet should be and what the resultant LinkML object would look like. You are the project lead (you know what you want), James is the ontology expert, and I am the team member tasked with the job. I think if we have a "finished" class as a concrete example of what is being looked for then the rest of the group will be able to move forward much more easily on the other classes.

I suggest Participant because I think it is about as simple a class as we have, with fields that should have clear definitions for what the field means (the field definition) as well as well-defined ontologies for the values the field can take (the range).

I think this is important, as when I look at a field like Participant.biological_sex I am still not clear what I should be doing 8-). If we are going to use this approach, it should be fairly trivial for me to deal with this field - and yet it isn't.

bcorrie commented 4 months ago

To be honest, LinkML is confusing me to no end. Why, oh why, does it have to use different terms for things (slots - really?).

And why do they overload the term "slot"?

Basically everything in LinkML is a slot as far as I can tell...

Coming at this fresh from the outside, this makes their terminology a mess and their documentation confusing, at least to my tired old brain...

bcorrie commented 4 months ago

Fundamentally this class is simple (from here): https://linkml.io/linkml/intro/overview.html

classes:
  Participant:
    is_a: MaterialEntity ## parent class
    description: >-
      A participant in a study
    class_uri: schema:Person ##???
    slots:
      - participant_id
      - biological_sex
      - race
      - race_specify
      - ethnicity
      - geolocation

Call me done...

bcorrie commented 4 months ago

But what we really want is:

prefixes:
  OBI: http://purl.obolibrary.org/obo/OBI_
  IAO: http://purl.obolibrary.org/obo/IAO_
  NCIT: http://purl.obolibrary.org/obo/NCIT_
  PATO: http://purl.obolibrary.org/obo/PATO_

classes:
  Participant:
    is_a: MaterialEntity ## parent class
    description: >-
      A participant in a study
    class_uri: OBI:0000026
    attributes:
      participant_id:
        identifier: true
        slot_uri: IAO:0020000
        range: "*"  ## quoted because a bare * is not valid YAML
      biological_sex:
        slot_uri: PATO:0000047
        range: NCIT:C28421
      race:
        slot_uri: ???
        range: ???
      race_specify:
        slot_uri: ???
        range: ???
      ethnicity:
        slot_uri: ???
        range: ???
      geolocation:
        slot_uri: ???
        range: ???

Where I think each of the question marks in the above should be replaced by a value from a column in the object spreadsheet...

I have matched up the spreadsheet with the above for Participant.biological_sex.

schristley commented 4 months ago
  • [ ] Specify the ontology URI for class. Is there a hierarchy of ontology terms that are relevant?

http://purl.obolibrary.org/obo/OBI_0000097

This term is actually about the role of the participant versus the material entity. I would go with organism as suggested by James.

http://purl.obolibrary.org/obo/OBI_0100026

    class_uri: OBI:0100026

schristley commented 4 months ago

So they have this nice Class/Entity model with relationships, but as far as I can tell they have no links to external definitions of any kind (no external URIs are used) and they define everything internally with enums. Am I missing something? I have to say LinkML thus far has confused me more than helped.

I was confused about that as well, as it seemed like all the enums were hard-coded controlled vocabulary. But then I found this Dynamic Enum. It is similar to how we coded this in the AIRR standards: specify 1) the ontology (source_ontology) and 2) the root term (source_nodes). Though it looks more advanced in that you can specify multiple ontologies and exclude parts of the ontology tree.
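
To make the "ontology plus root term" idea concrete, here is a minimal sketch of the lookup a dynamic enum implies: walk down from the root term(s) and accept every descendant. It assumes a tiny in-memory table of is-a edges; only GAZ:00000448 is a term from this thread, the `EX:` identifiers are made-up placeholders, and `reachable_from` here is a plain Python function written for illustration, not a LinkML API:

```python
# Toy is-a edges (child -> parents). Only GAZ:00000448 comes from the thread;
# the EX: terms are invented placeholders, not real GAZ identifiers.
PARENTS = {
    "EX:country": ["GAZ:00000448"],
    "EX:city": ["GAZ:00000448"],
    "EX:capital_city": ["EX:city"],
    "EX:unrelated": [],
}


def reachable_from(source_nodes, parents, include_self=False):
    """Return every term that is a transitive descendant of a source node."""
    # Invert child -> parents into parent -> children for a downward walk.
    children = {}
    for child, its_parents in parents.items():
        for parent in its_parents:
            children.setdefault(parent, set()).add(child)

    found = set()
    visited = set()
    stack = list(source_nodes)
    while stack:
        node = stack.pop()
        if node in visited:  # guards against cycles and repeated work
            continue
        visited.add(node)
        for child in children.get(node, ()):
            found.add(child)
            stack.append(child)

    if include_self:
        found |= set(source_nodes)
    return found


allowed = reachable_from(["GAZ:00000448"], PARENTS)
print(sorted(allowed))
```

A value would then be accepted for the slot only if it is in `allowed`. Passing several source nodes covers the "multiple roots" case; excluding part of the tree could be done by subtracting a second `reachable_from` result.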

bcorrie commented 4 months ago

I was confused about that as well, as it seemed like all the enums were hard-coded controlled vocabulary. But then I found this Dynamic Enum. It is similar to how we coded this in the AIRR standards: specify 1) the ontology (source_ontology) and 2) the root term (source_nodes). Though it looks more advanced in that you can specify multiple ontologies and exclude parts of the ontology tree.

That is cool, but as far as I can tell Cancer Research Data Commons does not use that, they have encoded everything as Static Enums (https://cancerdhc.github.io/ccdhmodel/v1.2/#enums). Admittedly it looks like many of them are autogenerated from some other source (which I can't find). So maybe this predates the DynamicEnum?

  enum_CRDCH_Subject_ethnicity:
    name: enum_CRDCH_Subject_ethnicity
    description: Autogenerated Enumeration for CRDC-H Subject ethnicity
    comments:
    - 'Name according to TCCM: "CRDC-H.Subject.ethnicity"'
    code_set: https://terminology.ccdh.io/enumerations/CRDC-H.Subject.ethnicity
    code_set_version: '2021-12-16T18:04:32.260053+00:00'

It would be nice if we could find this: https://terminology.ccdh.io/enumerations/CRDC-H.Subject.ethnicity

schristley commented 4 months ago

Fundamentally this class is simple (from here): https://linkml.io/linkml/intro/overview.html

classes:
  Participant:
    is_a: MaterialEntity ## parent class
    description: >-
      An organism that is a participant in a study
    class_uri: OBI:0100026
    slots:
      - participant_id
      - biological_sex
      - race
      - race_specify
      - ethnicity
      - geolocation

Call me done...

Yep, that is good enough. The slots can be defined in each class or globally in the slots object. I would initially go with global slots in case we want to re-use them on other classes. Then initially give them simple ranges as Type.

slots:
  biological_sex:
    range: string
  race:
    range: string
  ethnicity:
    range: string
  geolocation:
    range: string

schristley commented 4 months ago

In the AIRR spec, biological_sex is a controlled vocabulary enum, and both race and ethnicity are just strings. In AIRR, geolocation is in an ontology so let's add that in:

                    top_node:
                        id: GAZ:00000448
                        label: geographic location

Change the range to be an enum

slots:
  geolocation:
    range: GeoLocationEnum

Now define the enumeration. Unfortunately the doc doesn't do a good job of explaining what the different properties mean, so I'm just copy/pasting

enums:
  GeoLocationEnum:
    reachable_from:
      source_ontology: OBO:GAZ
      source_nodes:
        - GAZ:00000448
      include_self: false
      relationship_types:
        - rdfs:subClassOf

schristley commented 4 months ago

However, note that we are being a bit imprecise. The class_uri: OBI:0100026 is a more generic organism while our Participant implies a human. So we'll likely want to create a hierarchy of classes, so we can define subclasses with specialized slots.

classes:
  Organism:
    is_a: MaterialEntity ## parent class
    description: >-
      An organism that is a participant in an investigation
    class_uri: OBI:0100026
    slots:
      - organism_id
      - biological_sex
      - geolocation

  HumanParticipant:
    is_a: Organism
    slots:
      - race
      - race_specify
      - ethnicity

  MouseParticipant:
    is_a: Organism
    slots:
      - strain_name

We'll need to discuss this a bit to make sure we are doing the right thing though. I imagine, like with OO programming, sometimes you want to create a subclass and other times you just want an organism_type to handle variations...
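
For the subclass option, the practical effect of is_a is that a subclass carries its parent's slots plus its own. A small sketch of that resolution, with the hierarchy above transcribed as plain dicts; `CLASSES` and `all_slots` are hypothetical names, and Organism's own parent (MaterialEntity) is left out here as a simplification:

```python
# The hierarchy from the comment above, transcribed as plain dicts.
# Simplification: Organism's parent (MaterialEntity) is omitted, so is_a is None.
CLASSES = {
    "Organism": {
        "is_a": None,
        "slots": ["organism_id", "biological_sex", "geolocation"],
    },
    "HumanParticipant": {
        "is_a": "Organism",
        "slots": ["race", "race_specify", "ethnicity"],
    },
    "MouseParticipant": {
        "is_a": "Organism",
        "slots": ["strain_name"],
    },
}


def all_slots(class_name, classes=CLASSES):
    """Collect the slots of a class, ancestors first, by following is_a links."""
    # Build the chain from the class up to its root (assumes no is_a cycles).
    chain = []
    current = class_name
    while current is not None:
        chain.append(current)
        current = classes[current]["is_a"]

    # Walk root-first so inherited slots come before the class's own slots.
    slots = []
    for name in reversed(chain):
        for slot in classes[name]["slots"]:
            if slot not in slots:
                slots.append(slot)
    return slots


print(all_slots("MouseParticipant"))
print(all_slots("HumanParticipant"))
```

With the organism_type alternative there would be a single class carrying every slot, and code would have to decide per record which slots apply; the subclass layout moves that decision into the schema.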

schristley commented 4 months ago

That is cool, but as far as I can tell Cancer Research Data Commons does not use that, they have encoded everything as Static Enums (https://cancerdhc.github.io/ccdhmodel/v1.2/#enums).

Yes, I'd seen and wondered that too, before I learned they were using LinkML. Now I can guess why... It is autogenerated, as that's how LinkML works. Though, are they doing it for optimization, or because dynamic enums were not available, or maybe they didn't want dynamic lookup? I don't know. Dynamic enums imply adding some code to do the dynamic lookup, whereas if all the enum values fit into memory then a static enum is fast.

schristley commented 4 months ago

However, note that we are being a bit imprecise. The class_uri: OBI:0100026 is a more generic organism while our Participant implies a human. So we'll likely want to create a hierarchy of classes, so we can define subclasses with specialized slots.

A further imprecision is that "participant in an investigation" includes people who are performing the investigation, like researchers and grad students, versus participants that are organisms being sampled, who might also be people. I believe this means that we will want a has_role slot.

bcorrie commented 4 months ago

I will wait for @jamesaoverton to provide some info on IEDB's mappings before going much further. I don't think it makes sense for me to spend time trying to figure a field out when the IEDB team has years of experience and has gone through many trials and tribulations on choosing these 8-)

I am starting to get a handle on LinkML at least.

jamesaoverton commented 4 months ago

Here is a current draft of the ImmuneSpace terminology, which should be a good starting point.

bcorrie commented 4 months ago
  • If the class is a process, there should be slots for the inputs/outputs of the process. They may be directly defined in the class or are relations.

Not a process, checking this off.

bcorrie commented 4 months ago
  • Define the slots for the class. Every slot should have an URI either to an ontology term or schema.org term for its semantics. Is there a hierarchy of terms that are relevant for its semantics?

Done as draft - see Google Sheet

bcorrie commented 4 months ago
  • [ ] Define the attributes of each slot. Identifier? Required? Data type? Controlled vocabulary or ontology for the values?

Also done as first draft

bcorrie commented 4 months ago
  • Define the direct relations to other objects. What is the cardinality of the relation? Is there an ontology URI for the relation?

I am not sure we should have this as part of the checklist for the class itself. This seems like a second step that may take a long time to resolve and shouldn't necessarily hold back the definition of the class.

bcorrie commented 4 months ago
  • [ ] Specify the ontology URI for class. Is there a hierarchy of ontology terms that are relevant?

http://purl.obolibrary.org/obo/OBI_0000097

This term is actually about the role of the participant versus the material entity. I would go with organism as suggested by James.

http://purl.obolibrary.org/obo/OBI_0100026

I am not sure about using the organism field. I believe OBI:0000097 (participant under investigation role) is more appropriate, since I believe most of what we are capturing is the characteristics about the participant and their role in the study. If I think about this class as a concept that we want to compare across repositories, the concept is of a participant that plays a role in a study.

Note that we observed earlier that we were missing a species/organism field, and I have added that, but that is a "slot" that contains this information rather than the foundation of the class.

I have changed this back to OBI:0000097 for now but I kept OBI:0100026 in the comments.

schristley commented 4 months ago
  • [ ] Specify the ontology URI for class. Is there a hierarchy of ontology terms that are relevant?

http://purl.obolibrary.org/obo/OBI_0000097

This term is actually about the role of the participant versus the material entity. I would go with organism as suggested by James. http://purl.obolibrary.org/obo/OBI_0100026

I am not sure about using the organism field. I believe OBI:0000097 (participant under investigation role) is more appropriate, since I believe most of what we are capturing is the characteristics about the participant and their role in the study. If I think about this class as a concept that we want to compare across repositories, the concept is of a participant that plays a role in a study.

If you do this then you place the class under a different hierarchy, and it is no longer a material entity. In essence a human (or mouse or virus) becomes just a "role" and doesn't have an independent material existence. Maybe the fact that we are calling it Participant instead of Organism is confusing because it suggests that interpretation.

James can likely explain better, but I think pretty much all of our classes are going to be either material entities, information content entities, or processes because we are representing things in the real spatial world (material entity), or things in databases like Genbank records (information content entity), or biological processes on material entities (like VDJ recombination), or processes on informatics entities (like sequence annotation).

schristley commented 4 months ago
  • Define the direct relations to other objects. What is the cardinality of the relation? Is there an ontology URI for the relation?

I am not sure we should have this as part of the checklist for the class itself. This seems like a second step that may take a long time to resolve and shouldn't necessarily hold back the definition of the class.

I partially agree for some classes, but for this specific class the relations are clearly defined in the Miro board diagram for study design. Those are at least the core relations, and we can discuss if others are needed.

bcorrie commented 4 months ago

If you do this then you place the class under a different hierarchy, and it is no longer a material entity. In essence a human (or mouse or virus) becomes just a "role" and doesn't have an independent material existence. Maybe the fact that we are calling it Participant instead of Organism is confusing because it suggests that interpretation.

Yes, I think that is the issue. The name of the class suggests that we are describing a participant's role in the study. The actual name in the spreadsheet is StudyParticipant, which implies to me that the class is more oriented towards documenting the relationship an instance of this class might have to the study and its methodology - where role is a key attribute.

If this is more about the "Organism" then maybe the class should be "ExperimentalOrganism" or "StudyOrganism".

I should note that if this is indeed the case, then many of the fields are very much focused on "Homo sapiens" characteristics, and don't make sense for other organisms.

bcorrie commented 4 months ago

In essence a human (or mouse or virus) becomes just a "role" and doesn't have an independent material existence.

In the context of the StudyOrganism concept, is it intended that this capture virus as well? It may be true that this makes sense as a StudyOrganism since viruses are studied, but I don't think it makes sense in the context of relationships (as per the Miro board).

A StudyOrganism participates-in an Investigation, it has-role in a StudyArm, and participates-in LifeEvents.

If we use StudyOrganism to capture how a virus is involved in a study, I am pretty sure the relationships will be completely different, no? A virus certainly doesn't participate in an investigation in the same way a human subject does.

bcorrie commented 4 months ago

I don't see a representation of Virus as an Organism that is studied in a study in the Miro diagram. Presumably these might be related to InVivo ImmuneExposure, InVitro ImmuneExposure, Assay (for detection) but they are not represented in the diagram as far as I can tell.

bcorrie commented 4 months ago
  • Define the direct relations to other objects. What is the cardinality of the relation? Is there an ontology URI for the relation?

Done - A StudyOrganism participates-in an Investigation, it has-role in a StudyArm, and participates-in LifeEvents.

Captured in Google sheet.

bcorrie commented 4 months ago

@jamesaoverton any chance you can take a quick peek at the Participant (now ParticipantOrganism) class in the Google Sheet before the meeting tomorrow? I am happy to walk through what I did but you having a look at it before that (for discussion in the meeting if you like) might be helpful - rather than seeing it for the first time in the meeting 8-)

I have fleshed it out to the best of my abilities for now 8-)

If not, no worries, we can look at it at the meeting.

jamesaoverton commented 4 months ago

We discussed this a bit on the call today. Here are some of my thoughts:

I'd prefer to keep the name 'Participant' for this class.

I think that the instances of Participant are organisms, not roles, because (1) it makes sense to talk about the biological sex of an organism but not of a role, and (2) we want to talk about things that happened to these organisms before the study started, but the 'participant under investigation role' for this study didn't exist until the study existed.

Since we're studying the immune system, all the participants must have an immune system, so they must all be vertebrates. So the class_uri for Participant can be NCBITaxon:7776 for 'Gnathostomata'.

For IEDB work we use PATO 'genotypic sex'. Brian prefers PATO 'phenotypic sex'. I'm fine with whatever the group decides.

In my PR https://github.com/airr-knowledge/ak-schema/blob/jamesaoverton-1/src/ak_schema/schema/ak_schema.yaml I said that arms are a kind of population, so a participant is a 'member of' its arm. I said the arm 'participates in' the investigation, and it would also be fine to say that the participant 'participates in' the investigation. A participant also 'participates in' its life events. We discussed how I think the participant 'has type' (rdf:type) its species.

bcorrie commented 4 months ago

For IEDB work we use PATO 'genotypic sex'. Brian prefers PATO 'phenotypic sex'. I'm fine with whatever the group decides.

I think we might want both. In most cases in the AIRR world, we would not know genotypic sex, we would know self reported sex from subjects that are participating in studies. I believe it would be incorrect to report that as genotypic sex. I also understand that if you know the genotypic sex you probably want to report that.

schristley commented 4 months ago

For IEDB work we use PATO 'genotypic sex'. Brian prefers PATO 'phenotypic sex'. I'm fine with whatever the group decides.

I think we might want both. In most cases in the AIRR world, we would not know genotypic sex, we would know self reported sex from subjects that are participating in studies. I believe it would be incorrect to report that as genotypic sex. I also understand that if you know the genotypic sex you probably want to report that.

I'm not sure that we need to distinguish between the two. None of the repositories are differentiating, so assigning them to the appropriate slot might be fraught or error prone. Probably better to just use biological sex.

schristley commented 3 months ago
  • Define the slots for the class. Every slot should have an URI either to an ontology term or schema.org term for its semantics. Is there a hierarchy of terms that are relevant for its semantics?

Done as draft - see Google Sheet

If I heard the description earlier correctly, race_specify is how the race was described (specified) by the authors in the publication. That description might not match up well with, or might be more precise than, the simplified controlled vocabulary for race, e.g. the NIH categories. This allows that original description to be saved in free-text form. I'm thinking that we might not want this in the AKC, at least in the first pass.

jamesaoverton commented 3 months ago

My draft of the AKC schema (first in SQL then in LinkML) was based on our ImmuneSpace work, which was based on ImmPort. In ImmPort, 'race_specify' is a free-text field usually used when the 'race' is "Other". I agree that if AKC doesn't need this field, it should be removed from the schema. Likewise for other fields.

My goal has been a full draft that we can compile and test, and then improve.
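
If race_specify is kept, the relationship described above (free text, usually filled in when race is "Other") could be checked with something like the sketch below. The rule is stated here more strictly than "usually" as an assumption for illustration, and `race_warnings` and the record layout are hypothetical, not taken from ImmPort or ak-schema:

```python
def race_warnings(record: dict) -> list[str]:
    """Return warnings for one participant record (illustrative rule only)."""
    warnings = []
    race = record.get("race")
    detail = record.get("race_specify")

    # Assumed rule: 'Other' should come with a free-text description...
    if race == "Other" and not detail:
        warnings.append("race is 'Other' but race_specify is empty")
    # ...and a description without 'Other' is flagged for review, not rejected.
    if detail and race != "Other":
        warnings.append("race_specify is set although race is not 'Other'")
    return warnings


print(race_warnings({"race": "Other", "race_specify": ""}))
print(race_warnings({"race": "Other", "race_specify": "as described by the authors"}))
```

Returning warnings rather than raising keeps the check usable on existing data where the two fields were filled in loosely.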

bcorrie commented 3 months ago

I think this is pretty well done as a first pass for this class - See PR https://github.com/airr-knowledge/ak-schema/pull/5