airr-knowledge / issues

Issues and project management for the AKC

import AIRR Data Model into AKC LinkML #28

Open schristley opened 4 months ago

schristley commented 4 months ago

AKC will extend and integrate many of the classes/objects in the AIRR Data Model. LinkML has an importer for JSON schema. We want to automate the import/translation process so that we can run it when the AIRR Data Model changes.

schristley commented 3 months ago

Hi @bcorrie so when I mentioned I'd like you to work with LinkML, I was really thinking about this. We played with that schema-automator tool, and it seemed like if we process the AIRR schema file a little bit (maybe by extracting each object separately), this might be an easy way to get the AIRR stuff into LinkML.
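
As a rough illustration of the "extract each object separately" idea, the sketch below pulls selected top-level objects out of an already-loaded schema dictionary and wraps each one as a small standalone document. The function name `split_objects` and the tiny `airr_schema` stand-in are assumptions made for this example; it is not the real AIRR schema file, and nothing here calls schema-automator itself.

```python
import json

def split_objects(schema, wanted):
    """Return one standalone JSON-schema-like document per requested object."""
    documents = {}
    for name in wanted:
        definition = schema.get(name)
        # Skip names that are missing or that are not object definitions.
        if not isinstance(definition, dict) or "properties" not in definition:
            continue
        documents[name] = {
            "title": name,
            "type": "object",
            "required": definition.get("required", []),
            "properties": definition["properties"],
        }
    return documents

# Tiny stand-in for the loaded schema (hypothetical content, not the real file).
airr_schema = {
    "Info": {"title": "AIRR Schema"},
    "Subject": {
        "type": "object",
        "required": ["subject_id"],
        "properties": {"subject_id": {"type": "string"}},
    },
    "Genotype": {
        "type": "object",
        "properties": {"receptor_genotype_id": {"type": "string"}},
    },
}

per_object = split_objects(airr_schema, ["Subject", "Genotype", "Cell"])
for name, document in per_object.items():
    print(name, json.dumps(document))
```

Each resulting document could then be written to its own file and handed to an importer. Any `$ref` entries that point at other objects would stay unresolved in this sketch and would need separate handling.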

bcorrie commented 2 months ago

@schristley where is the schema-automator tool? Is it in one of the GitHub repositories? I can't find it.

schristley commented 2 months ago

@bcorrie when I was playing around with it, I was just manually installing it in the ak-schema docker. If I remember correctly, the pip install doesn't install all dependencies; there was one that was missing.

I haven't put it in the ak-schema docker yet because I'm not sure if it conflicts with the linkml stuff or not

schristley commented 2 months ago

@bcorrie here it is, needed to also do pip install appengine-python-standard

bcorrie commented 2 months ago

@schristley the above install downgrades the urllib3 version from urllib3-2.2.1 to urllib3-1.26.18.

poetry.lock states that it requires urllib3 = ">=1.21.1,<3" so this should be fine.
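
The reasoning above can be checked mechanically. The sketch below is a minimal, standard-library-only comparison that handles plain dotted versions and simple comma-separated bounds; the helper names are assumptions for this example. For anything beyond a quick sanity check, the `packaging` library's `SpecifierSet` is the more robust choice, since this sketch ignores pre-release tags and does not pad versions of different lengths.

```python
import operator

# Comparison operators supported by this sketch.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

def parse_version(text):
    """Turn '1.26.18' into (1, 26, 18); pre-release tags are not handled."""
    return tuple(int(part) for part in text.strip().split("."))

def satisfies(version, constraint):
    """Check a dotted version against comma-separated bounds such as '>=1.21.1,<3'."""
    parsed = parse_version(version)
    for clause in constraint.split(","):
        clause = clause.strip()
        # Two-character operators are tried first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not OPS[symbol](parsed, bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True

print(satisfies("1.26.18", ">=1.21.1,<3"))
print(satisfies("2.2.1", ">=1.21.1,<3"))
```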

Should we add this to the docker file? I have a patch that adds the following to the end of the Dockerfile after the poetry update:

RUN pip install appengine-python-standard
RUN pip install quantulum3[classifier]

schristley commented 2 months ago

@bcorrie Walking through the objects in the AIRR Schema to understand what we can auto-generate into LinkML and what we cannot, here's my assessment of a few things. Can you please review?

So based upon this, my thought is to do a quick "bootstrap" conversion of a few of the AIRR objects into LinkML. Germline, Genotype, Rearrangements and (maybe) Cell? That is, we won't worry about automation right now.

However, as part of #44, we will need to consider how to manage AIRR schema changes and come up with an automation mechanism.

bcorrie commented 2 months ago

@bcorrie Walking through the objects in the AIRR Schema to understand what we can auto-generate into LinkML and what we cannot, here's my assessment of a few things. Can you please review?

I think my fundamental comment is that a mapping of any AIRR field to a LinkML slot definition is probably pretty straightforward. I think in all cases the relationships between the LinkML classes are what is going to be challenging.

For example, Germline and Genotype from a field perspective might be well defined and complete (you could generate a LinkML slot definition for each field), but the relationships between these classes and other classes in the AKC CDM are likely much less understood (at least I don't understand them). Germline is presumably linked to the AKC CDM equivalent of DataProcessing, and Genotype is presumably linked to the AKC CDM equivalent of Participant. These are ones that at least in some fashion already exist in the AIRR schema. What other links make sense for these objects in the broader AKC CDM?

  • With the new study design schema, the AIRR Repertoire object and its sub-objects like Subject, SampleProcessing, etc., will be de-normalized (and re-structured) into the CDM. We won't attempt to auto-generate LinkML, instead just perform mapping as part of the data integration.

Yes, I think so. Like most of the objects below, the field-to-slot mapping between AIRR and AKC is pretty straightforward. It is the relationships that are messy. The question we were discussing is whether there is some sort of automated tool that could help with this. The complicated part is going to be mapping the relationships (the de-normalization and re-structuring), and I am not aware of anything that would help with that. If you know of anything, let me know.

The reason this is hard is that these objects are at the core of the AKC CDM AND the relationships in the AIRR Standard do not map particularly well to the AKC CDM. We can think of things like Genotype and Rearrangement being transformed into LinkML easily because their relationships with other things in the AKC CDM are "simple" as we understand them today.

As we consider complex use cases, I would not be surprised if the requirements for complex relationships blow up. That is why I advocate for keeping the relationships in the AKC CDM as simple and basic as possible, with the anticipation that specific use cases are going to need much more complicated "knowledge graphs" overlaid on top of this simple set of relationships. If we try to capture all relationships for all things in the AKC CDM we will literally go insane 8-)

[Stuff Deleted]

So based upon this, my thought is to do a quick "bootstrap" conversion of a few of the AIRR objects into LinkML. Germline, Genotype, Rearrangements and (maybe) Cell? That is, we won't worry about automation right now.

As I mentioned, I was thinking of using the AIRRMap python class (https://github.com/sfu-ireceptor/dataloading-mongo/blob/master/dataload/airr_map.py) from the iReceptor data loader combined with the AIRR Spec Flatten tool in the iReceptor sandbox (https://github.com/sfu-ireceptor/sandbox/tree/master/airr-spec-flatten) as a first attempt at this.

I think it should be pretty easy to combine these to traverse any AIRR JSON object (e.g. Subject, Genotype, Rearrangement, ...) and generate a bunch of LinkML slots based on the AIRR definition.

I think for me the question is, are we talking about generating LinkML definitions for these objects, or actually generating LinkML compliant data from the equivalent definitions. I think generating LinkML definitions for slots should be pretty simple.
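
To make the "traverse an AIRR object and generate slots" idea concrete, here is a minimal sketch that walks the `properties` of one object definition and emits a LinkML-style slot per scalar field. The names `TYPE_TO_RANGE` and `slots_from_object` are assumptions for this example, as is the `subject` stand-in. LinkML's standard types do not include `number`, so the sketch maps it to `float`; that choice is also an assumption. Nested objects, arrays and `$ref` fields are skipped rather than recursed into.

```python
# Assumed mapping from JSON-schema scalar types to LinkML built-in types.
TYPE_TO_RANGE = {
    "string": "string",
    "boolean": "boolean",
    "integer": "integer",
    "number": "float",
}

def slots_from_object(definition):
    """Build one slot definition per scalar property of an object definition."""
    slots = {}
    for field, spec in definition.get("properties", {}).items():
        airr_type = spec.get("type")
        if airr_type not in TYPE_TO_RANGE:
            continue  # arrays, nested objects and $ref fields are not handled here
        slot = {"name": field, "range": TYPE_TO_RANGE[airr_type]}
        if "description" in spec:
            slot["description"] = spec["description"]
        slots[field] = slot
    return slots

# Small stand-in for an object definition (not the full AIRR Subject).
subject = {
    "properties": {
        "subject_id": {"type": "string", "description": "Subject ID assigned by submitter"},
        "synthetic": {"type": "boolean"},
        "age_min": {"type": "number"},
        "diagnosis": {"type": "array", "items": {"type": "object"}},
    }
}

subject_slots = slots_from_object(subject)
for name, slot in subject_slots.items():
    print(name, slot)
```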

bcorrie commented 2 months ago

  • We will hold off on receptor and reactivity until we get further into understanding the integration.

I would argue that, like all of the other AIRR schema objects, converting the fields for Receptor and Reactivity is pretty straightforward. I don't see any reason not to convert these like any other objects.

As I say above, the reason this is "complicated" is that these objects have complex relationships not only with the AIRR Standard but across the other repositories as well (IEDB, iRAD). But it is the relationships we don't understand; I think we understand the actual field names pretty well.

schristley commented 2 months ago

I think my fundamental comment is that a mapping of any AIRR field to a LinkML slot definition is probably pretty straightforward. I think in all cases the relationships between the LinkML classes are what is going to be challenging.

Yes, I agree, thanks, I should be more clear. Let's not worry about trying to bring the relationships forward, just the slots and the classes (where class is the same as the AIRR JSON object).
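
A "slots and classes only" export could look roughly like the sketch below, which assembles a LinkML-shaped schema dictionary with one class per AIRR object and no relationships between classes. The function name `build_schema`, the `https://example.org/` identifier and the sample slots are placeholders for this example. Because LinkML slots live in one shared section, the sketch raises an error if two classes define the same slot name differently instead of silently overwriting.

```python
def build_schema(name, classes):
    """Assemble a schema dict from {class name: {slot name: slot definition}}."""
    schema = {
        "id": f"https://example.org/{name}",  # placeholder identifier
        "name": name,
        "prefixes": {"linkml": "https://w3id.org/linkml/"},
        "imports": ["linkml:types"],
        "default_range": "string",
        "slots": {},
        "classes": {},
    }
    for class_name, slots in classes.items():
        for slot_name, slot in slots.items():
            existing = schema["slots"].get(slot_name)
            if existing is not None and existing != slot:
                raise ValueError(f"conflicting definitions for slot {slot_name!r}")
            schema["slots"][slot_name] = slot
        # The class only lists its slots; no inter-class relationships are added.
        schema["classes"][class_name] = {"slots": list(slots)}
    return schema

example = build_schema(
    "ak_airr_example",
    {
        "Subject": {"subject_id": {"name": "subject_id", "range": "string"}},
        "Genotype": {"receptor_genotype_id": {"name": "receptor_genotype_id", "range": "string"}},
    },
)
print(sorted(example["classes"]))
```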

bcorrie commented 2 months ago

  • MHC Allele and Genotype ???

Genotype is essentially done as stated above, no?

MHCGenotype I would suggest is "equally done" in the sense that the fields are defined and understood - they mirror Genotype for the most part, except they are simpler. There is only a single set of alleles, MHCAllele (rather than alleles, deleted alleles, and undocumented alleles as there are in Genotype).

Again, the relationships between these objects and other objects in the AKC CDM may be less defined, but mapping the fields I would suggest is pretty straightforward.

schristley commented 2 months ago

As we consider complex use cases, I would not be surprised if the requirements for complex relationships blow up. That is why I advocate for keeping the relationships in the AKC CDM as simple and basic as possible, with the anticipation that specific use cases are going to need much more complicated "knowledge graphs" overlaid on top of this simple set of relationships. If we try to capture all relationships for all things in the AKC CDM we will literally go insane 8-)

The point is well taken. With a "data model", we have well-defined relationships, often then translated into a static database design/schema, which then enforces how queries are done. A "knowledge model" needs to be more flexible to handle the complex use cases. LinkML is for data models, so we don't want to overload it and try to make it do too much. So in essence I'm agreeing with you, we'll keep the relationships in the CDM to the simple, basic and "obvious" ones.

schristley commented 2 months ago

So based upon this, my thought is to do a quick "bootstrap" conversion of a few of the AIRR objects into LinkML. Germline, Genotype, Rearrangements and (maybe) Cell? That is, we won't worry about automation right now.

As I mentioned, I was thinking of using the AIRRMap python class (https://github.com/sfu-ireceptor/dataloading-mongo/blob/master/dataload/airr_map.py) from the iReceptor data loader combined with the AIRR Spec Flatten tool in the iReceptor sandbox (https://github.com/sfu-ireceptor/sandbox/tree/master/airr-spec-flatten) as a first attempt at this.

I understand this as doing the actual data integration, versus the schema. I agree that a mapping approach like this should work very well for us.

bcorrie commented 2 months ago

However, as part of #44, we will need to consider how to manage AIRR schema changes and come up with an automation mechanism.

The approach we have taken with the AIRR Config file and the use of AIRR Flatten should go a fairly long way to making this work. We can essentially configure an iReceptor Turnkey repository or an iReceptor Gateway to support different versions of the AIRR Standard just by changing the AIRR Config file.

There are of course always special cases (e.g. string change to Ontology term) which can't be handled by a mapping, but this has gotten us a long way to making schema changes "relatively painless".
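
The general shape of such a mapping can be sketched as a table with one row per field and one column per schema version, plus a function that renames record keys from one column to another. This is only an illustration of the idea: the column names (`v_old`, `v_new`), the rows in `FIELD_MAP` and the `translate` helper are hypothetical and are not taken from the actual AIRR Config file. A pure rename like this cannot cover the special cases mentioned above, such as a string field becoming an Ontology term, because the value itself has to change shape.

```python
# Hypothetical mapping rows: the same field under two schema versions.
FIELD_MAP = [
    {"internal": "subject_id", "v_old": "subject_id", "v_new": "subject_id"},
    {"internal": "species", "v_old": "organism", "v_new": "species"},
]

def translate(record, mapping, source, target):
    """Rename the keys of `record` from the `source` column to the `target` column."""
    rename = {
        row[source]: row[target]
        for row in mapping
        if row.get(source) and row.get(target)
    }
    # Keys with no mapping row are passed through unchanged.
    return {rename.get(key, key): value for key, value in record.items()}

old_record = {"subject_id": "S1", "organism": "Homo sapiens", "extra": 1}
new_record = translate(old_record, FIELD_MAP, "v_old", "v_new")
print(new_record)
```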

schristley commented 2 months ago

I think for me the question is, are we talking about generating LinkML definitions for these objects, or actually generating LinkML compliant data from the equivalent definitions. I think generating LinkML definitions for slots should be pretty simple.

Just the definitions, the slots and the classes.

I think it should be pretty easy to combine these to traverse any AIRR JSON object (e.g. Subject, Genotype, Rearrangement, ...) and generate a bunch of LinkML slots based on the AIRR definition.

@bcorrie Great! Can I give you the task to take an initial stab at writing this?

bcorrie commented 2 months ago

I think it should be pretty easy to combine these to traverse any AIRR JSON object (e.g. Subject, Genotype, Rearrangement, ...) and generate a bunch of LinkML slots based on the AIRR definition.

@bcorrie Great! Can I give you the task to take an initial stab at writing this?

Yep, no problem...

bcorrie commented 1 month ago

Initial version implemented.

Code is here: https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/scripts/airr2akc

Initial exported schemas (a subset, although a pretty decent subset):

https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/ak_schema/schema/airr

I essentially reused https://github.com/sfu-ireceptor/sandbox/tree/master/airr-spec-flatten and mostly just changed the output generation. I disabled recursion as well so it doesn't process Objects within Objects.

It isn't handling arrays correctly (although not sure how we do that in LinkML).
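
One possible way to handle arrays, offered as a sketch rather than a description of the exporter: LinkML marks list-valued slots with `multivalued: true`, so an array of scalars can become a slot whose range is the item type. The names `SCALARS` and `slot_for` are assumptions for this example, and arrays of objects or `$ref` items are deliberately rejected because they would need a class as their range.

```python
# Assumed mapping from JSON-schema scalar types to LinkML built-in types.
SCALARS = {
    "string": "string",
    "boolean": "boolean",
    "integer": "integer",
    "number": "float",
}

def slot_for(field, spec):
    """Build a slot; arrays of scalars become multivalued slots."""
    is_array = spec.get("type") == "array"
    item_spec = spec.get("items", {}) if is_array else spec
    item_type = item_spec.get("type")
    if item_type not in SCALARS:
        # Arrays of objects or $ref items need a class range, not handled here.
        raise ValueError(f"cannot derive a scalar range for {field!r}")
    slot = {"name": field, "range": SCALARS[item_type]}
    if is_array:
        slot["multivalued"] = True
    return slot

# Stand-in field definitions for illustration.
array_slot = slot_for("keywords_study", {"type": "array", "items": {"type": "string"}})
scalar_slot = slot_for("subject_id", {"type": "string"})
print(array_slot)
print(scalar_slot)
```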

It also needs to have a mapping step so we have control if a field's attributes (e.g. name, type, range) are not the default that would be generated from the AIRR spec. I should be able to reuse the iReceptor data loader's AIRR Map capability to implement that pretty easily.
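
A mapping step of this kind might be as simple as merging per-field overrides into the generated defaults. The sketch below assumes slots are plain dictionaries keyed by field name; `apply_overrides` and the example override values are hypothetical and do not reflect the AIRR Map class's real interface. Overrides that name a field the export never produced raise an error, so typos in the mapping are caught early.

```python
def apply_overrides(slots, overrides):
    """Merge per-field attribute overrides into generated slot definitions."""
    unknown = set(overrides) - set(slots)
    if unknown:
        raise KeyError(f"overrides for unknown fields: {sorted(unknown)}")
    merged = {}
    for field, slot in slots.items():
        updated = {**slot, **overrides.get(field, {})}
        # If the name was overridden, the slot is stored under its new name.
        merged[updated["name"]] = updated
    return merged

generated = {
    "age_min": {"name": "age_min", "range": "number"},
    "sex": {"name": "sex", "range": "Sex"},
}
# Hypothetical overrides: change one range, rename one slot.
custom = {
    "age_min": {"range": "float"},
    "sex": {"name": "biological_sex"},
}
final_slots = apply_overrides(generated, custom)
print(final_slots)
```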

bcorrie commented 1 month ago

It looks like we should be able to use this to generate the Enums for fields as well. You will notice that in the export I have created LinkML fields that capture either the AIRR ontology root node (for Ontologies) or the enum values (for controlled vocabulary fields).

See the Ontology and Enum fields in the subject export: https://github.com/airr-knowledge/ak-schema/blob/airr-export/src/ak_schema/schema/airr/ak_airr_subject.yaml
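
The enum generation described here can be sketched as follows: ontology fields contribute a single permissible value for their root node, and controlled vocabulary fields contribute one permissible value per listed term. The input layout (`x-airr`, `format`, `ontology.top_node`, `enum`) is assumed to follow the AIRR schema conventions, and `camel` and `enums_from_object` are names invented for this example. One deliberate difference: a null entry in an enum list is skipped, since an unquoted `null` key in YAML parses as a null value rather than as the string "null".

```python
def camel(field):
    """Turn 'age_unit' into 'AgeUnit' for use as an enum name."""
    return "".join(part.capitalize() for part in field.split("_"))

def enums_from_object(definition):
    """Build enum definitions from ontology root nodes and enum value lists."""
    enums = {}
    for field, spec in definition.get("properties", {}).items():
        x_airr = spec.get("x-airr", {})
        values = {}
        if x_airr.get("format") == "ontology":
            top = x_airr.get("ontology", {}).get("top_node")
            if top:
                # Only the root node is recorded; child terms are not expanded.
                values[top["label"]] = {"text": top["label"], "meaning": top["id"]}
        elif "enum" in spec:
            for value in spec["enum"]:
                if value is None:
                    continue  # null marks nullability here, not a vocabulary term
                values[str(value)] = {"text": str(value)}
        if values:
            name = camel(field)
            enums[name] = {"name": name, "permissible_values": values}
    return enums

# Stand-in for part of a Subject definition (assumed layout).
subject = {
    "properties": {
        "subject_id": {"type": "string"},
        "species": {
            "x-airr": {
                "format": "ontology",
                "ontology": {"top_node": {"id": "NCBITAXON:7776", "label": "Gnathostomata"}},
            }
        },
        "sex": {"type": "string", "enum": ["male", "female", "pooled", None]},
    }
}

subject_enums = enums_from_object(subject)
print(sorted(subject_enums))
```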

bcorrie commented 1 month ago

I have changed the code so you can ask it to generate either the LinkML Slots or LinkML enums for the AIRR Schema Object of choice. For Ontology terms it just outputs the expected root node of the enum; we would still need a way to generate all of the child nodes for that enum.

For example, for Subject it generates this:

Species:
  name: Species
  permissible_values:
    Gnathostomata:
      text: Gnathostomata
      meaning: NCBITAXON:7776
Sex:
  name: Sex
  permissible_values:
    male:
      text: male
    female:
      text: female
    pooled:
      text: pooled
    hermaphrodite:
      text: hermaphrodite
    intersex:
      text: intersex
    null:
      text: null
AgeUnit:
  name: AgeUnit
  permissible_values:
    time unit:
      text: time unit
      meaning: UO:0000003

bcorrie commented 1 month ago

If you ask for LinkML slots, it generates this - referring to the correct Enums above in the range attribute.

subject_id:
  name: subject_id
  description: Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used.
  range: string
synthetic:
  name: synthetic
  description: TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display)
  range: boolean
species:
  name: species
  description: Binomial designation of subject's species
  range: Species
sex:
  name: sex
  description: Biological sex of subject
  range: Sex
age_min:
  name: age_min
  description: Specific age or lower boundary of age range.
  range: number
age_max:
  name: age_max
  description: Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null.
  range: number
age_unit:
  name: age_unit
  description: Unit of age range
  range: AgeUnit
age_event:
  name: age_event
  description: Event in the study schedule to which `Age` refers. For NCBI BioSample this MUST be `sampling`. For other implementations submitters need to be aware that there is currently no mechanism to encode to potential delta between `Age event` and `Sample collection time`, hence the chosen events should be in temporal proximity.
  range: string

[Rest of the slots deleted]

bcorrie commented 1 month ago

Files generated for most (all?) AIRR schema objects of importance to AKC here:

https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/ak_schema/schema/airr

Note some enum files are empty because there are no enums/ontologies in that particular class.