Open schristley opened 4 months ago
Hi @bcorrie so when I mentioned I'd like you to work with LinkML, I was really thinking about this. We played with that schema-automator tool, and it seemed like if we process the AIRR schema file a little bit (maybe by extracting each object separately), this might be an easy way to get the AIRR stuff into LinkML.
@schristley where is the schema-automator tool? Is it in one of the GitHub repositories? I can't find it.
@bcorrie when I was playing around with it, I was just manually installing in the ak-schema docker. If I remember, the pip install doesn't install all dependencies, there was one that was missing.
I haven't put it in the ak-schema docker yet because I'm not sure if it conflicts with the linkml stuff or not
@bcorrie here it is, needed to also do pip install appengine-python-standard
@schristley the above install downgrades the urllib3 version from urllib3-2.2.1 to urllib3-1.26.18.
poetry.lock states that it requires urllib3 = ">=1.21.1,<3" so this should be fine.
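A quick way to sanity-check that reasoning is to compare the version numbers directly. This is only a minimal sketch: the helper names `parse_version` and `in_range` are made up for illustration, and it handles plain dotted numeric versions only (no pre-release tags).

```python
def parse_version(text):
    """Turn a dotted version such as '1.26.18' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, lower, upper):
    """Return True if lower <= version < upper (the '>=lower,<upper' form)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Both the original and the downgraded urllib3 versions from this thread.
for candidate in ("2.2.1", "1.26.18"):
    print(candidate, in_range(candidate, "1.21.1", "3"))
```

Both versions mentioned above fall inside the `>=1.21.1,<3` range, which is consistent with the downgrade being acceptable.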
Should we add this to the docker file? I have a patch that adds the following to the end of the Dockerfile after the poetry update:
RUN pip install appengine-python-standard
RUN pip install quantulum3[classifier]
@bcorrie Walking through the objects in the AIRR Schema to understand what we can auto-generate into LinkML and what we cannot, here's my assessment of a few things. Can you please review?
- With the new study design schema, the AIRR Repertoire object and its sub-objects like Subject, SampleProcessing, etc., will be de-normalized (and re-structured) into the CDM. We won't attempt to auto-generate LinkML, instead just perform mapping as part of the data integration.
So based upon this, my thought is to do a quick "bootstrap" conversion of a few of the AIRR objects into LinkML. Germline, Genotype, Rearrangements and (maybe) Cell? That is, we won't worry about automation right now.
However, as part of #44, we will need to consider how to manage AIRR schema changes and come up with an automation mechanism.
I think my fundamental comment is that a mapping of any AIRR field to a LinkML slot definition is probably pretty straightforward. I think in all cases the relationships between the LinkML classes are what is going to be challenging.
For example, Germline and Genotype from a field perspective might be well defined and complete (you could generate a LinkML slot definition for each field), but the relationships between these classes and other classes in the AKC CDM are likely much less understood (at least I don't understand them). Germline is presumably linked to the AKC CDM equivalent of DataProcessing and Genotype is presumably linked to the AKC CDM equivalent of Participant. These are ones that at least in some fashion already exist in the AIRR schema. What other links make sense for these objects in the broader AKC CDM?
- With the new study design schema, the AIRR Repertoire object and its sub-objects like Subject, SampleProcessing, etc., will be de-normalized (and re-structured) into the CDM. We won't attempt to auto-generate LinkML, instead just perform mapping as part of the data integration.
Yes, I think so. Like most of the objects below, the field to slot mapping between AIRR and AKC is pretty straightforward. It is the relationships that are messy. The question we were discussing is whether there is some sort of automated tool that we might be able to use to help with this. The complicated part is going to be mapping the relationships (the de-normalization and re-structure) and I am not aware of anything that would help with this. If you know of anything, let me know.
The reason this is hard is that these objects are at the core of the AKC CDM AND the relationships in the AIRR Standard do not map particularly well to the AKC CDM. We can think of things like Genotype and Rearrangement being transformed into LinkML easily because their relationships with other things in the AKC CDM are "simple" as we understand them today.
As we consider complex use cases, I would not be surprised if the requirements for complex relationships blow up. That is why I advocate for keeping the relationships in the AKC CDM as simple and basic as possible, with the anticipation that specific use cases are going to need much more complicated "knowledge graphs" overlaid on top of this simple set of relationships. If we try and capture all relationships for all things in the AKC CDM we will literally go insane 8-)
[Stuff Deleted]
So based upon this, my thought is to do a quick "bootstrap" conversion of a few of the AIRR objects into LinkML. Germline, Genotype, Rearrangements and (maybe) Cell? That is, we won't worry about automation right now.
As I mentioned, I was thinking of using the AIRRMap python class (https://github.com/sfu-ireceptor/dataloading-mongo/blob/master/dataload/airr_map.py) from the iReceptor data loader combined with the AIRR Spec Flatten tool in the iReceptor sandbox (https://github.com/sfu-ireceptor/sandbox/tree/master/airr-spec-flatten) as a first attempt at this.
I think it should be pretty easy to combine these to traverse any AIRR JSON object (e.g. Subject, Genotype, Rearrangement, ...) and generate a bunch of LinkML slots based on the AIRR definition.
I think for me the question is, are we talking about generating LinkML definitions for these objects, or actually generating LinkML-compliant data from the equivalent definitions? I think generating LinkML definitions for slots should be pretty simple.
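To make the "definitions only" option concrete, here is a minimal sketch of turning the fields of one AIRR-style object into LinkML-style slot definitions. It is not the AIRRMap or AIRR Spec Flatten code; the `SUBJECT` dictionary is a hypothetical, trimmed-down stand-in for a real AIRR object, and `airr_to_slots` is an illustrative name.

```python
# Hypothetical, trimmed-down stand-in for an AIRR object definition.
SUBJECT = {
    "properties": {
        "subject_id": {"type": "string", "description": "Subject ID assigned by submitter."},
        "synthetic": {"type": "boolean", "description": "TRUE for synthetic libraries."},
        "age_min": {"type": "number", "description": "Lower boundary of age range."},
    }
}


def airr_to_slots(airr_object):
    """Build one slot definition per field, without following nested objects."""
    slots = {}
    for name, spec in airr_object.get("properties", {}).items():
        slots[name] = {
            "name": name,
            "description": spec.get("description", ""),
            # Assumption: fall back to 'string' when no type is declared.
            "range": spec.get("type", "string"),
        }
    return slots


for slot_name, slot in airr_to_slots(SUBJECT).items():
    print(slot_name, slot["range"])
```

The relationships between classes are deliberately left out here; only the per-field attributes are carried over.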
- We will hold off on receptor and reactivity until we get further into understanding the integration.
I would argue that like all of the other AIRR schema objects, converting the fields for Receptor and Reactivity is pretty straightforward. I don't see any reason not to convert these like any other objects.
As I say above, the reason this is "complicated" is that these objects have complex relationships not only with the AIRR Standard but across the other repositories as well (IEDB, iRAD). But it is the relationships we don't understand, I think we understand pretty well the actual field names.
I think my fundamental comment is that a mapping of any AIRR field to a LinkML slot definition is probably pretty straightforward. I think in all cases the relationships between the LinkML classes are what is going to be challenging.
Yes, I agree, thanks, I should be more clear. Let's not worry about trying to bring the relationships forward, just the slots and the classes (where class is the same as the AIRR JSON object).
- MHC Allele and Genotype ???
Genotype is essentially done as stated above, no?
MHCGenotype I would suggest is "equally done" in the sense that the fields are defined and understood - they mirror Genotype for the most part except they are simpler. There is only a set of alleles, MHCAllele (rather than alleles, deleted alleles, and undocumented alleles as there are in Genotype).
Again, the relationships between these objects and other objects in the AKC CDM may be less defined, but mapping the fields I would suggest is pretty straightforward.
As we consider complex use cases, I would not be surprised if the requirements for complex relationships blow up. That is why I advocate for keeping the relationships in the AKC CDM as simple and basic as possible, with the anticipation that specific use cases are going to need much more complicated "knowledge graphs" overlaid on top of this simple set of relationships. If we try and capture all relationships for all things in the AKC CDM we will literally go insane 8-)
The point is well taken. With a "data model", we have well-defined relationships, often then translated into a static database design/schema, which then enforces how queries are done. A "knowledge model" needs to be more flexible to handle the complex use cases. LinkML is for data models, so we don't want to overload it and try to make it do too much. So in essence I'm agreeing with you, we'll keep the relationships in the CDM to the simple, basic and "obvious" ones.
So based upon this, my thought is to do a quick "bootstrap" conversion of a few of the AIRR objects into LinkML. Germline, Genotype, Rearrangements and (maybe) Cell? That is, we won't worry about automation right now.
As I mentioned, I was thinking of using the AIRRMap python class (https://github.com/sfu-ireceptor/dataloading-mongo/blob/master/dataload/airr_map.py) from the iReceptor data loader combined with the AIRR Spec Flatten tool in the iReceptor sandbox (https://github.com/sfu-ireceptor/sandbox/tree/master/airr-spec-flatten) as a first attempt at this.
I understand this as doing the actual data integration, versus the schema. I agree that a mapping approach like this should work very well for us.
However, as part of #44, we will need to consider how to manage AIRR schema changes and come up with an automation mechanism.
The approach we have taken with the AIRR Config file and the use of AIRR Flatten should go a fairly long way to making this work. We can essentially configure an iReceptor Turnkey repository or an iReceptor Gateway so that it supports different versions of the AIRR Standard just by changing the AIRR Config file.
There are of course always special cases (e.g. string change to Ontology term) which can't be handled by a mapping, but this has gotten us a long way to making schema changes "relatively painless".
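As a rough illustration of the config-driven idea, the sketch below renames incoming fields according to a per-version mapping table. Everything here is hypothetical: the `MAPPING` table, the version labels, and the field names are invented for the example and are not taken from the real AIRR Config file.

```python
# Hypothetical mapping: incoming field name -> internal field name, per version.
MAPPING = {
    "v_old": {"subject_id": "subject_id", "organism": "species"},
    "v_new": {"subject_id": "subject_id", "species": "species"},
}


def to_internal(record, version, mapping=MAPPING):
    """Rename the fields of one record; fields absent from the mapping are dropped."""
    columns = mapping[version]
    return {columns[key]: value for key, value in record.items() if key in columns}


old_record = {"subject_id": "S1", "organism": "Homo sapiens"}
new_record = {"subject_id": "S1", "species": "Homo sapiens"}
print(to_internal(old_record, "v_old") == to_internal(new_record, "v_new"))
```

A pure rename table like this cannot express a type change such as string to Ontology term, which is the special case noted above.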
I think for me the question is, are we talking about generating LinkML definitions for these objects, or actually generating LinkML-compliant data from the equivalent definitions? I think generating LinkML definitions for slots should be pretty simple.
Just the definitions, the slots and the classes.
I think it should be pretty easy to combine these to traverse any AIRR JSON object (e.g. Subject, Genotype, Rearrangement, ...) and generate a bunch of LinkML slots based on the AIRR definition.
@bcorrie Great! Can I give you the task to take an initial stab at writing this?
Yep, no problem...
Initial version implemented.
Code is here: https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/scripts/airr2akc
Initial exported schemas (a subset, although a pretty decent subset):
https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/ak_schema/schema/airr
I essentially reused https://github.com/sfu-ireceptor/sandbox/tree/master/airr-spec-flatten and mostly just changed the output generation. I disabled recursion as well so it doesn't process Objects within Objects.
It isn't handling arrays correctly (although not sure how we do that in LinkML).
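One possible way to handle arrays, offered as a sketch rather than what the export script does: LinkML slots have a `multivalued` attribute, so a JSON-schema style `type: array` field could become a multivalued slot whose range comes from the array's `items`. The function name and the example field are hypothetical.

```python
def array_aware_slot(name, spec):
    """Build a slot definition, marking array fields as multivalued."""
    slot = {"name": name, "description": spec.get("description", "")}
    if spec.get("type") == "array":
        # Assumption: the element type lives under 'items', defaulting to string.
        items = spec.get("items", {})
        slot["range"] = items.get("type", "string")
        slot["multivalued"] = True
    else:
        slot["range"] = spec.get("type", "string")
    return slot


# Hypothetical array-valued field for illustration.
example = {"type": "array", "items": {"type": "string"}, "description": "A list of labels."}
print(array_aware_slot("example_labels", example))
```

Arrays of nested objects would still need a decision about which class the range points to, which this sketch does not attempt.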
It also needs to have a mapping step so we have control if a field's attributes (e.g. name, type, range) are not the default that would be generated from the AIRR spec. I should be able to reuse the iReceptor data loader's AIRR Map capability to implement that pretty easily.
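The mapping step could look roughly like the sketch below: generate the default slots first, then overlay a table of per-field overrides. This is not the iReceptor AIRR Map code; `apply_overrides` and the override table are hypothetical. The example override swaps `number` for `float`, on the assumption that `float` is the closer LinkML built-in type.

```python
import copy


def apply_overrides(slots, overrides):
    """Return a copy of the slots with per-field attribute overrides applied."""
    result = copy.deepcopy(slots)
    for slot_name, changes in overrides.items():
        if slot_name not in result:
            raise KeyError(f"override refers to unknown slot: {slot_name}")
        result[slot_name].update(changes)
    return result


# Defaults as they might come out of the generator.
default_slots = {
    "age_min": {"name": "age_min", "range": "number"},
    "subject_id": {"name": "subject_id", "range": "string"},
}
# Hypothetical override table.
overrides = {"age_min": {"range": "float"}}

print(apply_overrides(default_slots, overrides)["age_min"])
```

Raising on an unknown slot name is a design choice so that a stale override is noticed when the AIRR spec renames or removes a field.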
It looks like we should be able to use this to generate the Enums for fields as well. You will notice in the export I have created LinkML fields that capture either the AIRR ontology root node (for Ontologies) or the enum values (for controlled vocabulary fields).
See the Ontology and Enum fields in the subject export: https://github.com/airr-knowledge/ak-schema/blob/airr-export/src/ak_schema/schema/airr/ak_airr_subject.yaml
I have changed the code so you can ask it to generate either the LinkML Slots or LinkML enums for the AIRR Schema Object of choice. For Ontology terms it just outputs the expected root node of the enum. We would still need a way to generate all of the child nodes for that enum.
For example, for Subject it generates this:
Species:
  name: Species
  permissible_values:
    Gnathostomata:
      text: Gnathostomata
      meaning: NCBITAXON:7776
Sex:
  name: Sex
  permissible_values:
    male:
      text: male
    female:
      text: female
    pooled:
      text: pooled
    hermaphrodite:
      text: hermaphrodite
    intersex:
      text: intersex
    null:
      text: null
AgeUnit:
  name: AgeUnit
  permissible_values:
    time unit:
      text: time unit
      meaning: UO:0000003
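For reference, the enum output above could be produced along these lines. This is a sketch, not the export script: `airr_to_enum` is an illustrative name, and the layout of the input (an `enum` list for controlled vocabularies, and `x-airr` / `ontology` / `top_node` with `id` and `label` for ontologies) is my assumption about the AIRR spec.

```python
def airr_to_enum(enum_name, spec):
    """Build a LinkML-style enum from an ontology root node or an enum list."""
    values = {}
    # Assumed location of the ontology root node in the field definition.
    top_node = spec.get("x-airr", {}).get("ontology", {}).get("top_node")
    if top_node:
        values[top_node["label"]] = {"text": top_node["label"], "meaning": top_node["id"]}
    for value in spec.get("enum", []):
        # A null entry in the vocabulary is written out as the text 'null'.
        label = "null" if value is None else str(value)
        values[label] = {"text": label}
    return {enum_name: {"name": enum_name, "permissible_values": values}}


species_spec = {
    "x-airr": {"ontology": {"top_node": {"id": "NCBITAXON:7776", "label": "Gnathostomata"}}}
}
sex_spec = {"enum": ["male", "female", None]}

print(airr_to_enum("Species", species_spec))
print(airr_to_enum("Sex", sex_spec))
```

As with the real export, only the root node is emitted for ontology fields; the child terms are not expanded.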
If you ask for LinkML slots, it generates this - referring to the correct Enums above in the range attribute.
subject_id:
  name: subject_id
  description: Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used.
  range: string
synthetic:
  name: synthetic
  description: TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display)
  range: boolean
species:
  name: species
  description: Binomial designation of subject's species
  range: Species
sex:
  name: sex
  description: Biological sex of subject
  range: Sex
age_min:
  name: age_min
  description: Specific age or lower boundary of age range.
  range: number
age_max:
  name: age_max
  description: Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null.
  range: number
age_unit:
  name: age_unit
  description: Unit of age range
  range: AgeUnit
age_event:
  name: age_event
  description: Event in the study schedule to which `Age` refers. For NCBI BioSample this MUST be `sampling`. For other implementations submitters need to be aware that there is currently no mechanism to encode to potential delta between `Age event` and `Sample collection time`, hence the chosen events should be in temporal proximity.
  range: string
[Rest of the slots deleted]
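A simple consistency check on this kind of output is to confirm that every slot's range is either a known type or one of the generated enums. The sketch below is hypothetical: `unresolved_ranges` is an illustrative name, and `BUILTIN_TYPES` is my assumption of a subset of LinkML's built-in types, not an authoritative list.

```python
# Assumed subset of LinkML built-in type names (not an exhaustive list).
BUILTIN_TYPES = {"string", "integer", "boolean", "float", "double", "decimal",
                 "date", "datetime", "uri", "uriorcurie"}


def unresolved_ranges(slots, enums):
    """Return {slot name: range} for ranges that are neither built-in nor an enum."""
    known = BUILTIN_TYPES | set(enums)
    return {name: slot["range"] for name, slot in slots.items()
            if slot.get("range") not in known}


slots = {
    "species": {"range": "Species"},
    "sex": {"range": "Sex"},
    "age_min": {"range": "number"},
    "subject_id": {"range": "string"},
}
enums = {"Species": {}, "Sex": {}, "AgeUnit": {}}

print(unresolved_ranges(slots, enums))
```

Under that assumed type list, `number` is reported as unresolved, which suggests the JSON-schema `number` type may need to be mapped to something like `float` in the mapping step.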
Files generated for most (all?) AIRR schema objects of importance to AKC here:
https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/ak_schema/schema/airr
Note some enum files are empty because there are no enums/ontologies in that particular class.
AKC will extend and integrate many of the classes/objects in the AIRR Data Model. LinkML has an importer for JSON schema. We want to automate the import/translation process so that we can run it when the AIRR Data Model changes.
x-airr properties to LinkML. For example, identifier is related to LinkML's identifier
x-airr properties need to be added to AIRR Standards for LinkML.