LzLang / Zendro-Converter

A script to convert data from BrAPI to Zendro-API

Open questions about the ISA JSON Model #4

Open asishallab opened 1 year ago

asishallab commented 1 year ago

Open Questions

When parsing and reading through the ISA JSON Model a few questions arose. They are listed here.

How to treat properties of type object

In some cases, BrAPI JSON data models have properties of type object. We can model these in Zendro in a number of ways.

Probably this should be decided on a case-by-case basis?

Example: additionalInfo and additionalProperties e.g. in Person.json.

Structure of additionalInfo

The definition taken from Person.json says:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "properties": {
        "additionalInfo": {
            "additionalProperties": {
                "type": "string"
            },
            "description": "Additional arbitrary info",
            "type": [
                "null",
                "object"
            ]
        },

So, according to this specification, a person can have additional info. But what is the structure of this object? Does it mean that additionalInfo can hold any number of additional properties, each with a value of type string?

Reply from meeting with the BrApi group

additionalInfo should be the only case where we see unformatted data. In the BrAPI test server, we serialize this object and store it as JSON.
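A minimal sketch of that approach, assuming the serialized object ends up in a single string column. The helper names and the sample person record are made up for illustration and are not taken from the BrAPI test server:

```python
import json


def serialize_additional_info(additional_info):
    """Return a JSON string suitable for a text column, or None if absent."""
    if additional_info is None:  # the schema allows "null"
        return None
    # sort_keys keeps the stored text deterministic across runs
    return json.dumps(additional_info, sort_keys=True)


def deserialize_additional_info(stored):
    """Inverse of serialize_additional_info."""
    return None if stored is None else json.loads(stored)


# Illustrative data only
person = {"firstName": "Ada", "additionalInfo": {"office": "B12", "nickname": "ada"}}
stored = serialize_additional_info(person["additionalInfo"])
print(stored)
```

Storing the object as opaque text keeps the Zendro model flat, at the cost of not being able to filter on individual keys inside additionalInfo.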

How to model externalReferences?

The array of external references is found in the Person model:

"externalReferences": {
            "description": "An array of external reference ids. These are references to this piece of data in an external system. Could be a simple string or a URI.",
            "items": {
                "properties": {
                    "referenceId": {
                        "description": "The external reference ID. Could be a simple string or a URI.",
                        "type": [
                            "null",
                            "string"
                        ]
                    },
                    "referenceSource": {
                        "description": "An identifier for the source system or database of this reference",
                        "type": [
                            "null",
                            "string"
                        ]
                    }
                },
                "required": [
                    "referenceId",
                    "referenceSource"
                ],
                "type": "object"
            },
            "title": "ExternalReferences",
            "type": [
                "null",
                "array"
            ]
        },

There are several questions about this specification:

Response from the BrApi development team

Another example would be the field-book-App:

Validation

Zendro has the capability to use any validation function on provided data. The Zendro framework can validate both data formats (syntactically) and data values (semantically). However, if the database has to be queried, we should consider whether this might be a performance bottleneck.

Example taken from Sample.json:

"column": {
            "description": "The Column identifier for this `Sample` location in the `Plate`",
            "maximum": 12,
            "minimum": 1,
            "type": [
                "null",
                "integer"
            ]
        },
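The constraints in this snippet are purely syntactic, so they could be checked without querying the database. A rough sketch in Python follows; the function name validate_column is hypothetical and this is not Zendro's actual validation hook:

```python
def validate_column(value, minimum=1, maximum=12):
    """Check a Sample `column` value against the constraints quoted above."""
    if value is None:
        # "null" is one of the allowed types
        return True
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return minimum <= value <= maximum


print(validate_column(7))
print(validate_column(13))
```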

Questions:

Relationships / Associations

For some associations we see the foreign keys implemented in the JSON Specs, e.g. in Sample.json:

 "germplasmDbId": {
            "description": "The ID which uniquely identifies a `Germplasm`",
            "type": [
                "null",
                "string"
            ]
        },
        "observationUnitDbId": {
            "description": "The ID which uniquely identifies an `ObservationUnit`",
            "type": [
                "null",
                "string"
            ]
        },
        "plateDbId": {
            "description": "The ID which uniquely identifies a `Plate` of `Sample`",
            "type": [
                "null",
                "string"
            ]
        },

Here, we can conclude from the name of the foreign key and its existence:

However a formal specification of all relationships would be extremely helpful and resolve open questions.
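Until such a formal specification exists, the name-based inference could be sketched roughly as below. The helper infer_foreign_keys and the sampleDbId entry are assumptions for illustration and are not part of the converter or of the excerpt quoted above. Note that a property name alone does not reveal the cardinality of the association:

```python
def infer_foreign_keys(model_name, property_names):
    """Map each `<something>DbId` property to the model it presumably points to.

    The model's own primary key (e.g. sampleDbId on Sample) is skipped.
    """
    suffix = "DbId"
    own_key = model_name[0].lower() + model_name[1:] + suffix
    inferred = {}
    for prop in property_names:
        if prop.endswith(suffix) and len(prop) > len(suffix) and prop != own_key:
            stem = prop[: -len(suffix)]
            # germplasm -> Germplasm, observationUnit -> ObservationUnit
            inferred[prop] = stem[0].upper() + stem[1:]
    return inferred


sample_properties = [
    "sampleDbId",  # assumed primary key, not shown in the excerpt
    "germplasmDbId",
    "observationUnitDbId",
    "plateDbId",
    "plateName",
]
print(infer_foreign_keys("Sample", sample_properties))
```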

To be excluded properties

In some data models, foreign keys are stated. In addition, to spare the user from sending another request to the RESTful API, some properties of the associated (related) models are stored as well. See this example taken from Sample.json:

        "plateDbId": {
            "description": "The ID which uniquely identifies a `Plate` of `Sample`",
            "type": [
                "null",
                "string"
            ]
        },
        "plateName": {
            "description": "The human readable name of a `Plate`",
            "type": [
                "null",
                "string"
            ]
        },

With GraphQL these properties are not required. GraphQL specifically allows the user to fetch all desired data within a single HTTP request, including properties of related (associated) data models. Furthermore, once we have a formal description of the relationships between data models, foreign keys would ideally no longer be listed in the data model definitions. Is there a way to recognize these "to be excluded" properties and leave them out of the final GraphQL data model definitions? A quick and dirty solution would be a simple exclusion list.
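Such an exclusion list could be as simple as a per-model set of property names. The sketch below is only an illustration: the EXCLUDED mapping and the strip_excluded helper are hypothetical, and which properties actually belong on the list is still an open question:

```python
# Hypothetical exclusion list: model name -> properties copied from related models
EXCLUDED = {
    "Sample": {"plateName"},
}


def strip_excluded(model_name, properties, exclusion_list=EXCLUDED):
    """Return a copy of `properties` without the excluded entries."""
    excluded = exclusion_list.get(model_name, set())
    return {name: spec for name, spec in properties.items() if name not in excluded}


sample = {
    "plateDbId": {"type": ["null", "string"]},
    "plateName": {"type": ["null", "string"]},
}
print(sorted(strip_excluded("Sample", sample)))
```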

Response from the GraphQL development group

LzLang commented 9 months ago

Request to unify the way relationships/associations are implemented.

Almost all schemas use a uniform format for associations, for example observationVariables from Study.json:

{
    "$defs": {
        "Study": {
            "properties": {
                "observationVariables": {
                    "description": "The list of Observation Variables being used in this study. \n\nThis list is intended to be the wishlist of variables to collect in this study. It may or may not match the set of variables used in the collected observation records. ",
                    "items": {
                        "$ref": "ObservationVariable.json#/$defs/ObservationVariable"
                    },
                    "referencedAttribute": "studies",
                    "relationshipType": "many-to-many",
                    "type": "array"
                }
            }
        }
    },
    "$id": "https://brapi.org/Specification/BrAPI-Schema/BrAPI-Core/Study.json",
    "$schema": "http://json-schema.org/draft/2020-12/schema"
}

As you can see, the association is a property in its own right, so the relationship can be converted automatically without problems. While working on the associations, I noticed that three of them are defined differently from all the others. These are:

These three associations are not defined as properties of their own but rather as nested properties. As an example, parentGermplasm:

{
    "$defs": {
        "PedigreeNode": {
            "properties": {
                "parents": {
                    "description": "A list of parent germplasm references in the pedigree tree for this germplasm. These represent edges in the tree, connecting to other nodes.\n<br/> Typically, this array should only have one parent (clonal or self) or two parents (cross). In some special cases, there may be more parents, usually when the exact parent is not known. \n<br/> If the parameter 'includeParents' is set to false, then this array should be empty, null, or not present in the response.",
                    "items": {
                        "properties": {
                            "parentGermplasm": {
                                "$ref": "Germplasm.json#/$defs/Germplasm",
                                "description": "The ID which uniquely identifies a parent germplasm",
                                "referencedAttribute": "progenyPedigreeNodes",
                                "relationshipType": "many-to-one"
                            },
                            "parentType": {
                                "description": "The type of parent used during crossing. Accepted values for this field are 'MALE', 'FEMALE', 'SELF', 'POPULATION', and 'CLONAL'. \n\nIn a pedigree record, the 'parentType' describes each parent of a particular germplasm. \n\nIn a progeny record, the 'parentType' is used to describe how this germplasm was crossed to generate a particular progeny. \nFor example, given a record for germplasm A, having a progeny B and C. The 'parentType' field for progeny B item refers \nto the 'parentType' of A toward B. The 'parentType' field for progeny C item refers to the 'parentType' of A toward C.\nIn this way, A could be a male parent to B, but a female parent to C. ",
                                "enum": [
                                    "MALE",
                                    "FEMALE",
                                    "SELF",
                                    "POPULATION",
                                    "CLONAL"
                                ],
                                "type": "string"
                            }
                        },
                        "required": [
                            "germplasmDbId",
                            "parentType"
                        ],
                        "type": "object"
                    },
                    "type": [
                        "null",
                        "array"
                    ]
                }

            }
        }
    },
    "$id": "https://brapi.org/Specification/BrAPI-Schema/BrAPI-Germplasm/PedigreeNode.json",
    "$schema": "http://json-schema.org/draft/2020-12/schema"
}

As you can see, the association is defined here as a nested property of the property parents. The converter ignores nested properties, so the association is ignored as well.

Is it possible to unify the format of the associations and define each of them as an individual property? This would be very helpful!
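To make the difference concrete, here is a small sketch of how nested associations could at least be detected in a schema. The function find_associations is hypothetical and is not how the converter works today; the input is a trimmed-down version of the PedigreeNode properties shown above:

```python
def find_associations(properties, path=()):
    """Yield (path, definition) for every property that has a relationshipType,
    descending into inline object definitions and array items."""
    for name, definition in properties.items():
        here = path + (name,)
        if "relationshipType" in definition:
            yield here, definition
        # Inline properties may sit on the property itself or under `items`
        for child in (definition, definition.get("items")):
            if isinstance(child, dict) and isinstance(child.get("properties"), dict):
                yield from find_associations(child["properties"], here)


pedigree_node_properties = {
    "parents": {
        "items": {
            "properties": {
                "parentGermplasm": {
                    "$ref": "Germplasm.json#/$defs/Germplasm",
                    "relationshipType": "many-to-one",
                },
                "parentType": {"type": "string"},
            },
            "type": "object",
        },
        "type": ["null", "array"],
    }
}

found = list(find_associations(pedigree_node_properties))
nested_only = [".".join(path) for path, _ in found if len(path) > 1]
print(nested_only)
```

Any path longer than one element points at an association that a converter looking only at top-level properties would miss.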

BrapiCoordinatorSelby commented 7 months ago

The siblingsGermplasm is something I can fix immediately. But parentGermplasm and progenyGermplasm are a little bit tricky. While they are referencing an array of Germplasm elements, they also need the additional metadata parentType associated with each Germplasm. I think we need some kind of polymorphism for the Germplasm entity in this case. I think the model proposed in this Blog post might work for us: https://json-schema.org/blog/posts/modelling-inheritance

It would look something like this:

{
    "$defs": {
        "PedigreeNode": {
            "properties": {
                "parents": {
                    "description": "A list of parent germplasm referen...,",
                    "referencedAttribute": "progenyPedigreeNodes",
                    "relationshipType": "many-to-one",
                    "items": {
                        "type": "object",
                        "$ref": "Germplasm.json#/$defs/Germplasm",
                        "properties": {
                            "parentType": {
                                "description": "The type of parent used du... ",
                                "enum": ["MALE", "FEMALE","SELF", "POPULATION", "CLONAL" ],
                                "type": "string"
                            }
                        },
                        "required": ["germplasmDbId","parentType"]
                    },
                    "type": ["null","array"]
                }
            }
        }
    },
    "$id": "https://brapi.org/Specification/BrAPI-Schema/BrAPI-Germplasm/PedigreeNode.json",
    "$schema": "http://json-schema.org/draft/2020-12/schema"
}

@LzLang Will this work for Zendro? Will it be able to pick up the reference to Germplasm AND keep the additional property parentType? I don't know how polymorphism works with GraphQL...

asishallab commented 6 months ago

Notes on how to resolve the above issue(s)

"Nested relationships"

So, any one-to-one or one-to-many relation to objects that do not have a separate data model definition we dub "nested". It would be helpful to discontinue the use of such nested relationships, give those objects separate JSON data model definitions, and then define the relationships as in all other cases.

List of explicitly defined foreign keys

Some data models have foreign keys stated, which should be excluded from the "standard" data model definition. @LzLang will provide us with a list of these keys in order to remove them from the JSON model definitions.

Note that currently in the context of automated data warehouse generation with Zendro, we automatically create foreign keys for each association.

"Compound foreign keys"

Zendro only supports single foreign keys. Of course, we could have one for the mother germplasm ID and another for the father. This would be a solution wherever we know how many associations we have to the same data model.

LzLang commented 5 months ago

Possible Solution for nested properties

Hello @BrapiCoordinatorSelby ,

we worked on the nested properties issue and tried to separate those properties into their own models. We used your Cross.json schema and modified it. Could you please review the idea and give us your opinion? Basically, we have to modify the schema manually.

Cross.json now (condensed to the changed attributes):

{
    "$defs": {
        "Cross": {
            "properties": {
                "crossAttributes": {
                    "referencedAttribute": "cross",
                    "relationshipType": "one-to-many",
                    "items": {
                        "$ref": "CrossAttribute.json#/$defs/CrossAttribute",
                        "description": "Set of custom attributes associated with a cross"
                    },
                    "type": [
                        "null",
                        "array"
                    ]
                },
                "externalReferences": {
                    "referencedAttribute": "cross",
                    "relationshipType": "one-to-many",
                    "items": {
                        "$ref": "CrossExternalReferences.json#/$defs/CrossExternalReferences",
                        "description": "An array of external reference ids. These are references to this piece of data in an external system. Could be a simple string or a URI."               
                    },
                    "type": [
                        "null",
                        "array"
                    ]
                },
                "parent1": {
                    "$ref": "Germplasm.json#/$defs/Germplasm",
                    "description": "the unique identifier for a germplasm",
                    "referencedAttribute": "parent1Childs",
                    "relationshipType": "many-to-one"
                },
                "parent2": {
                    "$ref": "Germplasm.json#/$defs/Germplasm",
                    "description": "the unique identifier for a germplasm",
                    "referencedAttribute": "parent2Childs",
                    "relationshipType": "many-to-one"
                },
                "pollinationEvents": {
                    "referencedAttribute": "cross",
                    "relationshipType": "one-to-many",
                    "items": {
                        "$ref": "CrossPollinationEvent.json#/$defs/CrossPollinationEvent",
                        "description": "The list of pollination events that occurred for this cross"
                    },
                    "type": [
                        "null",
                        "array"
                    ]
                }
            },
            "required": [
                "crossDbId"
            ],
            "title": "Cross",
            "type": "object"
        }
    },
    "$id": "https://brapi.org/Specification/BrAPI-Schema/BrAPI-Germplasm/Cross.json",
    "$schema": "http://json-schema.org/draft/2020-12/schema"
}

We created the following models:

CrossAttribute

{
    "$defs": {
        "CrossAttribute": {
            "properties": {
                "cross_attribute_ID": {
                    "description": "the unique identifier for a cross attribute",
                    "type": "string"
                },
                "crossAttributeName": {
                    "description": "the human readable name of a cross attribute",
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "crossAttributeValue": {
                    "description": "the value of a cross attribute",
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "cross": {
                    "$ref": "Cross.json#/$defs/Cross",
                    "description": "The unique identifier for a Cross",
                    "referencedAttribute": "crossAttributes",
                    "relationshipType": "many-to-one"
                }
            },
            "required": [
                "cross_attribute_ID"
            ],
            "title": "CrossAttribute",
            "type": "object"
        }
    },
    "$id": "https://brapi.org/Specification/BrAPI-Schema/BrAPI-Germplasm/CrossAttribute.json",
    "$schema": "http://json-schema.org/draft/2020-12/schema"
}

CrossExternalReferences

{
    "$defs": {
        "CrossExternalReferences": {
            "properties": {
                "reference_ID": {
                    "description": "The external reference ID. Could be a simple string or a URI.",
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "referenceSource": {
                    "description": "An identifier for the source system or database of this reference",
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "cross": {
                    "$ref": "Cross.json#/$defs/Cross",
                    "description": "The unique identifier for a Cross",
                    "referencedAttribute": "externalReferences",
                    "relationshipType": "many-to-one"
                }
            },
            "required": [
                "reference_ID"
            ],
            "title": "CrossExternalReferences",
            "type": "object"
        }
    },
    "$id": "https://brapi.org/Specification/BrAPI-Schema/BrAPI-Germplasm/CrossExternalReferences.json",
    "$schema": "http://json-schema.org/draft/2020-12/schema"
}

CrossPollinationEvent

{
    "$defs": {
        "CrossPollinationEvent": {
            "properties": {
                "pollination_ID": {
                    "description": "The unique identifier for this pollination event",
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "pollinationSuccessful": {
                    "description": "True if the pollination was successful",
                    "type": [
                        "null",
                        "boolean"
                    ]
                },
                "pollinationTimeStamp": {
                    "description": "The timestamp when the pollination took place",
                    "format": "date-time",
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "cross": {
                    "$ref": "Cross.json#/$defs/Cross",
                    "description": "The unique identifier for a Cross",
                    "referencedAttribute": "pollinationEvents",
                    "relationshipType": "many-to-one"
                }

            },
            "required": [
                "pollination_ID"
            ],
            "title": "CrossPollinationEvent",
            "type": "object"
        }
    },
    "$id": "https://brapi.org/Specification/BrAPI-Schema/BrAPI-Germplasm/CrossPollinationEvent.json",
    "$schema": "http://json-schema.org/draft/2020-12/schema"
}

In the original Cross model, there were two special nested properties, "parent1" and "parent2". Those properties were basically just an association to Germplasm to link the parents. Instead of creating a separate model for those two properties, we just created an association to Germplasm.json.

Cross:

                "parent1": {
                    "$ref": "Germplasm.json#/$defs/Germplasm",
                    "description": "the unique identifier for a germplasm",
                    "referencedAttribute": "parent1Childs",
                    "relationshipType": "many-to-one"
                },
                "parent2": {
                    "$ref": "Germplasm.json#/$defs/Germplasm",
                    "description": "the unique identifier for a germplasm",
                    "referencedAttribute": "parent2Childs",
                    "relationshipType": "many-to-one"
                },

Germplasm.json

                "parent1Childs": {
                    "title": "parent1Childs",
                    "description": "Childs of the germplasm",
                    "referencedAttribute": "parent1",
                    "relationshipType": "one-to-many",
                    "items": {
                        "$ref": "Cross.json#/$defs/Cross",
                        "description": "Crosses"
                    },
                    "type": [
                        "null",
                        "array"
                    ]
                },
                "parent2Childs": {
                    "title": "parent2Childs",
                    "description": "Childs of the germplasm",
                    "referencedAttribute": "parent2",
                    "relationshipType": "one-to-many",
                    "items": {
                        "$ref": "Cross.json#/$defs/Cross",
                        "description": "Crosses"
                    },
                    "type": [
                        "null",
                        "array"
                    ]
                }

Way of standardizing primary and foreign keys

Currently primary and foreign keys are defined the same way, e.g. from Cross:

{
    "$defs": {
        "Cross": {
            "properties": {
                "crossDbId": {
                    "description": "the unique identifier for a cross",
                    "type": "string"
                },
                "parent1": {
                    "properties": {
                        "germplasmDbId": {
                            "description": "the unique identifier for a germplasm",
                            "type": [
                                "null",
                                "string"
                            ]
                        }
                    },
                    "type": [
                        "null",
                        "object"
                    ]
                }
            },
            "required": [
                "crossDbId"
            ],
            "title": "Cross",
            "type": "object"
        }
    },
    "$id": "https://brapi.org/Specification/BrAPI-Schema/BrAPI-Germplasm/Cross.json",
    "$schema": "http://json-schema.org/draft/2020-12/schema"
}

The primary key crossDbId and the foreign key germplasmDbId are defined the same way. In our project, we name primary keys following the pattern [model]_ID.

For foreign keys we use a similar pattern. As an example, take listOwnerPerson from List.json:

                "listOwnerPerson": {
                    "$ref": "Person.json#/$defs/Person",
                    "description": "The unique identifier for a List Owner. (usually a user or person)",
                    "referencedAttribute": "lists",
                    "relationshipType": "many-to-one"
                },

So basically one person can have multiple lists. In Zendro, we would define the relationship like this:

        "listOwnerPerson": {
            "type": "many_to_one",
            "implementation": "foreignkeys",
            "reverseAssociation": "lists",
            "target": "Person",
            "targetKey": "lists_ids",
            "sourceKey": "list_owner_person_id",
            "keysIn": "List",
            "targetStorageType": "sql"
        }

So our foreign keys are named after the attribute and use id/ids as a suffix, depending on whether the attribute is an array.
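This naming rule can be written down in a few lines. The sketch below reproduces the two key names from the Zendro definition above (list_owner_person_id and lists_ids); the function names are made up for illustration, and attributes containing runs of capitals (acronyms) would need extra handling:

```python
import re


def to_snake_case(name):
    """listOwnerPerson -> list_owner_person"""
    # Insert an underscore before every capital letter except at the start
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def foreign_key_name(attribute, is_array):
    """Name the foreign key after the attribute, with _ids for arrays and _id otherwise."""
    return to_snake_case(attribute) + ("_ids" if is_array else "_id")


print(foreign_key_name("listOwnerPerson", is_array=False))
print(foreign_key_name("lists", is_array=True))
```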


Standardizing a way of defining associations

Currently, BrAPI uses two different ways to define associations. X-to-many associations always have an items tag that holds the reference and a description:

                "observationUnits": {
                    "title": "observationUnits",
                    "description": "observationUnits",
                    "referencedAttribute": "cross",
                    "relationshipType": "one-to-many",
                    "items": {
                        "$ref": "ObservationUnit.json#/$defs/ObservationUnit",
                        "description": "ObservationUnit"
                    },
                    "type": [
                        "null",
                        "array"
                    ]
                }

On the other hand, many-to-X associations do not have this nesting:

                "crossingProject": {
                    "$ref": "CrossingProject.json#/$defs/CrossingProject",
                    "description": "the unique identifier for a crossing project",
                    "referencedAttribute": "crosses",
                    "relationshipType": "many-to-one"
                },

We don't see a benefit in nesting the reference and giving it a separate description. Basically you could define this relationship without nesting, like:

                "observationUnits": {
                    "title": "observationUnits",
                    "description": "observationUnits",
                    "referencedAttribute": "cross",
                    "relationshipType": "one-to-many",
                    "$ref": "ObservationUnit.json#/$defs/ObservationUnit",
                    "type": [
                        "null",
                        "array"
                    ]
                }
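For comparison, this is roughly what a consumer has to do as long as both forms exist: look for the reference on the property itself and fall back to items. The helper association_target is a hypothetical illustration, not converter code:

```python
def association_target(definition):
    """Return the $ref of an association, whether it sits on the property
    itself (many-to-X style) or is nested under `items` (X-to-many style)."""
    ref = definition.get("$ref")
    if ref is None:
        items = definition.get("items")
        if isinstance(items, dict):
            ref = items.get("$ref")
    return ref


nested = {
    "relationshipType": "one-to-many",
    "items": {"$ref": "ObservationUnit.json#/$defs/ObservationUnit"},
    "type": ["null", "array"],
}
flat = {
    "relationshipType": "many-to-one",
    "$ref": "CrossingProject.json#/$defs/CrossingProject",
}
print(association_target(nested))
print(association_target(flat))
```

With a single, un-nested form the fallback branch would no longer be needed.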