Open bussec opened 4 years ago
- Part of the problem is the question whether schema.yaml is a representation of the Data Schema (as described above) or a template for TSV+JSON or both.
@bussec I haven't quite absorbed all of the above, but it seems to me that if schema.yaml doesn't capture our Data Schema AND act as a precise description of our file formats (TSV and JSON) then we aren't doing a good job 8-)
Yes, but also no. The schema we use is JSON Schema, which means it can annotate and validate JSON documents, but it has little to say about file formats, especially TSV, binary formats, and so on. However, the JSON specification describes the file format, so with the combination of the two we have a precise description, but only for JSON. As far as I know, there is no "TSV schema" like the one we have for JSON. Mostly we do what everybody else does, which is to document the TSV with a set of descriptive rules and constraints; in particular, we interpret certain JSON Schema constructs in TSV terms (properties are column names, nullable is a blank string, etc.).
To answer your question @bussec, for JSON we have both, but not for TSV and other formats.
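The TSV convention mentioned above (properties become column names, a nullable value becomes a blank string) can be sketched roughly as follows. This is a minimal illustration and not the AIRR reference implementation: the helper name read_tsv_with_schema and the toy schema fragment are hypothetical, and only those two rules are modeled.

```python
import csv
import io

# Toy schema fragment in the JSON Schema style; the field selection is illustrative.
SCHEMA = {
    "properties": {
        "sequence_id": {"type": "string"},
        "cell_id": {"type": "string", "nullable": True},
    }
}

def read_tsv_with_schema(text, schema):
    """Read TSV rows, applying two conventions: schema properties are
    column names, and a blank string in a nullable column means null."""
    props = schema["properties"]
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    unknown = [c for c in reader.fieldnames if c not in props]
    if unknown:
        raise ValueError(f"columns not in schema: {unknown}")
    rows = []
    for row in reader:
        rows.append({
            col: None if value == "" and props[col].get("nullable") else value
            for col, value in row.items()
        })
    return rows

tsv = "sequence_id\tcell_id\nseq1\tcellA\nseq2\t\n"
rows = read_tsv_with_schema(tsv, SCHEMA)
```

Everything the reader knows about the TSV comes from the JSON-oriented schema plus these hand-written rules, which is the sense in which there is no separate "TSV schema".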
One thing I guess we are missing is a schema for the repertoire metadata file itself. If we want to be complete, maybe we should have something like this:
RepertoireMetadataFile:
  type: object
  properties:
    Info:
      $ref: '#/Info'
      description: Info object
    Repertoire:
      type: array
      description: List of Repertoire objects
      items:
        $ref: '#/Repertoire'
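As a rough illustration of what such a file-level schema would assert, the sketch below checks a parsed repertoire metadata document by hand: a top-level object, an Info object, and a Repertoire array of objects. The function name check_repertoire_metadata_file and the sample documents are hypothetical, and treating Info and Repertoire entries as plain objects is an assumption; a real setup would more likely hand the schema to a JSON Schema validator.

```python
import json

def check_repertoire_metadata_file(doc):
    """Return a list of problems found; an empty list means the structure matches."""
    problems = []
    if not isinstance(doc, dict):
        return ["top level must be an object"]
    # The proposed schema lists no required properties, so only check what is present.
    if "Info" in doc and not isinstance(doc["Info"], dict):
        problems.append("Info must be an object")
    if "Repertoire" in doc:
        reps = doc["Repertoire"]
        if not isinstance(reps, list):
            problems.append("Repertoire must be an array")
        else:
            for i, rep in enumerate(reps):
                if not isinstance(rep, dict):
                    problems.append(f"Repertoire[{i}] must be an object")
    return problems

good = json.loads('{"Info": {"title": "example"}, "Repertoire": [{"repertoire_id": "r1"}]}')
bad = json.loads('{"Info": {}, "Repertoire": {"repertoire_id": "r1"}}')
good_problems = check_repertoire_metadata_file(good)
bad_problems = check_repertoire_metadata_file(bad)
```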
In my perception, the representation of entities across these levels is by and large ok, but it's the relations that give us trouble:
@bussec Yes, and this goes to the purpose of JSON Schema, which is just to describe entities, i.e. singular JSON documents. Specification of relations between JSON documents is out of its scope. If we want to be more formal about the relations, which we probably should (#347), maybe JSON-LD is worth looking into.
It's not formal/rigorous (at least not yet), but I thought on the last call we had said that the Repertoire/metadata file would be responsible for describing relationships: this Rearrangements file(s) goes with that Cell file and those Clones. That seems like exactly what JSON-LD is for, so +1 for that.
Well I did some reading up on JSON-LD. LD stands for Linking Data, which I find misleading. Of course, I assumed that meant it was about linking JSON data objects, but apparently that's not what it does. Or that's one of the things it does but it's more than that. What it really seems to be is a way to semantically resolve data with different syntax, primarily by embedding schema/metadata. An example is the best way to illustrate.
Say we have the concept of a Person. Now every website might implement that concept differently with different data elements. We have Person in MiAIRR with the study data collector (collected_by) and data submitter (submitted_by) fields. Twitter has its own fields to represent a Person, so does Facebook and so forth. So what if we wanted to do queries on Person without having to handle all of the different ways it could be implemented in data elements? Well, that's what JSON-LD does. It is typically paired with a vocabulary such as schema.org, which defines a large ontology of concepts, like Person, and it allows you to annotate (link) the fields on your website to those concepts. Essentially, instead of having just collected_by, we would have a larger JSON object that "describes" what collected_by really is.
Now after saying all that, I'm not completely incorrect either. JSON-LD has IRIs (Internationalized Resource Identifiers) as a core concept. As it says in the specification, "IRIs can often be confused with URLs (Uniform Resource Locators), the primary distinction is that a URL locates a resource on the web, an IRI identifies a resource. While it is a good practice for resource identifiers to be dereferenceable, sometimes this is not practical." Most of the examples I read use IRIs to link to concepts rather than as direct links to other JSON objects, but an IRI is general, since a "resource" can be just about anything.
So we can in essence use JSON-LD to describe our relations, but it's also used to describe entities.
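To make the two roles concrete, here is a toy sketch of JSON-LD-style documents written as Python dicts. The @context maps a local field name to an IRI (describing what a field means), while @id gives a record an IRI that another record can point to (describing a relation). All example.org IRIs, the sample values, and the expand_terms helper are made up for illustration, and the helper only mimics the term-to-IRI mapping step; it is not the JSON-LD expansion algorithm.

```python
# Hypothetical vocabulary IRIs; example.org is a placeholder, not a real AIRR namespace.
CONTEXT = {
    "collected_by": "https://example.org/vocab#collectedBy",
    # "@type": "@id" marks the value of this term as an IRI, i.e. a link.
    "cell": {"@id": "https://example.org/vocab#cell", "@type": "@id"},
}

cell = {"@context": CONTEXT, "@id": "https://example.org/cell/c1"}
rearrangement = {
    "@context": CONTEXT,
    "@id": "https://example.org/rearrangement/r1",
    "collected_by": "Jane Doe",
    "cell": "https://example.org/cell/c1",  # relation expressed as an IRI
}

def expand_terms(doc):
    """Replace local field names with the IRIs the context assigns to them."""
    context = doc.get("@context", {})
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        term = context.get(key, key)  # keywords such as "@id" pass through unchanged
        iri = term["@id"] if isinstance(term, dict) else term
        expanded[iri] = value
    return expanded

expanded = expand_terms(rearrangement)
```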
- Part of the problem is the question whether schema.yaml is a representation of the Data Schema (as described above) or a template for TSV+JSON or both.
Reading this again in light of today's call, it seems to me that the biggest issue to be addressed here is that schema.yaml is currently mostly only a template for the TSV+JSON. Points 4-7 seem to me to be mostly about how we think about implementing points 2 and 3 in the "on-disk" representation. @bussec am I on the right track yet?
@bussec am I on the right track yet?
@scharch Yes!
For me, the representation of relations on disk isn't the direct problem per se. The underlying question is: how do we want (or allow) users to perform operations utilizing these relationships?
For example, with Cell <-> Rearrangements, we can imagine asking two different questions. 1) Given a set of rearrangements, what cells do they belong to, or 2) Given a set of cells, what rearrangements belong to them.
When it comes to the on-disk format, where the relationship is stored determines how easily each question can be answered.
Let's say just cell_id is stored in Rearrangements. That makes 1) easy to answer because you just look at cell_id for your given set of rearrangements. However, 2) is harder because you cannot just look at your Cells to get the answer; you have to search across all of the rearrangements to find the ones with the right cell_id.
Now let's say just the array of rearrangement_id values is stored in Cell. Well, that flips the ease of the questions. Now 1) is harder because you have to search through each Cell's array for the right rearrangement_id, while 2) is easy because the rearrangements are immediately available in the Cell.
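The trade-off between the two layouts can be sketched with toy records. The field name rearrangement_ids for the array stored in Cell is a hypothetical choice, as are the function names; the point is only which questions become a direct read and which become a scan.

```python
# Layout A: the relation lives in Rearrangements (a cell_id on each record).
rearrangements = [
    {"rearrangement_id": "r1", "cell_id": "c1"},
    {"rearrangement_id": "r2", "cell_id": "c1"},
    {"rearrangement_id": "r3", "cell_id": "c2"},
]

# Layout B: the relation lives in Cell (an array of ids on each record).
cells = [
    {"cell_id": "c1", "rearrangement_ids": ["r1", "r2"]},
    {"cell_id": "c2", "rearrangement_ids": ["r3"]},
]

def cells_of_rearrangements_a(selected):
    # Question 1 with layout A: read the field directly.
    return {r["cell_id"] for r in selected}

def rearrangements_of_cells_a(cell_ids, all_rearrangements):
    # Question 2 with layout A: scan every rearrangement.
    return [r["rearrangement_id"] for r in all_rearrangements
            if r["cell_id"] in cell_ids]

def cells_of_rearrangements_b(rearrangement_ids, all_cells):
    # Question 1 with layout B: scan every cell's array.
    return {c["cell_id"] for c in all_cells
            if any(rid in rearrangement_ids for rid in c["rearrangement_ids"])}

def rearrangements_of_cells_b(selected_cells):
    # Question 2 with layout B: read the arrays directly.
    return [rid for c in selected_cells for rid in c["rearrangement_ids"]]
```

In each layout one function touches only the records it was handed, while the other needs the complete collection as an extra argument.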
Do we even know if both questions will be asked, or if one is more common than the other? When we create our on-disk formats, do we want to bias the ease/difficulty of answering one question over the other?
In general, for a 1-to-N relationship in an on-disk format, the (space) efficient representation is to put the relation value within the N-side entity. For example, with Cell <-1-to-N-> Rearrangements, that would imply a single cell_id field in Rearrangements. If you place the relation value on the 1-side entity, you essentially put a table (of size N) in that 1-side entity, i.e. an array of rearrangement_id in Cell.
If we follow that space-efficient design, then we consistently bias toward making 1) easy and 2) hard. If it turns out that 2) is a very common analysis, we inherently make that a computationally expensive process.
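One common way to soften that cost, offered here as an assumption and not something this thread settles on, is to keep only cell_id on disk and build the reverse lookup once in memory after reading the file. The helper name index_by_cell is hypothetical.

```python
from collections import defaultdict

def index_by_cell(rearrangements):
    """One pass over all rearrangements; afterwards question 2 is a dict lookup."""
    index = defaultdict(list)
    for r in rearrangements:
        if r.get("cell_id") is not None:  # rearrangements without a cell are skipped
            index[r["cell_id"]].append(r["rearrangement_id"])
    return dict(index)

rearrangements = [
    {"rearrangement_id": "r1", "cell_id": "c1"},
    {"rearrangement_id": "r2", "cell_id": None},
    {"rearrangement_id": "r3", "cell_id": "c1"},
]
by_cell = index_by_cell(rearrangements)
```

The full scan still happens, but once per file instead of once per query, and the index itself is the size-N table that the on-disk format avoided storing.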
We could, of course, require the relation to be stored on both sides, thus making both questions easy. However, this is where the conflict comes in with the ADC API. The ADC API is necessarily intertwined because the response from an API request is assumed to exactly correspond to the on-disk format.
If we require the relation on both sides, that means we require 1) that users can perform queries on that relationship with either the /cell API or the /rearrangement API and 2) that the data that comes back from the /cell and /rearrangement API contains that relationship.
Why is that a problem? Well, technically it shouldn't be, because that relationship should be there when the data is loaded into the ADC. However, that's not the reality. The reality is that rearrangement data is generated and loaded first, without relationship info to clones, cells, etc. Only later will clone data be generated. That clone data will have the relationship to the rearrangements and thus can be loaded, but to put the relationship on the other side, in the rearrangements, requires updating and/or reloading all of the rearrangement data.
Oh, and that space efficient design above? That's on the rearrangement side so yes, that still requires updating and/or reloading all of the rearrangement data.
Thus, as a practical matter for the ADC, having the relationship on just the one side, to make 2) easy, is also easier from a data management perspective.
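The update step in question can be sketched as follows, assuming rearrangements are already stored and clone records arrive later carrying the relationship. The rearrangement_ids field on the clone record and the backfill_clone_ids helper are hypothetical names, and the in-memory dict stands in for whatever storage a repository actually uses.

```python
def backfill_clone_ids(rearrangements_by_id, clones):
    """Copy each clone's link onto its rearrangements; return how many records changed."""
    updated = 0
    for clone in clones:
        for rid in clone["rearrangement_ids"]:  # hypothetical field name
            record = rearrangements_by_id[rid]
            if record.get("clone_id") != clone["clone_id"]:
                record["clone_id"] = clone["clone_id"]
                updated += 1
    return updated

# Rearrangements were loaded first, without any clone information.
store = {
    "r1": {"rearrangement_id": "r1"},
    "r2": {"rearrangement_id": "r2"},
    "r3": {"rearrangement_id": "r3"},
}
# Clone data arrives later and carries the relationship.
clones = [{"clone_id": "k1", "rearrangement_ids": ["r1", "r3"]}]
n_updated = backfill_clone_ids(store, clones)
```

Every rearrangement that belongs to a clone has to be rewritten, which is the data management cost being weighed here.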
So I suppose the last question is for @bcorrie and others who run repositories: are you okay with updating/reloading all of your rearrangement data? Is that just part of the cost of running a repository?
I am not sure what is the expected or hoped for output of this discussion, but it definitely doesn't feel like a v2.0 discussion - unless some of it is resolved in the Manifest or RepertoireGroup discussion. 8-)
It was an interesting read at least 8-)
Or maybe it is resolved, given that we have _id fields that link our various schema objects - for better or for worse as they are now.
As to what do we do when we load Rearrangement/Clone/Cell/Expression data - yes we update the internal linkages between all of the relevant entities!!!
https://github.com/sfu-ireceptor/turnkey-service-php#resolving-internal-data-linkages
So yes, it is a cost of loading data into a repository. The good news is you only have to do it when you load the data.
I am not sure what is the expected or hoped for output of this discussion, but it definitely doesn't feel like a v2.0 discussion - unless some of it is resolved in the Manifest or RepertoireGroup discussion. 8-)
Some part of it will be resolved by Manifest, but maybe not the whole thing...
The underlying issue is that JSON Schema does not have an explicit representation for relations between objects other than inheritance and composition. This is unlike, for example, LinkML, where you can specify what a relation points to (the range) and the cardinality of the relation.
So from that sense we are not going to solve this other than decide upon a convention and best practices...
So from that sense we are not going to solve this other than decide upon a convention and best practices...
From a "conventions and best practices" perspective, is this not already solved? We have designated _id fields that are links between the objects, and the cardinality of those relationships is documented here: https://docs.airr-community.org/en/stable/datarep/overview.html#relationship-between-schema-objects
So it seems to me like the "only" things missing are:
Manifest)
is #672 relevant here?
is #672 relevant here?
Possibly, in the JSON relationship sense, but that doesn't seem like a v2.0 thing that we are likely to resolve.
_id fields that link schema objects
For 2.0 do we want anything more?
is #672 relevant here?
It is a convention but probably not one we should follow. I haven't seen any followup that suggests this is becoming commonly used, or supported by more tools, etc.
As noted, JSON Schema itself does not have an explicit representation of relations between objects. This is something that the AKC project will be defining by using LinkML, so we will see which mechanisms we can utilize when integrating back with AIRR standards.
These thoughts are based on our discussions around #409 and other single-cell stuff. However, as they are rather generic in nature, I created a new ticket. As I assume that issues like these will come up in the future, my question is whether we can come up with default strategies for how to address them.
There are various levels of representation that need to be consistent with each other:
- schema.yaml, which is the representation of the Data Schema in a file under the constraints of OpenAPI. In here, entities are represented as objects and relations as _id attributes to these objects. Whether a given _id is a primary or foreign key is not explicitly specified, but naming convention (e.g., primary key is always <object_name>_id) should be good enough for now.
- TSV+JSON representation aka the "DataRep standard". Here entities become rows (for TSV) or records (JSON).
In my perception, the representation of entities across these levels is by and large ok, but it's the relations that give us trouble:
- Part of the problem is the question whether schema.yaml is a representation of the Data Schema (as described above) or a template for TSV+JSON or both.
- TSV+JSON level MUST be sufficient to reconstruct all relations potentially present on the abstract level. While doing this, the representation SHOULD be as simple as possible.
- Cell record referencing a Rearrangement via its rearrangement_id without a reciprocal reference being present in the Rearrangement) on the TSV+JSON is sufficient to fulfill point (2). Bidirectional linkage creates potential problems of consistency and thus SHOULD be avoided. There MAY be exceptions to this based on performance (but see (4) first).
- TSV+JSON. In addition, not all implementors will be aware of the full repercussions of their chosen design. For once Freedom is Slavery, therefore we should settle on doing this in only a single way.
):