Representation of relations in on-disk format

bussec commented 4 years ago

These thoughts are based on our discussions around #409 and other single-cell stuff. However, as they are rather generic in nature, I created a new ticket. As I assume that issues like these will come up in the future, my question is whether we can come up with default strategies for how to address them.

There are various levels of representation that need to be consistent with each other:

- The abstract Data Schema. This is basically a graph in which nodes are entities and edges are relations between them (for simplification I do not distinguish between entities and concepts, as they are the same at this level).
- The schema.yaml, which is the representation of the Data Schema in a file under the constraints of OpenAPI. In here, entities are represented as objects and relations as _id attributes of these objects. Whether a given _id is a primary or foreign key is not explicitly specified, but a naming convention (e.g., the primary key is always <object_name>_id) should be good enough for now.
- The TSV+JSON representation, aka the "DataRep standard". Here entities become rows (for TSV) or records (for JSON).

In my perception, the representation of entities across these levels is by and large ok, but it's the relations that give us trouble:

1. Part of the problem is the question whether schema.yaml is a representation of the Data Schema (as described above) or a template for TSV+JSON or both.
2. The representation of relations on the TSV+JSON level MUST be sufficient to reconstruct all relations potentially present on the abstract level. While doing this, the representation SHOULD be as simple as possible.
3. Unidirectional linkage between entities (e.g., a Cell record referencing a Rearrangement via its rearrangement_id without a reciprocal reference being present in the Rearrangement) on the TSV+JSON level is sufficient to fulfill point (2). Bidirectional linkage creates potential problems of consistency and thus SHOULD be avoided. There MAY be exceptions to this based on performance (but see (4) first).
4. If we accept multiple ways to represent relations (i.e., referencing from A to B, or B to A, or both), we create variants, the number of which will grow exponentially with the number of entities. This creates numerous opportunities to generate ambiguous output and would therefore put the burden on the software that reads and writes the TSV+JSON. In addition, not all implementors will be aware of the full repercussions of their chosen design. For once Freedom is Slavery, therefore we should settle on doing this in only a single way.
5. It is unlikely that there will be any general rule to always represent relations in a specific file or file type (e.g., "always in the JSON" vs. "always in the TSV"), as the best implementation depends on multiple factors:
   - type of relation (1:1, 1:N, N:N)
   - restrictions on multiple references based on format or parsing performance
   - types of the linked entities (local vs. global, see (6))
6. Entities come in different types, which are sometimes not directly obvious to us (see discussions around Receptor):
   - A local entity is something that exists within a study because it was somehow observed with its context and can only meaningfully be interpreted within this context.
   - A global entity is something abstract that exists independent of an actual observation (it is probably the same thing as a concept within an ontology, but I am not 100% sure about this). It should be noted that while it MAY be stored remotely, this is not necessarily the case; however, even local storage does not change its global type.
   - local <-> global relations should be represented in the local entity.
7. Finally, for global entities in remote repositories we need to consider whether they can be created and/or modified there (this is the same problem as for ontologies). For relations, it does raise the question of the temporary local write-caching of global entities and the update procedures once the entity is accepted to the remote repository. If we do not want to bother with a local write-cache, then we need to accept that we cannot provide the complete information (as some global objects might not exist), i.e., some relations will be NULL-ed on the side of the local entity.
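
The consistency risk of bidirectional linkage described above can be illustrated with a minimal Python sketch. The record layout used here (a Cell holding a `rearrangement_ids` list and a Rearrangement holding a `cell_id`) and the function name are assumptions made for illustration only, not the AIRR on-disk format.

```python
def find_inconsistencies(cells, rearrangements):
    """Return (cell_id, rearrangement_id) pairs asserted by only one side."""
    # Links as stated by the Cell records (hypothetical `rearrangement_ids` list).
    from_cells = {
        (cell["cell_id"], rid)
        for cell in cells
        for rid in cell["rearrangement_ids"]
    }
    # Links as stated by the Rearrangement records (hypothetical `cell_id` field).
    from_rearrangements = {
        (rec["cell_id"], rec["rearrangement_id"])
        for rec in rearrangements
        if rec["cell_id"] is not None
    }
    # Any pair present on one side only is an inconsistency.
    return sorted(from_cells ^ from_rearrangements)


cells = [{"cell_id": "c1", "rearrangement_ids": ["r1", "r2"]}]
rearrangements = [
    {"rearrangement_id": "r1", "cell_id": "c1"},
    {"rearrangement_id": "r2", "cell_id": None},  # reciprocal link never written
]
mismatches = find_inconsistencies(cells, rearrangements)
```

With unidirectional linkage there is only one set of links, so a check of this kind is not needed in the first place.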

bcorrie commented 4 years ago

Part of the problem is the question whether schema.yaml is a representation of the Data Schema (as described above) or a template for TSV+JSON or both.

@bussec I haven't quite absorbed all of the above, but it seems to me that if schema.yaml doesn't capture our Data Schema AND act as a precise description of our file formats (TSV and JSON) then we aren't doing a good job 8-)

schristley commented 4 years ago

Part of the problem is the question whether schema.yaml is a representation of the Data Schema (as described above) or a template for TSV+JSON or both.

@bussec I haven't quite absorbed all of the above, but it seems to me that if schema.yaml doesn't capture our Data Schema AND act as a precise description of our file formats (TSV and JSON) then we aren't doing a good job 8-)

Yes, but also no. The schema we use is JSON Schema, which means it can annotate and validate JSON documents, but it has little to say about file formats, especially TSV, binary formats, and so on. However, the JSON specification describes the file format, so with the combination of the two we have a precise description, but only for JSON. As far as I know, there's no "TSV schema" like we have for JSON. Mostly we do what everybody else does, which is to document the TSV with a set of descriptive rules and constraints; in particular, we interpret certain JSON Schema constructs (properties are column names, nullable is blank string, etc.).
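
As a rough illustration of those TSV rules (properties are column names, nullable is blank string), here is a minimal Python sketch. The `schema` dictionary is a simplified, hypothetical fragment written for this example; it is not the actual layout of schema.yaml, and `read_tsv` is not a function of any AIRR library.

```python
import csv
import io

# Hypothetical, simplified schema fragment: property name -> type and nullability.
schema = {
    "properties": {
        "sequence_id": {"type": "string", "nullable": False},
        "cell_id": {"type": "string", "nullable": True},
        "junction_length": {"type": "integer", "nullable": True},
    }
}


def read_tsv(text, schema):
    """Parse TSV text, treating properties as columns and blank strings as null."""
    props = schema["properties"]
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row = {}
        for name, value in raw.items():
            spec = props.get(name, {})
            if value == "":
                # A blank cell stands for null; reject it for non-nullable fields.
                if not spec.get("nullable", True):
                    raise ValueError(f"{name} must not be blank")
                row[name] = None
            elif spec.get("type") == "integer":
                row[name] = int(value)
            else:
                row[name] = value
        rows.append(row)
    return rows


tsv_text = "sequence_id\tcell_id\tjunction_length\nseq1\t\t45\n"
records = read_tsv(tsv_text, schema)
```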

To answer your question @bussec , for JSON we have both, but not for TSV and other formats.

One thing I guess we are missing is a schema for the repertoire metadata file itself. Maybe we should have something like this if we want to be complete:

RepertoireMetadataFile:
    type: object
    properties:
        Info:
            $ref: '#/Info'
            description: Info object
        Repertoire:
            type: array
            description: List of Repertoire objects
            items:
                $ref: '#/Repertoire'
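
A minimal sketch of what that proposed definition would check, written in plain Python without a JSON Schema validator. The function name is hypothetical, and the check only covers the top-level structure (an optional `Info` object and an optional `Repertoire` array), not the contents of the referenced objects.

```python
def check_repertoire_metadata(doc):
    """Return a list of structural problems for a parsed metadata document."""
    if not isinstance(doc, dict):
        return ["top level must be an object"]
    errors = []
    # The proposal lists no required properties, so both keys are optional here.
    if "Info" in doc and not isinstance(doc["Info"], dict):
        errors.append("Info must be an object")
    if "Repertoire" in doc and not isinstance(doc["Repertoire"], list):
        errors.append("Repertoire must be an array")
    return errors


ok_errors = check_repertoire_metadata({"Info": {}, "Repertoire": []})
bad_errors = check_repertoire_metadata({"Repertoire": {}})
```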

schristley commented 4 years ago

In my perception, the representation of entities across these levels is by and large ok, but it's the relations that give us trouble:

@bussec Yes, and this goes to the purpose of JSON Schema, which is just to describe entities, i.e. singular JSON documents. Specification of relations between JSON documents is out of its scope. If we want to be more formal about the relations, which we probably should (see #347), maybe JSON-LD is worth looking into.

scharch commented 4 years ago

It's not formal/rigorous (at least not yet), but I thought on the last call we had said that the Repertoire/metadata file would be responsible for describing relationships: this Rearrangements file(s) goes with that Cell file and those Clones. That seems like exactly what JSON-LD is for, so +1 for that.

schristley commented 4 years ago

Well I did some reading up on JSON-LD. LD stands for Linking Data, which I find misleading. Of course, I assumed that meant it was about linking JSON data objects, but apparently that's not what it does. Or that's one of the things it does but it's more than that. What it really seems to be is a way to semantically resolve data with different syntax, primarily by embedding schema/metadata. An example is the best way to illustrate.

Say we have the concept of a Person. Now every website might implement that concept differently with different data elements. We have Person in MiAIRR with the study data collector (collected_by) and data submitter (submitted_by) fields. Twitter has its own fields to represent a Person, so does Facebook, and so forth. So what if we wanted to do queries on Person without having to handle all of the different ways it could be implemented in data elements? Well, that's what JSON-LD does. It is typically used together with a shared vocabulary such as schema.org, which defines a large ontology of concepts, like Person, and it allows you to annotate (link) the fields on your website to those concepts. Essentially, instead of having just collected_by, we would have a large JSON object that "describes" what collected_by really is.

Now after saying all that, I'm not completely incorrect either. JSON-LD has IRIs (Internationalized Resource Identifiers) as a core concept. As it says in the specification, "IRIs can often be confused with URLs (Uniform Resource Locators), the primary distinction is that a URL locates a resource on the web, an IRI identifies a resource. While it is a good practice for resource identifiers to be dereferenceable, sometimes this is not practical." Most of the examples I read show IRIs being used to link to concepts rather than as direct links to other JSON objects, but an IRI is general, as a "resource" can be just about anything.

So we can in essence use JSON-LD to describe our relations, but it's also used to describe entities.
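
To make the idea of an IRI pointing at another object concrete, here is a small sketch of a JSON-LD-style document built in Python. All `https://example.org/...` IRIs and the `cell` term are placeholders invented for this example. The helper below is not a JSON-LD processor; it only reads the terms that the context coerces to `@id`, which is the JSON-LD way of saying "this value is an IRI, not a plain string".

```python
import json

# Hypothetical document: a rearrangement that links to a cell by IRI.
doc = {
    "@context": {
        # Term definition: values of "cell" are identifiers of other resources.
        "cell": {"@id": "https://example.org/airr/cell", "@type": "@id"},
    },
    "@id": "https://example.org/rearrangement/r1",
    "cell": "https://example.org/cell/c1",
}


def linked_iris(doc):
    """List (term, IRI) pairs for terms whose context entry coerces to '@id'."""
    context = doc.get("@context", {})
    return [
        (term, doc[term])
        for term, definition in context.items()
        if isinstance(definition, dict)
        and definition.get("@type") == "@id"
        and term in doc
    ]


links = linked_iris(doc)
serialized = json.dumps(doc, indent=2)
```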

scharch commented 4 years ago

Part of the problem is the question whether schema.yaml is a representation of the Data Schema (as described above) or a template for TSV+JSON or both.

Reading this again in light of today's call, it seems to me that the biggest issue to be addressed here is that schema.yaml is currently mostly only a template for the TSV+JSON. Points 4-7 seem to me to be mostly about how we think about implementing points 2 and 3 in the "on-disk" representation. @bussec am I on the right track yet?

bussec commented 4 years ago

@bussec am I on the right track yet?

@scharch Yes!

schristley commented 4 years ago

For me, the representation of relations on disk isn't the direct problem per se. The underlying question is how do we want (or allow) users to perform operations utilizing these relationships?

For example, with Cell <-> Rearrangements, we can imagine asking two different questions. 1) Given a set of rearrangements, what cells do they belong to, or 2) Given a set of cells, what rearrangements belong to them.

When it comes to the on-disk format, where the relationship is stored determines how easily each question can be answered.

Let's say just cell_id is stored in Rearrangements. That makes 1) easy to answer because you just look at cell_id for your given set of rearrangements. However 2) is harder because you cannot just look at your Cells to get the answer, you have to search across all of the rearrangements to find the ones with the right cell_id.

Now let's say just the array of rearrangement_ids is stored in Cell. Well that flips the ease of the questions. Now 1) is harder because you have to search through each Cell's array for the right rearrangement_id, while 2) is easy because the rearrangements are immediately available in the Cell.
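
A minimal Python sketch of the first case, where only cell_id is stored in the rearrangements. The records and function names are made up for illustration, and the field names follow the naming used in this thread. Question 1) is a field read, question 2) is a scan over all rearrangements, and a reader can build an index once if 2) is asked often.

```python
from collections import defaultdict

# Hypothetical rearrangement records that carry only the cell_id link.
rearrangements = [
    {"rearrangement_id": "r1", "cell_id": "c1"},
    {"rearrangement_id": "r2", "cell_id": "c1"},
    {"rearrangement_id": "r3", "cell_id": "c2"},
]


def cells_of(selected):
    """Question 1: read cell_id directly from the given rearrangements."""
    return {rec["cell_id"] for rec in selected}


def rearrangements_of(cell_ids, all_rearrangements):
    """Question 2: scan every rearrangement for a matching cell_id."""
    wanted = set(cell_ids)
    return [
        rec["rearrangement_id"]
        for rec in all_rearrangements
        if rec["cell_id"] in wanted
    ]


def build_cell_index(all_rearrangements):
    """One pass that derives the inverse mapping, cell_id -> rearrangement ids."""
    index = defaultdict(list)
    for rec in all_rearrangements:
        index[rec["cell_id"]].append(rec["rearrangement_id"])
    return dict(index)


cell_index = build_cell_index(rearrangements)
```

The index is derived data held by the reader, so the on-disk format still stores the relation on one side only.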

Do we even know if both questions will be asked, or if one is more common than the other? When we create our on-disk formats, do we want to bias the ease/difficulty of answering one question over the other?

In general, for a 1-to-N relationship in an on-disk format, the (space) efficient representation is to put the relation value within the N-side entity. For example, with Cell <- 1 - to - N -> Rearrangements, that would imply a single cell_id field in Rearrangements. If you place the relation value on the 1-side entity, you essentially put a table (of size N) in that 1-side entity, i.e. an array of rearrangement_id in Cell.

If we follow that space efficient design, then we consistently bias making 1) easy and 2) hard. If it ends up that 2) is a very common analysis option, we inherently make that a computationally expensive process.

We could, of course, require the relation to be stored on both sides, thus making both questions easy. However, this is where the conflict comes in with the ADC API. The ADC API is necessarily intertwined because the response from an API request is assumed to exactly correspond to the on-disk format.

If we require the relation on both sides, that means we require 1) that users can perform queries on that relationship with either the /cell API or the /rearrangement API and 2) that the data that comes back from the /cell and /rearrangement API contains that relationship.

Why is that a problem? Well, technically it shouldn't be, because that relationship should be there when the data is loaded into the ADC. However, that's not the reality. The reality is that rearrangement data is generated and loaded first, without relationship info to clones, cells, etc. Only later will clone data be generated. That clone data will have the relationship to the rearrangements and thus can be loaded, but to put the relationship on the other side, in the rearrangements, requires updating and/or reloading all of the rearrangement data.
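
The update cost can be sketched as follows, assuming (for illustration only) that clone records arrive with a `rearrangement_ids` list and that the repository wants a `clone_id` on each rearrangement. The function name and record layout are hypothetical; a real repository would do this in its database rather than in memory.

```python
def backfill_clone_ids(rearrangements, clones):
    """Write clone_id into already-loaded rearrangements; return records touched."""
    by_id = {rec["rearrangement_id"]: rec for rec in rearrangements}
    updated = 0
    for clone in clones:
        for rid in clone["rearrangement_ids"]:
            record = by_id.get(rid)
            # Every member rearrangement of every clone has to be rewritten.
            if record is not None and record.get("clone_id") != clone["clone_id"]:
                record["clone_id"] = clone["clone_id"]
                updated += 1
    return updated


loaded = [
    {"rearrangement_id": "r1", "clone_id": None},
    {"rearrangement_id": "r2", "clone_id": None},
    {"rearrangement_id": "r3", "clone_id": None},
]
new_clones = [{"clone_id": "k1", "rearrangement_ids": ["r1", "r3"]}]
touched = backfill_clone_ids(loaded, new_clones)
```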

Oh, and that space efficient design above? That's on the rearrangement side so yes, that still requires updating and/or reloading all of the rearrangement data.

Thus, as a practical matter for the ADC, having the relationship on just the one side, to make 2) easy, is also easier from a data management perspective.

So I suppose the last question is to @bcorrie and others who run repositories, are you okay with updating/reloading all of your rearrangement data? Is that just part of the cost of running a repository?

bcorrie commented 9 months ago

I am not sure what is the expected or hoped for output of this discussion, but it definitely doesn't feel like a v2.0 discussion - unless some of it is resolved in the Manifest or RepertoireGroup discussion. 8-)

It was an interesting read at least 8-)

Or maybe it is resolved, given that we have _id fields that link our various schema objects - for better or for worse as they are now.

bcorrie commented 9 months ago

As to what do we do when we load Rearrangement/Clone/Cell/Expression data - yes we update the internal linkages between all of the relevant entities!!!

https://github.com/sfu-ireceptor/turnkey-service-php#resolving-internal-data-linkages

So yes, it is a cost of loading data into a repository. The good news is you only have to do it when you load the data.

scharch commented 9 months ago

I am not sure what is the expected or hoped for output of this discussion, but it definitely doesn't feel like a v2.0 discussion - unless some of it is resolved in the Manifest or RepertoireGroup discussion. 8-)

Some part of it will be resolved by Manifest, but maybe not the whole thing...

schristley commented 9 months ago

The underlying issue is that JSON Schema does not have an explicit representation for relations between objects other than inheritance and composition. This is unlike, for example, LinkML, where you can specify what a relation points to (the range) and its cardinality.

So from that sense we are not going to solve this other than decide upon a convention and best practices...

bcorrie commented 9 months ago

So from that sense we are not going to solve this other than decide upon a convention and best practices...

From a "conventions and best practices" perspective, is this not already solved? We have designated _id fields that are links between the objects, and the cardinality of those relationships is documented here: https://docs.airr-community.org/en/stable/datarep/overview.html#relationship-between-schema-objects

So it seems to me like the "only" things missing are:

A formal way of encoding what those expected relationships are in the specification
A way of grouping things on disk so that you know which entities belong together for analysis (Manifest)
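
As a sketch of what the first missing item could look like, the following Python encodes one expected relationship as data and uses it to find references that point at nothing. The `Relation` class, its fields, and the single relation listed are hypothetical and are not part of the AIRR specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str      # object type that holds the reference
    field: str       # the _id field on the source object
    target: str      # object type being referenced
    target_key: str  # identifier field on the target object
    many: bool       # True if the field holds a list of identifiers


def dangling(relation, data):
    """Return referenced identifiers that have no matching target record."""
    known = {rec[relation.target_key] for rec in data[relation.target]}
    missing = []
    for rec in data[relation.source]:
        value = rec.get(relation.field)
        ids = value if relation.many else [value]
        missing += [i for i in (ids or []) if i is not None and i not in known]
    return missing


# Hypothetical declaration: each Rearrangement points at no more than one Cell.
rearrangement_to_cell = Relation("Rearrangement", "cell_id", "Cell", "cell_id", False)

data = {
    "Cell": [{"cell_id": "c1"}],
    "Rearrangement": [
        {"sequence_id": "s1", "cell_id": "c1"},
        {"sequence_id": "s2", "cell_id": "c9"},  # no such Cell in this data set
        {"sequence_id": "s3", "cell_id": None},  # null link is allowed
    ],
}
unresolved = dangling(rearrangement_to_cell, data)
```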

scharch commented 9 months ago

is #672 relevant here?

bcorrie commented 9 months ago

is #672 relevant here?

Possibly, in the JSON relationship sense, but that doesn't seem like a v2.0 thing that we are likely to resolve.

We have on-disk file formats for our schema objects
We have _id fields that link schema objects
We have documentation on the intended cardinality of the relationships
We have a mechanism for grouping files that store different types of data into a data set (Manifest)

For 2.0 do we want anything more?

schristley commented 9 months ago

is #672 relevant here?

It is a convention but probably not one we should follow. I haven't seen any followup that suggests this is becoming commonly used, or supported by more tools, etc.

schristley commented 9 months ago

As noted, JSON Schema itself does not have an explicit representation of relations between objects. This is something that the AKC project will be defining by using LinkML, so we will see which mechanisms we can utilize when integrating back with the AIRR standards.

airr-community / airr-standards

Representation of relations in on-disk format #439