can we have a property `rearrangement` in the `SampleProcessing` object?

gszep commented 1 year ago

I would like to request a rearrangement field in SampleProcessing as follows

SampleProcessing:
    discriminator: AIRR
    allOf:
        - type: object
          properties:
              sample_processing_id:
                  type: string
                  description: >
                      Identifier for the sample processing object. This field should be unique within the repertoire.
                      This field can be used to uniquely identify the combination of sample, cell processing,
                      nucleic acid processing and sequencing run information for the repertoire.
                  title: Sample processing ID
                  x-airr:
                      nullable: true
                      name: Sample processing ID
                      adc-query-support: true
                      identifier: true
              rearrangement:
                  type: array
                  description: List of rearrangement records
                  items:
                      $ref: '#/Rearrangement'
                  x-airr:
                      nullable: false
        - $ref: '#/Sample'
        - $ref: '#/CellProcessing'
        - $ref: '#/NucleicAcidProcessing'
        - $ref: '#/SequencingRun'

This way each sample is linked to a rearrangement in a 1 to 1 relationship 🙏🏼

gszep commented 1 year ago

Happy new year! ⭐ Any thoughts on this?

bussec commented 1 year ago

@gszep your request is not in line with the current architecture of the AIRR Schema:

SampleProcessing is referenced by Repertoire
A Repertoire is defined as a unique combination of a study, subject, sample (i.e. SampleProcessing) and data processing (represented by the respectively named properties in Repertoire).
Information about data processing is at a lower hierarchical level than information on sample processing
Therefore the same raw sequences (the results of a SequencingRun, which have no direct representation in the Schema) can be present in multiple repertoires, depending on the data processing.
Therefore SampleProcessing should not contain information on Rearrangements

schristley commented 1 year ago

Hi @gszep It might be helpful for you to explain what you are trying to accomplish, and we can describe how to do that with the current AIRR data model.

> This way each sample is linked to a rearrangement in a 1 to 1 relationship 🙏🏼

This relationship already exists, except it is in the rearrangement record. In the rearrangement, the repertoire_id and sample_processing_id uniquely define the SampleProcessing for that rearrangement.
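A minimal sketch of that lookup, assuming the repertoire metadata is already loaded as Python dictionaries. The identifier values and the helper name `find_sample_processing` are hypothetical; `repertoire_id`, `sample_processing_id`, `sample` and `sample_id` are AIRR schema field names.

```python
# Hypothetical, hand-written records; only the field names come from the AIRR schema.
repertoires = [
    {
        "repertoire_id": "rep-1",
        "sample": [
            {"sample_processing_id": "sp-1", "sample_id": "day0"},
            {"sample_processing_id": "sp-2", "sample_id": "day7"},
        ],
    }
]

rearrangement = {
    "sequence_id": "seq-42",
    "repertoire_id": "rep-1",
    "sample_processing_id": "sp-2",
}


def find_sample_processing(rearrangement, repertoires):
    """Return the SampleProcessing entry a rearrangement points to, or None."""
    for repertoire in repertoires:
        if repertoire["repertoire_id"] != rearrangement["repertoire_id"]:
            continue
        for sample in repertoire["sample"]:
            if sample["sample_processing_id"] == rearrangement["sample_processing_id"]:
                return sample
    return None


sample = find_sample_processing(rearrangement, repertoires)
print(sample["sample_id"])
```

The link is resolved from the rearrangement side, so nothing has to be added to SampleProcessing for this to work.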

gszep commented 1 year ago

@bussec A Repertoire appears to have a field which is of type List[SampleProcessing]. Can you clarify how one would store longitudinal samples at different timepoints from the same subject? As multiple samples in List[SampleProcessing] under the same repertoire or does each sample get its own Repertoire? What about different sequencing runs from the same biological sample?

gszep commented 1 year ago

@schristley I am writing an AIRR-compliant HDF5 file format where

metadata are stored as attributes and groups
rearrangement fields are stored as one dimensional datasets

This facilitates lazy, parallel, constant-time random access to metadata and rearrangement data. Since the metadata and rearrangements are stored together (HDF5 must be self-describing), I need somewhere to place the rearrangement data (as a field accessible within Repertoire). No joins are necessary as this file format is lazy.
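To make the layout concrete, here is a rough plain-Python stand-in, not the HDF5 API: a dict plays the role of a group, an `attrs` dict holds metadata, and each rearrangement field is a one-dimensional column. The key names `attrs` and `subject.sex`, the helper `read_row` and all values are assumptions made for illustration.

```python
# Plain-Python stand-in for the layout described above (NOT the HDF5 API):
# a "group" is a dict, "attrs" holds metadata, and each rearrangement
# field is stored as its own one-dimensional column.
repertoire_group = {
    "attrs": {"repertoire_id": "rep-1", "subject.sex": "female"},
    "rearrangement": {
        "sequence_id": ["seq-0", "seq-1", "seq-2"],
        "v_call": ["IGHV1-69*01", "IGHV3-23*01", "IGHV4-34*01"],
        "junction_aa": ["CARDYW", "CAKGGW", "CARRFW"],
    },
}


def read_row(group, index):
    """Assemble one rearrangement record by indexing every column at the same position."""
    return {field: column[index] for field, column in group["rearrangement"].items()}


row = read_row(repertoire_group, 1)
print(row["v_call"])
```

Reading a row touches one position per column, which is the random-access pattern meant here; in this stand-in the row position, not `sequence_id`, is the index.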

bcorrie commented 1 year ago

> @bussec A Repertoire appears to have a field which is of type List[SampleProcessing]. Can you clarify how one would store longitudinal samples at different timepoints from the same subject? As multiple samples in List[SampleProcessing] under the same repertoire or does each sample get its own Repertoire? What about different sequencing runs from the same biological sample?

The AIRR Standard was designed to be able to capture both cases, with the structure of the Repertoire object flexible enough to do that. The original design was done this way because we did not want to dictate the structural relationships between these objects and instead leave it up to the study designer and/or data curator. The Repertoire object currently has two definitions/uses (yes, this is a flaw in the design). It was initially thought of in the more biological sense, where it would represent all of the B cells/T cells in a single subject at a given time point. But it is also used as a general grouping of samples in a variety of different ways.

So at least from a standards definition you can do what makes sense to you. For example, in the iReceptor repositories we have a 1:1 relationship between Repertoire and Sample (the List[SampleProcessing] is always of length 1). We do this because of its simplicity.

If you are writing code that handles general AIRR Repertoire data as input, you have to handle all cases - which can be quite challenging...
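A small sketch of what handling both shapes can look like, assuming repertoires are loaded as dictionaries whose `sample` field is a list of SampleProcessing entries. The helper name `iter_sample_processing` and the example records are hypothetical.

```python
def iter_sample_processing(repertoire):
    """Yield (repertoire_id, position, sample) for each SampleProcessing entry."""
    # Treat a missing or null `sample` field as an empty list.
    for position, sample in enumerate(repertoire.get("sample") or []):
        yield repertoire["repertoire_id"], position, sample


# 1:1 case: the list always has length 1.
one_to_one = {"repertoire_id": "rep-A", "sample": [{"sample_id": "s1"}]}

# Grouped case: several samples, e.g. time points, under one repertoire.
grouped = {
    "repertoire_id": "rep-B",
    "sample": [{"sample_id": "day0"}, {"sample_id": "day7"}],
}

for repertoire in (one_to_one, grouped):
    for repertoire_id, position, sample in iter_sample_processing(repertoire):
        print(repertoire_id, position, sample["sample_id"])
```

Always iterating over the list, rather than reading only its first element, is what keeps the same code correct for both conventions.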

scharch commented 1 year ago

> Can you clarify how one would store longitudinal samples at different timepoints from the same subject?

This is what RepertoireGroup is for. It is still experimental and will be revised/further developed in the (near-ish?) future, but the intent is exactly to solve this kind of problem.

schristley commented 1 year ago

> @schristley I am writing an AIRR-compliant HDF5 file format where

This is interesting. I've used HDF5 a few times but only as a user, and only with small datasets. How large can HDF5 files become, multi TB?

I know the HD means hierarchical data, but I'm not sure what flexibility it has to represent data models. Is it as flexible as JSON? Does HDF5 prefer data to be in normal form, like in an SQL database?

As there is no AIRR standard HDF5 file format, you cannot technically be AIRR-compliant ;-D but I assume you mean to be "compliant" with the AIRR Data Model. The AIRR Data Model is flexible enough that it can be optimized for different file formats. Can HDF5 represent compound indexes like SQL? That is, most AIRR objects require at least two identifiers, for rearrangements that's repertoire_id and sequence_id.

> metadata are stored as attributes and groups

Ok, I don't know what HDF5 attributes and groups are, but a quick google shows me that this looks like best practice. How are you handling the JSON? Are you using some tool like this or is there a better representation?

> rearrangement fields are stored as one dimensional datasets

one dimensional? hmm, the sequence_id is the dimension?

> This facilitates lazy, parallel constant-time random access to metadata rearrangement data.

This sounds like a hash function. Also, random access implies lookup by sequence_id.

The sequence_id is not globally unique. According to AIRR it is only unique within a repertoire, i.e. within a repertoire_id, which is why you need both identifiers.

However, using just the two identifiers makes one assumption, which is that all repertoires only have a single DataProcessing object. If a repertoire has multiple data processings, they are essentially duplicates of each other (same initial data run through different tools), and you don't normally want them mixed, so that would require using the data_processing_id in rearrangements to distinguish them. This is pretty rare and still an active topic for AIRR standards.
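The identifier logic above can be sketched as a compound key. This is only an illustration: the helper `rearrangement_key`, its `with_data_processing` flag and the record values are hypothetical, while `repertoire_id`, `sequence_id` and `data_processing_id` are AIRR rearrangement fields.

```python
# Hypothetical records: the same sequence_id appears in two repertoires,
# and rep-1 has been run through two data processings.
records = [
    {"repertoire_id": "rep-1", "sequence_id": "seq-1", "data_processing_id": "dp-a"},
    {"repertoire_id": "rep-2", "sequence_id": "seq-1", "data_processing_id": "dp-a"},
    {"repertoire_id": "rep-1", "sequence_id": "seq-1", "data_processing_id": "dp-b"},
]


def rearrangement_key(record, with_data_processing=False):
    """Build a compound key; optionally extend it with data_processing_id."""
    key = (record["repertoire_id"], record["sequence_id"])
    if with_data_processing:
        key = key + (record.get("data_processing_id"),)
    return key


two_part = {rearrangement_key(r) for r in records}
three_part = {rearrangement_key(r, with_data_processing=True) for r in records}
print(len(two_part), len(three_part))  # 2 3
```

With the two-part key, the two processings of rep-1 collapse into one entry; adding `data_processing_id` keeps them apart.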

> Since the metadata and rearrangements are stored together (HDF5 must be self-describing) I need a place to place the rearrangement data somewhere (as a field somewhere accessible within Repertoire). No joins are necessary as this file format is lazy

Where does the user get the list of identifiers in the first place? Some repertoires can have millions of rearrangements, so we don't want to store this list of (compound?) identifiers in the Repertoire object, at least not for file formats like JSON. That blows up the size. But maybe this is reasonable for HDF5?

Could you give a code example to do a simple HDF5 metadata query on repertoires, like on subject age or sex, then using that to access the rearrangements? Or at least how you are imagining it would work. Something like what we have in the python docs.
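For reference, the query pattern meant here looks roughly like this with plain in-memory records. This is a sketch under assumptions, not the HDF5 format under discussion: the helper names and all values are hypothetical, and only the field names (`subject`, `sex`, `age_min`, `repertoire_id`, `sequence_id`, `junction_aa`) come from the AIRR schema.

```python
# Hypothetical metadata and rearrangement records.
repertoires = [
    {"repertoire_id": "rep-1", "subject": {"sex": "female", "age_min": 34}},
    {"repertoire_id": "rep-2", "subject": {"sex": "male", "age_min": 51}},
]
rearrangements = [
    {"repertoire_id": "rep-1", "sequence_id": "seq-1", "junction_aa": "CARDYW"},
    {"repertoire_id": "rep-2", "sequence_id": "seq-1", "junction_aa": "CAKGGW"},
    {"repertoire_id": "rep-1", "sequence_id": "seq-2", "junction_aa": "CARRFW"},
]


def select_repertoire_ids(repertoires, sex):
    """Step 1: metadata query, returning the ids of matching repertoires."""
    return {r["repertoire_id"] for r in repertoires if r["subject"].get("sex") == sex}


def rearrangements_for(rearrangements, repertoire_ids):
    """Step 2: fetch the rearrangements that belong to those repertoires."""
    return [r for r in rearrangements if r["repertoire_id"] in repertoire_ids]


selected = select_repertoire_ids(repertoires, "female")
hits = rearrangements_for(rearrangements, selected)
print([r["sequence_id"] for r in hits])  # ['seq-1', 'seq-2']
```

The second step is the join on `repertoire_id`; the open question is how the HDF5 layout replaces it.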

bcorrie commented 1 year ago

There has been some recent dialog about storing AIRR Data in the h5ad file format, which is an HDF5-based format for storing AnnData files (https://anndata.readthedocs.io/en/latest/index.html), which in turn is used in packages like scirpy and other scverse packages (https://scverse.org/) to process single-cell omics files. If you are looking at storing AIRR Data in HDF5, you might want to look at what is going on in this area as they might have solved this - certainly a fair bit of thought has gone into this. The nice thing is that if you are also considering single-cell data, this gets you both!

See https://github.com/scverse/scirpy/issues/327

bcorrie commented 1 year ago

I should add that we are using the h5ad format in our single-cell data export and analysis of single-cell data from the AIRR Data Commons. We are not using the AIRR extension - we are just storing the Cell/GEX in an h5ad file as it facilitates analysis with tools like Conga and CellTypist, which we currently have integrated into the iReceptor Gateway (https://gateway.ireceptor.org).

airr-community / airr-standards

can we have a property `rearrangement` in the `SampleProcessing` object? #664