Closed gszep closed 1 year ago
Happy new year! ⭐ Any thoughts on this?
@gszep your request is not in line with the current architecture of the AIRR Schema:
- SampleProcessing is referenced by Repertoire [...] SampleProcessing) and data processing (represented by the respectively named properties in Repertoire).
- SampleProcessing should not contain information on Rearrangements.
Hi @gszep It might be helpful for you to explain what you are trying to accomplish, and we can describe how to do that with the current AIRR data model.
This way each sample is linked to a rearrangement in a 1 to 1 relationship 🙏🏼
This relationship already exists, except it is in the rearrangement record. In the rearrangement, the repertoire_id and sample_processing_id uniquely define the SampleProcessing for that rearrangement.
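A minimal sketch of that lookup in plain Python, assuming the repertoire metadata has already been loaded as dictionaries shaped like the AIRR Repertoire object (a sample list whose entries carry sample_processing_id). The helper name find_sample_processing and all example values are hypothetical.

```python
def find_sample_processing(repertoires, rearrangement):
    """Return the sample entry a rearrangement points to, or None if absent."""
    for rep in repertoires:
        if rep["repertoire_id"] != rearrangement["repertoire_id"]:
            continue
        for sample in rep.get("sample") or []:
            if sample.get("sample_processing_id") == rearrangement.get("sample_processing_id"):
                return sample
    return None


# Made-up metadata: one repertoire holding two samples.
repertoires = [
    {
        "repertoire_id": "rep-1",
        "sample": [
            {"sample_processing_id": "sp-1", "sample_id": "day0"},
            {"sample_processing_id": "sp-2", "sample_id": "day7"},
        ],
    },
]

# Made-up rearrangement record carrying both identifiers.
rearrangement = {
    "sequence_id": "seq-42",
    "repertoire_id": "rep-1",
    "sample_processing_id": "sp-2",
}

match = find_sample_processing(repertoires, rearrangement)
print(match["sample_id"])
```

The link is resolved from the rearrangement side, so nothing has to be stored inside SampleProcessing itself.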
@bussec A Repertoire appears to have a field which is of type List[SampleProcessing]. Can you clarify how one would store longitudinal samples at different timepoints from the same subject? As multiple samples in List[SampleProcessing] under the same repertoire, or does each sample get its own Repertoire? What about different sequencing runs from the same biological sample?
@schristley I am writing an AIRR-compliant HDF5 file format where
- metadata are stored as attributes and groups
- rearrangement fields are stored as one dimensional datasets
This facilitates lazy, parallel, constant-time random access to metadata and rearrangement data. Since the metadata and rearrangements are stored together (HDF5 must be self-describing), I need somewhere to put the rearrangement data (as a field accessible within Repertoire). No joins are necessary as this file format is lazy.
@bussec A Repertoire appears to have a field which is of type List[SampleProcessing]. Can you clarify how one would store longitudinal samples at different timepoints from the same subject? As multiple samples in List[SampleProcessing] under the same repertoire, or does each sample get its own Repertoire? What about different sequencing runs from the same biological sample?
The AIRR Standard was designed to be able to capture both cases, with the structure of the Repertoire object flexible enough to do that. The original design was done this way because we did not want to dictate the structural relationships between these objects, and instead leave that up to the study designer and/or data curator. The Repertoire object currently has two definitions/uses (yes, this is a flaw in the design): it was initially thought of in the more biological sense, where it would represent all of the B cells/T cells in a single subject at a given time point, but it is also used as a general grouping of samples in a variety of different ways.
So at least from a standards definition you can do what makes sense to you. For example, in the iReceptor repositories we have a 1:1 relationship between Repertoire and Sample (the List[SampleProcessing] is always of length 1). We do this because of its simplicity.
If you are writing code that handles general AIRR Repertoire data as input, you have to handle all cases - which can be quite challenging...
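As a rough sketch of what handling all cases can mean in practice: rather than assuming one sample per repertoire, code can flatten every repertoire into one row per sample entry, so the 1:1 layout and the grouped layout go through the same path. The field names follow the AIRR Repertoire object; the function name iter_samples and the data are made up for illustration.

```python
def iter_samples(repertoires):
    """Yield (repertoire_id, sample_processing_id, sample) for every sample entry."""
    for rep in repertoires:
        # Treat a missing or empty sample list as "no samples" instead of failing.
        for sample in rep.get("sample") or []:
            yield rep["repertoire_id"], sample.get("sample_processing_id"), sample


repertoires = [
    # 1:1 layout: a single sample per repertoire.
    {"repertoire_id": "one-to-one", "sample": [{"sample_id": "s1"}]},
    # Grouped layout: several samples under one repertoire.
    {
        "repertoire_id": "grouped",
        "sample": [
            {"sample_processing_id": "sp-1", "sample_id": "day0"},
            {"sample_processing_id": "sp-2", "sample_id": "day7"},
        ],
    },
]

rows = list(iter_samples(repertoires))
for repertoire_id, sample_processing_id, sample in rows:
    print(repertoire_id, sample_processing_id, sample["sample_id"])
```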
Can you clarify how one would store longitudinal samples at different timepoints from the same subject?
This is what RepertoireGroup is for. It is still experimental and will be revised/further developed in the (near-ish?) future, but the intent is exactly to solve this kind of problem.
@schristley I am writing an AIRR-compliant HDF5 file format where
This is interesting. I've used HDF5 a few times but only as a user, and only with small datasets. How large can HDF5 files become, multi-TB?
I know the HD means hierarchical data, but I'm not sure what flexibility it has to represent data models. Is it as flexible as JSON? Does HDF5 prefer data to be in normal form, like in an SQL database?
As there is no AIRR standard HDF5 file format, you cannot technically be AIRR-compliant ;-D but I assume you mean to be "compliant" with the AIRR Data Model. The AIRR Data Model is flexible enough that it can be optimized for different file formats. Can HDF5 represent compound indexes like SQL? That is, most AIRR objects require at least two identifiers; for rearrangements that's repertoire_id and sequence_id.
- metadata are stored as attributes and groups
Ok, I don't know what HDF5 attributes and groups are, but a quick Google search suggests that this looks like best practice. How are you handling the JSON? Are you using some tool like this or is there a better representation?
- rearrangement fields are stored as one dimensional datasets
One dimensional? Hmm, the sequence_id is the dimension?
This facilitates lazy, parallel, constant-time random access to metadata and rearrangement data.
This sounds like a hash function. Also, random access implies lookup by sequence_id. The sequence_id is not globally unique; according to AIRR it is only unique within a repertoire (i.e., repertoire_id), which is why you need both identifiers.
However, using just the two identifiers makes one assumption: that every repertoire has only a single DataProcessing object. If a repertoire has multiple data processings, they are essentially duplicates of each other (the same initial data run through different tools), and you don't normally want them mixed, so that would require using the data_processing_id in rearrangements to distinguish them. This is pretty rare and still an active topic for AIRR standards.
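To make the identifier discussion concrete, here is a hedged sketch of a compound-key index in plain Python. It keys each rearrangement on (repertoire_id, data_processing_id, sequence_id) and maps that key to a row number. The function name build_index and the records are illustrative only, and nothing here is specific to HDF5.

```python
def build_index(rearrangements):
    """Map (repertoire_id, data_processing_id, sequence_id) to a row number."""
    index = {}
    for row_number, record in enumerate(rearrangements):
        key = (
            record["repertoire_id"],
            record.get("data_processing_id"),  # None when only one processing exists
            record["sequence_id"],
        )
        if key in index:
            raise ValueError(f"duplicate compound key: {key}")
        index[key] = row_number
    return index


# Made-up records: the same sequence_id appears three times and is only
# disambiguated by the other two identifiers.
rearrangements = [
    {"repertoire_id": "rep-1", "data_processing_id": "dp-a", "sequence_id": "seq-1"},
    {"repertoire_id": "rep-2", "data_processing_id": "dp-a", "sequence_id": "seq-1"},
    {"repertoire_id": "rep-1", "data_processing_id": "dp-b", "sequence_id": "seq-1"},
]

index = build_index(rearrangements)
row = index[("rep-2", "dp-a", "seq-1")]
print(row)
```

Looking up by sequence_id alone would be ambiguous for these records, which is the point made above about needing the full set of identifiers.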
Since the metadata and rearrangements are stored together (HDF5 must be self-describing), I need somewhere to put the rearrangement data (as a field accessible within Repertoire). No joins are necessary as this file format is lazy.
Where does the user get the list of identifiers in the first place? Some repertoires can have millions of rearrangements, so we don't want to store this list of (compound?) identifiers in the Repertoire object, at least not for file formats like JSON. That blows up the size. But maybe this is reasonable for HDF5?
Could you give a code example that does a simple HDF5 metadata query on repertoires, like on subject age or sex, and then uses the result to access the rearrangements? Or at least how you are imagining it would work. Something like what we have in the Python docs.
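For illustration only, the two-step query being asked about can be pictured in plain Python, with in-memory dictionaries standing in for whatever the storage layer would provide. The nested subject fields follow the AIRR Repertoire layout; the helper names and data are hypothetical, and this is not the API of the proposed file format.

```python
def select_repertoire_ids(repertoires, sex):
    """Step 1: metadata query. Return the ids of repertoires whose subject matches."""
    return {
        rep["repertoire_id"]
        for rep in repertoires
        if (rep.get("subject") or {}).get("sex") == sex
    }


def rearrangements_for(rearrangements, repertoire_ids):
    """Step 2: keep only the rearrangements that belong to the selected repertoires."""
    return [r for r in rearrangements if r["repertoire_id"] in repertoire_ids]


# Made-up metadata and rearrangement records.
repertoires = [
    {"repertoire_id": "rep-1", "subject": {"subject_id": "s1", "sex": "female"}},
    {"repertoire_id": "rep-2", "subject": {"subject_id": "s2", "sex": "male"}},
]
rearrangements = [
    {"repertoire_id": "rep-1", "sequence_id": "seq-1"},
    {"repertoire_id": "rep-2", "sequence_id": "seq-1"},
    {"repertoire_id": "rep-1", "sequence_id": "seq-2"},
]

selected_ids = select_repertoire_ids(repertoires, sex="female")
selected = rearrangements_for(rearrangements, selected_ids)
print([r["sequence_id"] for r in selected])
```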
There has been some recent dialog about storing AIRR Data in the h5ad file format, which is an HDF5-based format for storing AnnData files (https://anndata.readthedocs.io/en/latest/index.html), which in turn is used in packages like scirpy and other scverse packages (https://scverse.org/) to process single-cell omics files. If you are looking at storing AIRR Data in HDF5, you might want to look at what is going on in this area, as they might have solved this; certainly a fair bit of thought has gone into it. The nice thing is that if you are also considering single-cell data, this gets you both!
I should add that we are using the h5ad format in our single-cell data export and in the analysis of single-cell data from the AIRR Data Commons. We are not using the AIRR extension; we are just storing the Cell/GEX in an h5ad file, as it facilitates analysis with tools like Conga and CellTypist (which we currently have integrated into the iReceptor Gateway, https://gateway.ireceptor.org).
I would like to request a rearrangement field in SampleProcessing as follows
This way each sample is linked to a rearrangement in a 1 to 1 relationship 🙏🏼