airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

normalized data model for metadata #144

Closed schristley closed 5 years ago

schristley commented 5 years ago

The MiAIRR data elements are currently specified in a denormalized way, which introduces a lot of redundancy when storing them in a file format for analysis tools, as well as when building data entry screens to manipulate them.

My initial attempt at a normalized data model (which is in VDJServer v1.1) follows the 1-to-n data model as originally documented in the yaml spec:

However, I discovered after using it for a while that it still has a very large amount of redundancy. The relationship between Study/Subject/Sample is okay; it is really the relationship between Sample, CellProcessing and NucleicAcidProcessing where much of the redundancy occurs.

Let's focus on those three objects and use a simple example to illustrate. Say we have 50 blood samples for TCR sequencing where we do flow sorting to separate CD4 and CD8 cells. At the sequencing output, we have 100 FASTQ files with the raw data.

Notice that we did save some redundancy by having only 50 Sample records instead of 100, but a large quantity of data in the CellProcessing and NucleicAcidProcessing records is redundant. The protocol for the flow sorting is identical for each sample, so why do we need to detail it 100 times? Likewise, the protocol for the sequencing step is also identical. What we would really want to enter is something like this (a rough sketch; identifiers and exact field names are illustrative):
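
```yaml
# 50 Sample records, one per blood draw:
sample:
    - sample_id: blood-01
    # ... through blood-50

# Only 2 CellProcessing records, one per flow sorting protocol:
cell_processing:
    - cell_processing_id: cp-cd4
      cell_subset: CD4 T-cell
    - cell_processing_id: cp-cd8
      cell_subset: CD8 T-cell

# Only 1 NucleicAcidProcessing record for the common sequencing protocol:
nucleic_acid_processing:
    - nucleic_acid_processing_id: nap-tcr
      # library prep and sequencing protocol details entered once
```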

Now we still need to link this information to the 100 FASTQ files we got from the sequencer, but this would be a simple tuple such as (filename, sample_id, cell_processing_id, nucleic_acid_processing_id) to link the pieces together.
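
For example, the 100 link records could be as simple as (filenames and ids illustrative):

```yaml
data_files:
    - { filename: run1_S01_CD4.fastq, sample_id: blood-01, cell_processing_id: cp-cd4, nucleic_acid_processing_id: nap-tcr }
    - { filename: run1_S01_CD8.fastq, sample_id: blood-01, cell_processing_id: cp-cd8, nucleic_acid_processing_id: nap-tcr }
    # ... 98 more, one per FASTQ file
```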

So why not do this? Well, I was working on this very idea (@laserson was also working on it independently when thinking about a metadata file format, as was CRWG for querying metadata) but ran into a roadblock. Specifically, while some data elements are "general" and can be normalized, others are "specific" and thus cannot be normalized. So for the above example, in CellProcessing, cell_subset is general; it's "CD4 T-cell" and "CD8 T-cell" for the two records. However, cell_number is specific to each physical blob of cells being processed, so there are 100 values that need to be stored.

What's especially annoying is that for a bunch of studies that we (i.e. VDJServer) have made MiAIRR-compliant, we don't have information for most of those "specific" fields, so they are just "NA" or "unknown".

How to handle this? One simple, naive idea is to tag each data element as either "general" or "specific" so applications have a hint about which is which. Another idea is to re-structure the yaml spec to define a specific normalized data model.
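
For the tagging idea, it could be as little as an extra attribute on each element in the spec (a sketch; the x-airr-scope attribute is hypothetical):

```yaml
cell_subset:
    type: string
    x-airr-scope: general     # one value shared by all samples using this protocol
cell_number:
    type: integer
    x-airr-scope: specific    # one value per physical blob of cells, i.e. 100 values here
```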

bcorrie commented 5 years ago

OK, I am going to play devil's advocate here...

Do we need to define (and enforce) a structure for meta-data from a data representation point of view? I am not saying having some logical structure in the meta-data is bad (ids that link these types of meta-data), but I am not convinced that having a data representation (file format/API response) that represents the normalized data with a complex relational structure is necessary or even a good idea.

I say this for a couple of reasons:

1) Will we be able to agree on an entity relationship structure? I am far from an expert here, so I will bow to your wisdom, but in our earlier discussions I thought we came to the conclusion that the types of relationships are unlikely to be agreed upon. I know there are some inferred in the MiAIRR spec and denoted in the YAML file, but there are others as well, I am sure... The one that jumps out at me with the above definition (again, I am no expert) is the lack of a relationship between sample and diagnosis (I believe we have discussed this previously). That is, wouldn't some people want to know what the diagnosis is when a given sample is taken? Also, I suspect there are a myriad of ways of defining relationships between entities at the sample/cell processing/nucleic acid processing levels that different groups might have...

2) Capturing a complex relationship in a file format (and an API response) makes it complex from a representation point of view and from a processing point of view. Flat, denormalized files are easy to process and easy to represent. Structure can be inferred from the denormalized data (in particular if the metadata has fields that represent these relationships).

3) These data sets are not huge, so some redundant information is not the end of the world. Is that naive? Compared to the scale of the sequence annotation data, this data size is relatively small.

There seems to be a trade-off here between rigorous definition/efficiency and ease of use/generality, especially at the data representation end of the spectrum. I can see wanting this structure in a repository (although even that isn't necessary), but I am not sure enforcing the structure in the data representation is required or desirable??? 8-)

schristley commented 5 years ago

> Do we need to define (and enforce) a structure for meta-data from a data representation point of view?

Sorry, I should have put more context at the beginning of the issue. We were supposed to discuss this at the last DRWG call in regards to the metadata format, but we used up all our time on 0 vs 1. I have been reading the doc @laserson wrote in preparation; read that if you haven't already, as he weighs different proposals including denormalized vs normalized. My intent was to post my thoughts about normalized, but I did not want to stuff it all in a Google comment (or take over Uri's doc), so I stuck it all in this issue.

Now it may be that we decide on denormalized, and this issue is moot. I might argue that the points you bring up, as well as the challenges I detailed, seem to push us toward denormalized.

I do want to address one of your comments:

> Structure can be inferred from the denormalized data (in particular if the metadata has fields that represent these relationships).

I'm not sure if this is true. If there are identifiers then I think it's possible, but if denormalized is the default (like in a data entry screen, e.g. CEDAR) then there may never be identifiers.

The reason we are defining a metadata format is because we want analysis tools to be interoperable and able to use that metadata as part of the analysis. Can we guarantee that proper relationships can be inferred in the denormalized format?

To give a specific example: given two rearrangement files, both from the same subject but with different sampling, tissue and NA processing, sequencing, etc., can we ensure that following the id links from a rearrangement record in each file will lead to the same subject? For that specific question, I think the answer is yes so long as unique subject_ids have been defined. We need to think carefully to make sure we can answer yes for all possibilities we can think of for the format.

bcorrie commented 5 years ago

I agree that there are some key relationships that we want to capture... and I think those critical ones should be defined in the MiAIRR standard - as several of them are now... They are captured in the sense that study, subject, sample each have unique identifiers within their "parent" classes. I am not sure we want to define much about the structure beyond that.

The good news is that if these structure fields are in MiAIRR, then CEDAR and other tools should provide them. But I don't think we should try to be "complete" or even particularly "complex" in creating that structure and normalizing data.

I think this is probably the most critical when talking about data representation. My view is that we should probably avoid coming up with a complex metadata file format that tries to represent that hierarchy in a normalized form. I am a fan of simple data representations, with the exception of those cases when data duplication is going to cause a huge problem... I don't think the meta-data is large enough to go to great lengths to avoid the duplication that a flattened file format would cause... That is only my opinion, I could be convinced otherwise 8-)

Just a quick example to demonstrate this. In the iReceptor gateway, we have a hierarchy that consists of lab, study, subject, sample. We thought it might be useful to group studies that come from a lab together in a hierarchy that is presented to the user... This is an inferred relationship from the data, with no hierarchy defined explicitly. Clearly this wouldn't work if how the labs were grouped together was different across data entries (we use the lab name; if it was spelled differently across studies, then the studies wouldn't be grouped). But that is fine - it is an inferred structure that we think is useful. We could just as easily group studies in a tree based on the disease they are studying and present that to the user... We can build these relationships on the fly - with the hierarchies being well defined (and very easy to build) for those fields that have IDs that are required in MiAIRR...

My $0.02 worth 8-)

javh commented 5 years ago

Could this be solved via an include system? I.e., reserving the top-level field include as a way to reference other (parent) metadata files?
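
For example (hypothetical syntax, just to sketch the idea):

```yaml
# sample-level metadata file; "include" pulls in the shared parent metadata
include:
    - study_metadata.airr.yaml    # parent file defining the Study and Subject records once
sample:
    sample_id: blood-01
    # only the sample-specific fields live here
```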

schristley commented 5 years ago

They are captured in the sense that study, subject, sample each have unique identifiers within their "parent" classes.

If those relationships are sufficient then I'm good with that. That suggests a simple normalized structure that currently is supported by MiAIRR. I agree, I don't want a particularly deep or complex hierarchy either.

What's left unspecified by MiAIRR is the relationship between those three objects and a set of rearrangement annotations (whether in a file or a repository). The MiAIRR sample is the primary biological sample and is not 1-to-1 with a FASTQ file with sequences in it, so if a tool is trying to differentiate between cell types, which is coded at the TissueProcessing level, then the sample identifier isn't sufficient to distinguish them. Stated another way: should Repertoire be a composite object (or tuple) with an identifier, or can we just add a repertoire_id field at the NucleicAcidProcessing level?
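
A sketch of the two alternatives (identifiers illustrative):

```yaml
# Alternative 1: Repertoire as a composite object (tuple) with its own identifier
repertoire:
    repertoire_id: rep-01
    sample_id: blood-01
    cell_processing_id: cp-cd4
    nucleic_acid_processing_id: nap-tcr

# Alternative 2: a plain repertoire_id field at the NucleicAcidProcessing level
nucleic_acid_processing:
    nucleic_acid_processing_id: nap-tcr
    repertoire_id: rep-01
```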

I actually like the idea of supporting both, a fully denormalized format, and a normalized format based upon study, subject, sample. And being able to go back and forth between the two might be a useful validation check on the metadata. That would also be a good feature to add to the reference library.

My thought is that the in-memory representation, e.g. used by an analysis tool processing rearrangement data, could be either denormalized or normalized. The normalized form would be particularly useful if an analysis tool adds annotation fields into the metadata records; e.g. my tools operate on groups of repertoires, so they add a group field at the repertoire level, then save the metadata with the additional annotation.

bcorrie commented 5 years ago

@schristley

> I actually like the idea of supporting both, a fully denormalized format, and a normalized format based upon study, subject, sample. And being able to go back and forth between the two might be a useful validation check on the metadata. That would also be a good feature to add to the reference library.

Do you mean having two file formats, a normalized and denormalized?

schristley commented 5 years ago

Possibly. I suppose there could be a single denormalized file format with a corresponding in-memory normalized structure, or there could be two file formats. Regardless, that's just an idea that we probably cannot really decide upon until later. I think the key question that needs to be answered first is the relationship between the MiAIRR objects and rearrangements. Once that is decided, the file format may just fall into place.

schristley commented 5 years ago

Also, we are forgetting the SoftwareProcessing section of MiAIRR. We should think about how to properly support multiple analysis workflows on the same FASTQ file coming off the sequencer. If we run two workflows from the same FASTQ and produce two AIRR TSV files, should each file have its own repertoire_id (two separate records in a denormalized format)?

lgcowell commented 5 years ago

I would say the same repertoire_id, because both are analyses of the same repertoire.

schristley commented 5 years ago

That makes sense, as we want to keep the semantics of a "sample repertoire" as defined by the study with repertoire_id. We still need to incorporate the relationship with an analysis workflow, so an additional rearrangement_set_id field might serve that purpose.
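
E.g., two analyses of the same FASTQ would then share a repertoire_id but get distinct rearrangement_set_ids (a sketch; tool names just for illustration):

```yaml
rearrangement_set:
    - rearrangement_set_id: rs-01
      repertoire_id: rep-01              # same sample repertoire
      software_processing: IgBlast workflow
    - rearrangement_set_id: rs-02
      repertoire_id: rep-01              # same sample repertoire
      software_processing: MiXCR workflow
```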

bcorrie commented 5 years ago

This seems like one of those cases where we need to discuss the value of having an identifier at the "meta-data" level for two different analyses of the same repertoire. Or is it acceptable to use the rearrangement_set_id for each analysis to identify the set of rearrangements that belong to a unique repertoire analysis?

In looking at this more closely, I am confused about a couple of things in the MiAIRR standard and the spec (gasp). 8-)

I have two questions:

1) My understanding of the SoftwareProcessing group in MiAIRR is that it describes how rearrangements are created for a specific analysis of a repertoire, correct? In the current spec we have a 1:1 relationship between Software Processing and the study. This should probably be 1:n in that a given study could have more than one analysis of a repertoire (use different tools). I suppose this is exactly what we are discussing in this thread. In addition, one could imagine having different analyses from the same repertoire in different studies. But that is probably not practical to solve in the short term...

2) This may be totally my misunderstanding/ignorance, but shouldn't the germline database be part of the Software Processing class of data? Currently it is part of the rearrangement, so each sequence annotation would have a germline database assigned to it. Isn't the germline database part of the software processing pipeline that created a set of rearrangements? That is, it is the combination of tool (e.g. IgBlast) and germline DB (e.g. IMGT version X.Y.Z) that needs to be tracked as part of the software processing to create a set of rearrangements for a repertoire. Correct me if I am wrong... I don't think the way it is now in MiAIRR is wrong, but it also doesn't seem the most logical either. I can't see why you would have a different germline DB used for a given set of rearrangements without using a different Software Processing pipeline. Or, more importantly, have many different germline DBs across many rearrangements for a specific repertoire.

schristley commented 5 years ago

> confused about a couple of things in the MiAIRR standard and the spec

Me too! MiAIRR left the software processing as mostly unspecified and unstructured, and what we have in the spec is a quick hash that was never properly vetted, so it's good we are doing it now.

> In the current spec we have a 1:1 relationship between Software Processing and the study. This should probably be 1:n in that a given study could have more than one analysis of a repertoire (use different tools).

I had the exact same thought. I've been playing around with the spec in my development environment to make that 1:n to see how it looks. MiAIRR defined it as 1:1 with the study because you include all of the software processing information for all of the subjects, samples, files, etc. in a big block of unstructured text. I'm thinking 1:n with the study might not be very good either, because any processing performed differently for individual files is hidden within the unstructured text. What might be better is 1:1 with a NucleicAcidProcessing record, which allows specifying for each individual FASTQ file how it was processed.
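
I.e., something like this sketch (field names illustrative):

```yaml
nucleic_acid_processing:
    - nucleic_acid_processing_id: nap-01
      filename: run1.fastq              # the raw file produced from this library
      software_processing:
          # 1:1 record describing how this specific FASTQ file was processed
          tool: IgBlast
          tool_version: 1.x
```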

> shouldn't the germline database be part of the Software Processing class of data?

Yes, I agree. I think the current situation is a hack; because that info is unstructured in MiAIRR, it was thrown into the rearrangements so it would be specific. I could be misremembering though, as that was a while ago. With the germline WG moving forward with specs and software, we need to more properly design this relationship. That is one of the TODO items for DRWG before the next AIRR meeting.

bussec commented 5 years ago

@bcorrie :

> My understanding of the SoftwareProcessing group in MiAIRR is that it describes how rearrangements are created for a specific analysis of a repertoire, correct?

Yes, that was (and it still is) the idea.

> In the current spec we have a 1:1 relationship between Software Processing and the study. This should probably be 1:n in that a given study could have more than one analysis of a repertoire (use different tools).

This is actually beyond the scope of MiAIRR, which assumes a fully denormalized structure. However, in the NCBI implementation the DOI link to the SoftwareProcessing document is only provided in data set 6 (i.e. each individual rearrangement). Thus it would be possible to have a 1:n relation between the raw data at SRA and SoftwareProcessing. Note that there is currently no "down-link", i.e. the SRA record is not required to refer to the DOI (as it might belong to a third party, thereby creating problems with access control).

> This may be totally my misunderstanding/ignorance, but shouldn't the germline database be part of the Software Processing class of data?

Not sure whether it should be or not. IIRC there were two reasons for the current specification:

  1. We wanted to be able to support multiple VDJ segment calls, coming from different databases. This is not spelled out in MiAIRR itself (as it is not minimal), but documented here in the MiAIRR-to-NCBI specification. Yes, this could still have been documented in SoftwareProcessing, however a) this might be done selectively for just a couple of (contested) segment calls and b) it should be easily accessible (see 2.)

  2. The /db_xref qualifier in the INSDC FT offered a standardized way to annotate this. I am aware that this way the implementation influences the design (which it should not), but with all the requirements demanded by AIRR for annotation of the rearrangement, this one did not seem to be a big thing.

bussec commented 5 years ago

@schristley :

As discussed in #145, NucleicAcidProcessing will not necessarily have a 1:1 relation with the sequencing run. Even if we solve that issue, the re-sequencing scenario described there might be at variance with the 1:n assumption of SequencingRun:SoftwareProcessing, as e.g. two sequencing runs (from the same library) could be processed together. Especially if a pipeline uses minimum read thresholds, this would not be equivalent to processing both runs independently.

schristley commented 5 years ago

An example "denormalized" metadata file
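
Roughly, each repertoire record is self-contained and repeats its full study/subject/sample context (a sketch, not the actual file):

```yaml
repertoire:
    - repertoire_id: rep-01
      study:   { study_id: S1 }     # full study block repeated in every record
      subject: { subject_id: P1 }   # full subject block repeated in every record
      sample:  { sample_id: blood-01, cell_subset: CD4 T-cell }
    - repertoire_id: rep-02
      study:   { study_id: S1 }
      subject: { subject_id: P1 }
      sample:  { sample_id: blood-01, cell_subset: CD8 T-cell }
```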

bcorrie commented 5 years ago

Moving the discussion about the file format from #176 to this issue.

@schristley as you say in #176 I think we have discussed the structure a lot, but I don't think we have discussed the file format extensively.

In thinking about the structure and how it is represented in the YAML example, it looks like it is denormalized, but there is a question about the degree of denormalization (is that really a thing? 8-). Or alternatively, what is the most granular thing in your denormalization? Or, put another way, what entities does your denormalized file contain?

In your example, the denormalized file consists of a set of repertoires, but each repertoire could have more than one rearrangement set. So isn't the logical denormalized entity for such a file the rearrangement set, not the repertoire? Note this conversation seems eerily familiar to me, so excuse me if we have gone over this, but if you are denormalizing then shouldn't you denormalize at the most granular level?

I think my confusion comes from the fact that there was a definition that we had discussed in CRWG here: https://github.com/airr-community/common-repo-wg/issues/17 that I thought I understood, but the current definition that we have is different, and I seem to have missed when that change occurred.

In particular the repertoire definition went from having a single sample and a 1:n relationship with rearrangement sets (which I understand) to having an array of samples and an array of rearrangement sets (which I don't understand). The array of rearrangement sets I get, as this is simply encoding the 1:n relationship from 1 repertoire possibly having more than one rearrangement set. The confusing thing to me is that a single repertoire can come from more than one sample. Does this make sense? It seems to me that this structural change actually makes it impossible to truly denormalize the data in a meaningful way. In fact, because there is an array of samples and an array of rearrangement sets in a single repertoire, doesn't this make it impossible (at least in a general sense) to map a rearrangement set to a unique sample?

I'm confused... 8-)

I also think that the denormalized YAML file is not correct with the current spec, as the sample entity in the YAML file is a single sample, where I believe it should be an array of samples with a single sample entity in it.

schristley commented 5 years ago

> So isn't the logical denormalized entity for such a file the rearrangement set, not the repertoire? Note this conversation seems eerily familiar to me, so excuse me if we have gone over this, but if you are denormalizing then shouldn't you denormalize at the most granular level?

You are correct that there is a choice about what entity (level) to denormalize around. We could pick Study, for example, and then a whole study with all its subjects, samples, etc. would be in a single object. So why repertoire? Well, that came out of CRWG, which felt that an immunological viewpoint was more appropriate for querying. The query endpoints for repositories should therefore reflect this fact. We debated multiple endpoints for the different entities (study, subject, etc.) but simpler won the day, with a single /repertoire endpoint for querying study metadata.

One thing I've been swirling around in my brain is how to explain the fundamental difference between the study metadata and the rearrangement data. Previously I've tried to explain them as biological versus informatic concepts; here is a new attempt with other concepts:

- The study metadata describes the intent of the experimental design, i.e. what the scientist planned to measure and how.
- The rearrangement data describes what was actually observed when the experiment was performed and analyzed.

So to put it another way, CRWG felt that users would want to query on the first bullet, the intent of the experimental design, when querying study metadata. If querying was performed on RearrangementSet then this would not be satisfied; it would be querying on the wrong concept.

Now of course, you can query on what was actually observed, which leads to the second endpoint /rearrangement for doing that.

Now Repertoire isn't actually one of the entities in study metadata; it is a composite object which links to all of the study metadata entities, as well as bridges over to the observed data and its annotation/analysis. The current definition still holds but there has been some refinement of the structure of the Repertoire object.

> The confusing thing to me is that a single repertoire can come from more than one sample. Does this make sense?

Yes, this is one of the refinements. Partially spurred by #145, where the current structure could not represent the re-sequencing of libraries (i.e. the same NucleicAcidProcessing but multiple sequencing runs). It was also recognized that sometimes multiple samples are combined together into a single repertoire. It's not common but it happens often enough that we wanted to support it directly. I don't think an issue was created for this; I think all the discussion happened in the DRWG calls. A simple example: imagine two blood draws taken from the same person, processed and sequenced separately, but where the scientist considers the two samples together to be one repertoire.

> I also think that the denormalized YAML file is not correct with the current spec.

No, it might not be, with all the discussion and changes. The relationship between subject and diagnosis should be corrected too.

schristley commented 5 years ago

@bcorrie I've updated florian.airr.yaml so it should match the schema now.

bcorrie commented 5 years ago

> Yes, this is one of the refinements. Partially spurred by #145, where the current structure could not represent the re-sequencing of libraries (i.e. the same NucleicAcidProcessing but multiple sequencing runs). It was also recognized that sometimes multiple samples are combined together into a single repertoire.

I think I understand the need for having more than one sample in a repertoire (although I am still vague on the definition of repertoire). My confusion stems from the fact that in the current definition of repertoire we can have two different samples (because sample is an array) and two different rearrangement sets (because that is also an array), and there is no way to determine which rearrangement set is related to which sample. This seems problematic to me, no? Consider the case:

In my array of samples, I have a "Naive B Cell" sample and a "Memory CD4+ T Cell" sample. In my array of rearrangement sets, I have one rearrangement set that has a bunch of TCR rearrangements and I have one rearrangement set that has a bunch of IG rearrangements.

Because both of these are arrays, they could be listed in any order. So there is no way to know whether the IG rearrangement set belongs to the "Naive B Cell" or the "Memory CD4+ T Cell" sample.

Is this not really bad??? 8-)

It seems to me like either:

  1. We need to go back to having a repertoire having only one sample, or
  2. We need to make it so that each sample has 1 or more rearrangement sets (the array of rearrangement sets is within the sample object).

???

bcorrie commented 5 years ago

For example:

```yaml
Repertoire:
    discriminator: AIRR
    type: object
    properties:
        repertoire_id:
            type: string
            description: Identifier for the repertoire object.
        study:
            $ref: '#/Study'
        subject:
            $ref: '#/Subject'
        sample:
            type: array
            items:
                allOf:
                    - $ref: '#/Sample'
                    - $ref: '#/CellProcessing'
                    - $ref: '#/NucleicAcidProcessing'
                    - $ref: '#/SequencingRun'
                    - type: object
                      properties:
                          sequence_annotation:
                              type: array
                              items:
                                  $ref: '#/RearrangementSet'
```

Not sure if my YAML syntax is correct... 8-)

schristley commented 5 years ago

@bcorrie I know I wrote another repertoire definition recently, it took me a moment but I found it in this comment on the PR.

> we can have two different samples (because sample is an array) and two different rearrangement sets (because that is also an array), and there is no way to determine which rearrangement set is related to which sample. This seems problematic to me, no?

A single rearrangement set is supposed to apply to all samples in the repertoire, not to individual samples.

> In my array of rearrangement sets, I have one rearrangement set that has a bunch of TCR rearrangements and I have one rearrangement set that has a bunch of IG rearrangements.

I guess technically this is possible but it doesn't make much sense (to me) from an annotation/analysis point of view. There are no tools that I know of currently that can analyze a "repertoire" with both B and T cells within it; generally it's all one or the other.

To make it more valid, let's say it is two samples both for TCR. You would be suggesting that one sample was processed (say) with IgBlast, and the other sample was processed by MiXCR. Again, this is technically possible. You are suggesting that this implies two rearrangement sets, but that is incorrect; it is only a single rearrangement set, as it applies to all samples in the repertoire. The description of the software processing would describe the multiple tools (IgBlast, MiXCR) used, what files they process, etc.

My argument would be that the relationship between the samples and the rearrangement set you are looking for would be in SoftwareProcessing. However, maybe I'm missing something; are you able to point to any studies (maybe in iReceptor) where this situation arises?

schristley commented 5 years ago

@bcorrie @bussec During CRWG, Brian asked this question:

> Can a rearrangement record come from more than one sample?

and my initial response was no, but after thinking about it, yes, it is possible. The reason is that merging/collapsing can occur during software processing, which breaks the 1-to-1 relationship between a rearrangement and a sample. Here is a simple example.

Consider a repertoire with two samples, SampleA and SampleB, which have two different sequencing runs and thus two sets of raw files. Imagine that a sequence A in SampleA is identical to a sequence B in SampleB. In some processing tools, sequences A and B generate only a single rearrangement record where duplicate_count=2, i.e. identical sequences are collapsed with a count. In this case, that single rearrangement record is associated with two samples.
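
In the rearrangement data, that could look like the following sketch (duplicate_count is the existing spec field; sample_ids is illustrative):

```yaml
# single collapsed rearrangement record
rearrangement:
    sequence_id: seq-1
    duplicate_count: 2                # sequence A and sequence B collapsed into one record
    sample_ids: [SampleA, SampleB]    # the record now maps back to two samples
```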