airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

repertoire_id VS rearrangement_set_id #181

Closed bcorrie closed 4 years ago

bcorrie commented 5 years ago

Creating an explicit issue around repertoire_id and rearrangement_set_id to capture the discussion around their mapping as discussed here (https://github.com/airr-community/airr-standards/issues/144#issuecomment-466566315)

@schristley says:

To make it more valid, let's say it is two samples both for TCR. You would be suggesting that one sample was processed (say) with IgBlast, and the other sample was processed by MiXCR. Again this is technically possible. You are suggesting that this implies two rearrangement sets, but that is incorrect, it is only a single rearrangement set, as it applies to all samples in the repertoire, so the description of the software process would describe the multiple tools (IgBlast, MiXCR) used, what files they process, etc.

My argument would be that the relationship between the samples and the rearrangements set you are looking for would be in SoftwareProcessing. However, maybe I'm missing something, are you able to point to any studies (maybe in iReceptor) where this situation arises?

It was my understanding that one of the main purposes of the RearrangementSet (and the rearrangement_set_id) was to differentiate between the different SoftwareProcessing that a given sample might undergo to produce a different set of rearrangements. That is, if I have an aggregate sample object as defined in "Repertoire":

        sample:
            type: array
            items:
                allOf:
                    - $ref: '#/Sample'
                    - $ref: '#/CellProcessing'
                    - $ref: '#/NucleicAcidProcessing'
                    - $ref: '#/SequencingRun'

I then process that with two different annotation tools as described above (igblast and MiXCR). Given the schema we have, that would create two different "RearrangementSet" objects with two different rearrangement_set_ids with different SoftwareProcessing metadata for each.

In the Florian data for a given repertoire, this would look like:

    sequence_annotation:
      - rearrangement_set_id: RearrangementSet1MiXCR
        software:
          software_versions: "MiXCR v3.0.5"
      - rearrangement_set_id: RearrangementSet2igblast
        software:
          software_versions: "igblast 1.12.0"

In a related AIRR TSV rearrangement file, I would have something like:

sequence_id    v_call    d_call    j_call    repertoire_id    rearrangement_set_id
SEQ1234        IGHV3     IGHD3     IGHJ3     REP1             RearrangementSet1MiXCR
... a bunch of annotated sequences from MiXCR for the same repertoire ...
SEQ1234        IGHV3     IGHD3     IGHJ3     REP1             RearrangementSet2igblast
... a bunch of annotated sequences from igblast for the same repertoire ...

In the AIRR TSV file I need to be able to differentiate between these two rearrangement sets that come from the same repertoire... So I need to have a rearrangement_set_id for the two different annotation processes that were carried out, no?
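
As a rough illustration of the point above (not code from the AIRR reference library), the sketch below groups the rows of a small, made-up AIRR TSV by the pair (repertoire_id, rearrangement_set_id). The function name `split_by_rearrangement_set` and the inline data are assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical AIRR TSV content mirroring the example above (tab-separated).
TSV = (
    "sequence_id\tv_call\td_call\tj_call\trepertoire_id\trearrangement_set_id\n"
    "SEQ1234\tIGHV3\tIGHD3\tIGHJ3\tREP1\tRearrangementSet1MiXCR\n"
    "SEQ1234\tIGHV3\tIGHD3\tIGHJ3\tREP1\tRearrangementSet2igblast\n"
)


def split_by_rearrangement_set(handle):
    """Group rows by (repertoire_id, rearrangement_set_id)."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["repertoire_id"], row["rearrangement_set_id"])
        groups[key].append(row)
    return dict(groups)


groups = split_by_rearrangement_set(io.StringIO(TSV))
# Grouping on repertoire_id alone cannot separate the two annotation runs.
by_repertoire_only = {key[0] for key in groups}
```

With both identifiers the two annotation runs of SEQ1234 stay apart; with repertoire_id alone they fall into the same bucket.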

javh commented 5 years ago

Not to muddy the waters, but, from the other issue, the idea of a single rearrangement set with B and T cells isn't that peculiar. Especially in the context of single-cell data, where you might have selective amplification of both TR and IG, or neither if you attempt to reconstruct the V(D)J-C sequences from total gene expression.

It might make sense to prefilter/split the data somehow to do V(D)J alignment on a TR or IG restricted germline database, but I'm not sure if that constitutes a separate "repertoire".

schristley commented 5 years ago

It was my understanding that one of the main purposes of the RearrangementSet (and the rearrangement_set_id) was to differentiate between the different SoftwareProcessing that a given sample might undergo to produce a different set of rearrangements.

Close. I'd alter that to use the word repertoire instead of sample; using "sample" gives it a different meaning:

RearrangementSet (and the rearrangement_set_id) was to differentiate between the different SoftwareProcessing that a given repertoire might undergo to produce a different set of rearrangements.

In the AIRR TSV file I need to be able to differentiate between these two rearrangement sets that come from the same repertoire... So I need to have a rearrangement_set_id for the two different annotation processes that were carried out, no?

That's correct.

bcorrie commented 5 years ago

@schristley any further thoughts on this after Data Rep and MiniStd meetings last week? I think the main critical path for me in the context of a metadata file format is that we need to be able to easily map from the lowest level identifier at the rearrangement level (rearrangement_set_id) up through the hierarchy to sample, subject, and study. I don't think we have that in the current instantiation of the Repertoire object.

My takeaway was that maybe we should be considering the Repertoire, as you are thinking of using it, to capture "groupings for analysis", and the Rearrangement Set, which captures "groupings for structure" at the most granular component of the MiAIRR hierarchy, as two different use cases for a metadata file format (and indeed the CRWG API response)? It feels to me like we are trying to fit two different things onto one construct and it isn't working very well...

schristley commented 5 years ago

any further thoughts on this after Data Rep and MiniStd meetings last week?

Nope.

we need to be able to easily map from the lowest level identifier at the rearrangement level (rearrangement_set_id) up through the hierarchy to sample, subject, and study

Follow the repertoire_id, that will give you the study, subject and samples.

bcorrie commented 5 years ago

The problem is it doesn't give a unique path from a specific rearrangement (with a specific rearrangement_set_id) in an AIRR TSV file to a repertoire.sample as far as I can tell.

Correct me if I am wrong, but I believe it is true to say that any given single line in an AIRR TSV rearrangement file (which has an associated rearrangement_set_id) should be able to be derived from a unique set of MiAIRR objects following the object hierarchy in the airr-schema.yaml file. If you consider this from the bottom up for a single rearrangement and the -> operator as being a "produced by/from" relationship, I think that a given rearrangement in an AIRR TSV file (a sequence that has been annotated with some tool to assign v_call, d_call, j_call, junction_aa, etc) that has a specific rearrangement_set_id is produced as follows.

Rearrangement -> SoftwareProcessing -> SequencingRun -> NucleicAcidProcessing -> CellProcessing -> Sample -> Subject -> Study

With the current Repertoire object, this relationship cannot be constructed uniquely because we have an array of repertoire.sample and an array of repertoire.sequence_annotation with no mapping between them. We have something that looks more like this:

Rearrangement -> SoftwareProcessing ->|SequencingRun -> NucleicAcidProcessing -> CellProcessing -> Sample| -> Subject -> Study
                                      |SequencingRun -> NucleicAcidProcessing -> CellProcessing -> Sample|
                                      |SequencingRun -> NucleicAcidProcessing -> CellProcessing -> Sample|
                                      |SequencingRun -> NucleicAcidProcessing -> CellProcessing -> Sample|

Thus it is impossible for me to determine exactly which repertoire.sample a specific rearrangement in the AIRR TSV file comes from. Thus I can't tell which SequencingRun, which NucleicAcidProcessing, nor which CellProcessing was applied for this specific rearrangement.

I think this is problematic, no? Am I missing something?
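
A minimal sketch of the ambiguity described above, assuming a heavily trimmed Repertoire held as a Python dict. The field names follow the thread; the values and the helper `candidate_samples` are invented for illustration and are not part of the AIRR schema or library.

```python
# Hypothetical, trimmed Repertoire: two sample entries, one annotation entry.
repertoire = {
    "repertoire_id": "REP1",
    "sample": [
        {"sample_id": "S1", "sequencing_run_id": "RUN1"},
        {"sample_id": "S2", "sequencing_run_id": "RUN2"},
    ],
    "sequence_annotation": [
        {"rearrangement_set_id": "RearrangementSet1MiXCR"},
    ],
}

# One rearrangement row carrying the two identifiers discussed in the thread.
rearrangement = {
    "sequence_id": "SEQ1234",
    "repertoire_id": "REP1",
    "rearrangement_set_id": "RearrangementSet1MiXCR",
}


def candidate_samples(rearrangement, repertoires):
    """Return every sample entry the rearrangement could have come from."""
    rep = repertoires[rearrangement["repertoire_id"]]
    # Nothing carried by the rearrangement narrows this list any further.
    return rep["sample"]


candidates = candidate_samples(rearrangement, {"REP1": repertoire})
```

The lookup returns both sample entries, which is the non-unique path the comment describes; with a single-element sample array the same lookup would be unambiguous.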

bcorrie commented 5 years ago

I created a new branch called rearrangement_set with the addition of what I called a RearrangementSetFlat object in the airr-schema.yaml, which seems to me to make sense as a mechanism to store repertoire level data starting at the most granular level (the RearrangementSet).

I also made a change to Florian's test data (a new file called florian.airr.flat.yaml) which uses this structure. Because there is a 1-1 mapping of RearrangementSets to Samples in Florian's data this is almost identical to the original florian.airr.yaml except the top level object is RearrangementSetFlat (rather than Repertoire) and there is no longer an array (of one element) for the sample and sequence_annotation fields.

schristley commented 5 years ago

I think this is problematic, no? Am I missing something?

Ok, I understand. This is a problem for those uncommon studies which combine multiple samples together into a repertoire. For common studies which have a single sample per repertoire, this isn't an issue because there is only one sample, so no ambiguity. Do you agree?

Okay, given this, can we handle these uncommon studies in a different way? Or come up with a way that allows for what you want?

The first thing we have to deal with is this assumption:

unique path from a specific rearrangement in an AIRR TSV file to a specific repertoire.sample

Unfortunately, this isn't always true. It's very common for tools to collapse duplicate sequences into a single sequence entry, and only process that single sequence entry thus saving time and computation. If sequences are collapsed across samples, then that single sequence entry breaks the assumed one-to-one relationship between the sequence and sample. I don't see a solution to this because I don't know how to force a one-to-one relationship on that collapsed sequence.
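
To make the collapsing problem concrete, here is a small sketch under stated assumptions: reads are (sequence_id, sample label, sequence) tuples, duplicates are exact string matches, and the first sequence_id seen is kept. Real tools differ in how they collapse; `collapse_duplicates` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical reads: (sequence_id, sample label, nucleotide sequence).
reads = [
    ("SEQ1", "Sample1", "ACGTACGT"),
    ("SEQ2", "Sample2", "ACGTACGT"),  # same sequence, different sample
    ("SEQ3", "Sample2", "TTGGCCAA"),
]


def collapse_duplicates(reads):
    """Keep one entry per distinct sequence and record where copies came from."""
    collapsed = {}
    sources = defaultdict(set)
    for sequence_id, sample, sequence in reads:
        collapsed.setdefault(sequence, sequence_id)  # first ID seen is kept
        sources[sequence].add(sample)
    return collapsed, dict(sources)


collapsed, sources = collapse_duplicates(reads)
```

The surviving entry for ACGTACGT keeps the ID SEQ1 but was observed in both Sample1 and Sample2, so no single sample can be assigned to it.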

For argument's sake, let's say we don't have to worry about these collapsed sequences, or the person doing the software processing runs tools in a way to prevent any collapsing across samples, thus a one-to-one relationship is preserved. But this is still problematic...

First, let's digress briefly into how links would be followed in the current schema.

In your example, you always seem to want to go through SoftwareProcessing, which I consider odd. I don't actually understand what "foreign keys" you would use to go from a Rearrangement directly to SoftwareProcessing and onward. The quicker approach is to go right to the repertoire where you have direct access to the study, subject and samples. Here I'm showing the specific "foreign key" used to form the relationship:

Rearrangement -> repertoire_id -> Repertoire -> study, subject, sample, etc.

If you want to get access to the SoftwareProcessing fields then you can use rearrangement_set_id in combination with repertoire_id like this:

Rearrangement -> repertoire_id -> Repertoire -> rearrangement_set_id -> RearrangementSet -> SoftwareProcessing

Here rearrangement_set_id maintains its current definition of distinguishing different software processing for the repertoire. As an aside, maybe we should rename rearrangement_set_id, which is generic, to something like software_processing_id which is more meaningful?
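
The two key chains above can be sketched as plain dictionary lookups. This assumes the metadata is held in memory as dicts keyed by repertoire_id, with a layout mirroring the sequence_annotation example earlier in the thread; `resolve` and the ID values are hypothetical.

```python
# Hypothetical in-memory metadata keyed by repertoire_id.
repertoires = {
    "REP1": {
        "study": {"study_id": "STUDY1"},
        "subject": {"subject_id": "SUBJ1"},
        "sample": [{"sample_id": "S1"}],
        "sequence_annotation": [
            {"rearrangement_set_id": "RearrangementSet1MiXCR",
             "software": {"software_versions": "MiXCR v3.0.5"}},
            {"rearrangement_set_id": "RearrangementSet2igblast",
             "software": {"software_versions": "igblast 1.12.0"}},
        ],
    }
}


def resolve(rearrangement, repertoires):
    """Follow repertoire_id, then rearrangement_set_id within that repertoire."""
    rep = repertoires[rearrangement["repertoire_id"]]
    for annotation in rep["sequence_annotation"]:
        if annotation["rearrangement_set_id"] == rearrangement["rearrangement_set_id"]:
            return rep, annotation["software"]
    raise KeyError(rearrangement["rearrangement_set_id"])


rep, software = resolve(
    {"repertoire_id": "REP1", "rearrangement_set_id": "RearrangementSet2igblast"},
    repertoires,
)
```

Study, subject and samples come straight from the repertoire; the software entry needs the second key because one repertoire can hold several annotation runs.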

Anyways, back to the matter at hand, one possible solution is that the sequence_id can be used to find the original sequence in the original raw data file, and that would give you the SequencingRun and thus the Sample like this:

Rearrangement -> repertoire_id -> Repertoire -> sequence_id -> (search through raw sequencing files) -> SequencingRun

But this is no good: for one, the raw files might not be available, and two, those raw files are not indexed, so that would require searching through the whole file -> not practical!

So the only other solution I can think of is to add a new identifier into Rearrangement that explicitly links to the sample it came from, but there is ambiguity here. This is going to require a lengthy description so bear with me.

First, remember that we introduced samples as an array because some experimental designs combine multiple samples together into a single repertoire. So taking your example above where there are four sample entries, and I'll flip the arrows like this:

Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun

Second, remember that the metadata is denormalized, so data might be duplicated at any of these levels. Let's explicitly write them out as specific cases.

CASE 1: Replication at SequencingRun, so the Sample, CellProcessing and NucleicAcidProcessing data is identical across the 4 samples. It looks like this in a "normalized" way:

Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
                                                  -> SequencingRun
                                                  -> SequencingRun
                                                  -> SequencingRun

CASE 2: Replication at NucleicAcidProcessing, so the Sample and CellProcessing data is identical across the 4 samples, and this implies the SequencingRun is different.

Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
                            NucleicAcidProcessing -> SequencingRun
                            NucleicAcidProcessing -> SequencingRun
                            NucleicAcidProcessing -> SequencingRun

CASE 3: Replication at CellProcessing, so the Sample data is identical across the 4 samples, and this implies the SequencingRun and NucleicAcidProcessing are different.

Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
          CellProcessing -> NucleicAcidProcessing -> SequencingRun
          CellProcessing -> NucleicAcidProcessing -> SequencingRun
          CellProcessing -> NucleicAcidProcessing -> SequencingRun

CASE 4: Replication at Sample so everything is different. Christian has objected to this being a valid scenario, so we could drop this case without problem.

Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun

Now of course there are other cases where you mix and match replication at different levels for different numbers of samples, but let's set those aside for now.

From looking at these cases, what sticks out is that SequencingRun is always different. So that seems to be the solution: if we can link to the specific SequencingRun then we are set. Hold on, not so fast...

Third, remember that samples can be multiplexed together into a SequencingRun, and are demultiplexed into their individual samples during software processing. These "samples" don't even have to be from the same repertoire, they can be completely separate subjects for example. Let's show how 3 different repertoires are multiplexed together:

CASE 5: No replication, samples from different repertoires are multiplexed into the same SequencingRun, so all data is different except for the SequencingRun, which is identical for all:

Repertoire -> Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
Repertoire -> Sample -> CellProcessing -> NucleicAcidProcessing
Repertoire -> Sample -> CellProcessing -> NucleicAcidProcessing

Just for grins let's throw in a combination of both, multiplexed samples and some replication:

CASE 6: Three repertoires are multiplexed together. One repertoire is replicated at the CellProcessing level so NucleicAcidProcessing is different but then multiplexed into the same SequencingRun

Repertoire -> Sample -> CellProcessing -> NucleicAcidProcessing -> SequencingRun
Repertoire -> Sample -> CellProcessing -> NucleicAcidProcessing
Repertoire -> Sample -> CellProcessing -> NucleicAcidProcessing
                        CellProcessing -> NucleicAcidProcessing

Now we are at a real impasse. There doesn't seem to be an entity with a unique identifier that will always have a one-to-one relationship with Rearrangement.

The only solution I can think of is to take the "sample" blob in the repertoire, and give it an identifier (say) sample_blob_id and that identifier will need to be put into each Rearrangement record.

        sample:
            type: array
            items:
                sample_blob_id:
                    type: string
                    description: uniquely identify this combination of entities
                allOf:
                    - $ref: '#/Sample'
                    - $ref: '#/CellProcessing'
                    - $ref: '#/NucleicAcidProcessing'
                    - $ref: '#/SequencingRun'

Even so, this long description is still assuming a one-to-one relationship between Rearrangement and Sample is valid. Once we allow collapsing of duplicate sequences, that assumption fails and a single sample_blob_id isn't sufficient.

We could then "save" it by allowing multiple sample_blob_id values to be stored in Rearrangement, thus indicating the set of Samples, but then how are you going to decide which Sample metadata values to use?
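
A hedged sketch of the proposed sample_blob_id, assuming sample entries are stored in a dict keyed by that identifier and that a Rearrangement may carry either one value or a list of values. `samples_for` and the data are illustrative only.

```python
# Hypothetical sample entries keyed by the proposed sample_blob_id.
samples = {
    "BLOB1": {"sample_id": "S1", "tissue": "blood"},
    "BLOB2": {"sample_id": "S2", "tissue": "spleen"},
}


def samples_for(rearrangement, samples):
    """Return the sample entries named by a rearrangement's sample_blob_id value(s)."""
    ids = rearrangement["sample_blob_id"]
    if isinstance(ids, str):
        ids = [ids]  # single identifier: the one-to-one case
    return [samples[i] for i in ids]


single = samples_for({"sequence_id": "SEQ1", "sample_blob_id": "BLOB1"}, samples)
collapsed = samples_for(
    {"sequence_id": "SEQ2", "sample_blob_id": ["BLOB1", "BLOB2"]}, samples
)
# A collapsed sequence can point at conflicting metadata values.
tissues = {s["tissue"] for s in collapsed}
```

With one identifier the lookup yields exactly one sample entry. With several, the caller is left holding two different tissue values and no rule for choosing between them, which is the open question raised above.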

schristley commented 5 years ago

I'm not seeing a solution that unambiguously works for all the different possibilities. We might have to punt and say that if anybody wants this mapping, they should add their own identifier at an appropriate place in the metadata structure in order to give the mapping they desire.

While sample_blob_id mostly works, I'm hesitant to enforce these requirements on all studies and all rearrangements, just to support a few studies.

bcorrie commented 5 years ago

Ok, I understand. This is a problem for those uncommon studies which combine multiple samples together into a repertoire. For common studies which have a single sample per repertoire, this isn't an issue because there is only one sample, so no ambiguity. Do you agree?

Yes, I think so... If the array of samples consists of a single sample, then this is not an issue. It seems to me that if this is the normal case there should be a way to represent that more effectively, but I don't know what that is yet...

I am still absorbing the rest and we are going to discuss in our group meeting today.

bcorrie commented 5 years ago

Hi @schristley sorry I haven't gotten back to you on this...

We are still struggling with this a bit. First maybe a couple of clarifications for your Case 5 and Case 6

CASE 5: No replication, samples from different repertoires are multiplexed into the same SequencingRun, so all data is different except for the SequencingRun, which is identical for all:

Repertoire1 -> Sample1 -> CellProcessing1 -> NucleicAcidProcessing1 -> SequencingRun1
Repertoire2 -> Sample2 -> CellProcessing2 -> NucleicAcidProcessing2
Repertoire3 -> Sample3 -> CellProcessing3 -> NucleicAcidProcessing3

Is the above the correct interpretation? That is, in this case there are three different repertoires, each with a single, unique sample, with each of those samples having different CellProcessing and different NucleicAcidProcessing, but all of these are processed in a single SequencingRun?

CASE 6: Three repertoires are multiplexed together. One repertoire is replicated at the CellProcessing level so NucleicAcidProcessing is different but then multiplexed into the same SequencingRun

Repertoire1 -> Sample1 -> CellProcessing1 -> NucleicAcidProcessing1 -> SequencingRun1
Repertoire2 -> Sample2 -> CellProcessing2 -> NucleicAcidProcessing2
Repertoire3 -> Sample3 -> CellProcessing3 -> NucleicAcidProcessing3
                          CellProcessing4 -> NucleicAcidProcessing4

In this case it is the same as the above, but Sample3 has two different types of cell processing applied with each of those having a different NucleicAcidProcessing, but again, all are processed through a single SequencingRun.

Is that correct?

bcorrie commented 5 years ago

A follow up, and maybe a counter-example, from your question @schristley

In your example, you always seem to want to go through SoftwareProcessing, which I consider odd. I don't actually understand what "foreign keys" you would use to go from a Rearrangement directly to SoftwareProcessing and onward. The quicker approach is to go right to the repertoire where you have direct access to the study, subject and samples.

Below I will stick to the term SoftwareProcessing and software_processing_id as per Issue #188.

I see issues approaching things from the bottom up - starting at a single rearrangement (an annotated sequence) in an AIRR TSV file. It has a v_call, d_call, j_call. These features were assigned by a certain SoftwareProcessing pipeline. So my specific rearrangement has an identifier (software_processing_id) that points to that SoftwareProcessing entity in the metadata. This single rearrangement also has a repertoire_id that points to the associated Repertoire entity in the metadata.

In the most basic complete case, there is a Metadata file with single repertoire in it and an AIRR TSV file that contains a single rearrangement. This would look like the following (I think).

Repertoire1:
    Study1
    Subject1
    Array:
        Sample1 -> CellProcessing1 -> NucleicAcidProcessing1 -> SequencingRun1
    Array:
        SoftwareProcessing1

Rearrangement AIRR TSV file has:
    v_call, d_call, j_call, Repertoire1, SoftwareProcessing1

OK, this is all good... 8-)

Now consider the following case:

You extend the above such that you have a single Repertoire with a single sample that has different cell processing for different types of B-cells. I think this is exactly what the Repertoire is intended for, correct? You then have an AIRR TSV file with a single rearrangement from each CellProcessing process (one rearrangement per B-cell subset). This would give:

Repertoire1:
    Study1
    Subject1
    Array:
        Sample1 -> CellProcessing1 -> NucleicAcidProcessing1 -> SequencingRun1
        Sample1 -> CellProcessing2 -> NucleicAcidProcessing2 -> SequencingRun1
        Sample1 -> CellProcessing3 -> NucleicAcidProcessing3 -> SequencingRun1
    Array:
        SoftwareProcessing1

Rearrangement AIRR TSV file has:
    v_call, d_call, j_call, Repertoire1, SoftwareProcessing1
    v_call, d_call, j_call, Repertoire1, SoftwareProcessing1
    v_call, d_call, j_call, Repertoire1, SoftwareProcessing1

In this instance, I have no way of knowing whether a given rearrangement in the AIRR TSV file is from CellProcessing1, CellProcessing2, or CellProcessing3 because the only identifiers that I have in my rearrangement data in the AIRR TSV file are for the Repertoire (which is ambiguous in this case) or the SoftwareProcessing (which also can't be linked to an individual CellProcessing).

How do you handle this ambiguity with the current structure? The only way I can see to do this is to have three Repertoire objects...

schristley commented 5 years ago

How do you handle this ambiguity with the current structure?

Use the sequence_id, that is the only guaranteed identifier in both the raw sequencing files and the rearrangements and is explicitly there to form this relationship across SoftwareProcessing. Then take the file name and the repertoire_id and search the SequencingRun objects in the Repertoire for the appropriate raw sequence file name, and you got everything.

Now you threw in sample multiplexing in your example with all SequencingRun1, not sure if you meant to do that. If you meant Run1, Run2, Run3 then my above works. If the samples are multiplexed then you need the barcode and have to also search at NucleicAcidProcessing, but MiAIRR punted on barcodes so...
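
The file-name lookup described above might look roughly like this. It assumes a precomputed index from sequence_id to raw file name (building it is the costly scan of the raw files) and a hypothetical `raw_file` key on each sample entry; neither is a field defined in this thread.

```python
# Hypothetical layout: each sample entry records the raw file of its run.
repertoire = {
    "repertoire_id": "REP1",
    "sample": [
        {"sample_id": "S1", "sequencing_run_id": "RUN1", "raw_file": "run1.fastq"},
        {"sample_id": "S1", "sequencing_run_id": "RUN2", "raw_file": "run2.fastq"},
    ],
}

# Hypothetical index from sequence_id to the raw file the read came from.
read_to_file = {"SEQ1": "run1.fastq", "SEQ2": "run2.fastq"}


def sample_for_sequence(sequence_id, repertoire, read_to_file):
    """Find the single sample entry whose raw file contains the read."""
    file_name = read_to_file[sequence_id]
    matches = [s for s in repertoire["sample"] if s["raw_file"] == file_name]
    if len(matches) != 1:
        # Multiplexed samples share a file, so the file name alone is not enough.
        raise LookupError(f"{len(matches)} sample entries use {file_name}")
    return matches[0]


entry = sample_for_sequence("SEQ2", repertoire, read_to_file)
```

This works when each run has its own file (Run1, Run2, Run3). If several sample entries share one file, as with multiplexing, the function raises instead of guessing, since a barcode would be needed to separate them.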

schristley commented 5 years ago

but all of these are processed in a single SequencingRun?

Yes, it is correct. It is called sample multiplexing. Barcode nucleotide sequences are added (to the DNA) during NucleicAcidProcessing, then during SoftwareProcessing the raw sequencing files are demultiplexed into individual files by searching for those barcodes. Pre-processing tools like VDJPipe and pRESTO do that. It is a common sequencing technique, not just AIRR-seq.
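
For readers unfamiliar with demultiplexing, here is a toy sketch of the simplest case: an exact barcode at the start of each read. It is not how VDJPipe or pRESTO are implemented; the barcodes, reads and the `demultiplex` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical barcodes and reads; real schemes can be far more involved.
barcodes = {"AAAA": "Sample1", "CCCC": "Sample2"}
reads = [
    ("SEQ1", "AAAAGGTTACGT"),
    ("SEQ2", "CCCCGGTTACGT"),
    ("SEQ3", "TTTTGGTTACGT"),  # no known barcode
]


def demultiplex(reads, barcodes):
    """Assign each read to a sample by exact match on a leading barcode."""
    by_sample = defaultdict(list)
    for sequence_id, sequence in reads:
        for barcode, sample in barcodes.items():
            if sequence.startswith(barcode):
                # Strip the barcode so only the insert remains.
                by_sample[sample].append((sequence_id, sequence[len(barcode):]))
                break
        else:
            # No barcode matched: keep the read aside, untouched.
            by_sample["unassigned"].append((sequence_id, sequence))
    return dict(by_sample)


demuxed = demultiplex(reads, barcodes)
```

The mapping from read to sample exists only through the barcode table, which is why metadata that omits barcodes cannot reproduce it.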

Unfortunately, that is the simplest barcode case, and the barcoding schemes can get quite complicated, and MiAIRR does not require that information. MiAIRR has a standard practice that Christian has elaborated in #98 but it's incomplete.

schristley commented 5 years ago

We should get a mini consensus on this if going forward, @javh @bcorrie @laserson @bussec @scharch

If we need to be able to perform that mapping (a specific rearrangement entry to its sample entry), and I agree this would be a nice thing to have, then we need a sample_processing_id identifier to make it a one-to-one mapping.

I'm good with this, so long as it is required=false and nullable=true with a default of null. That way studies that don't need it, won't waste disk space providing it.

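# maybe provide this as a get_sample() in the reference library
# (assumes rep is a dict loaded from the metadata file)

def get_sample(rep, sample_processing_id=None):
    if len(rep["sample"]) == 1:
        # majority of studies
        return rep["sample"][0]
    # use sample processing id
    for sample in rep["sample"]:
        if sample["sample_processing_id"] == sample_processing_id:
            return sample
    raise KeyError(sample_processing_id)

Repertoire1: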
    Study1
    Subject1
    Array:
        SampleProcessing1 -> Sample1 -> CellProcessing1 -> NucleicAcidProcessing1 -> SequencingRun1
        SampleProcessing2 -> Sample1 -> CellProcessing2 -> NucleicAcidProcessing2 -> SequencingRun1
        SampleProcessing3 -> Sample1 -> CellProcessing3 -> NucleicAcidProcessing3 -> SequencingRun1
    Array:
        SoftwareProcessing1

Rearrangement AIRR TSV file has:
    v_call, d_call, j_call, Repertoire1, SoftwareProcessing1, SampleProcessing1
    v_call, d_call, j_call, Repertoire1, SoftwareProcessing1, SampleProcessing2
    v_call, d_call, j_call, Repertoire1, SoftwareProcessing1, SampleProcessing3

scharch commented 5 years ago

I confess, this stuff makes my head spin, and I'm not sure I ever come to the same conclusions about the best way to do it two days in a row. But I thought, as @schristley said above, that the link into the metadata would be based exclusively on sequence_id. I can see the flaws in this, but I'm also generally averse to adding more fields to the Rearrangements tsv. I guess @schristley's solution seems ok; I certainly don't have any better ideas...

bcorrie commented 5 years ago

@scharch I hear you - me too 8-)

At the AIRR TSV level, although sequence_id is necessary, I don't think it is sufficient in that:

  1. If a specific sequence_id is processed by two different SoftwareProcessing pipelines (e.g. igblast and mixcr) then we need to be able to differentiate them in an AIRR TSV file with a SoftwareProcessingID.

  2. If we want to be able to identify which SampleProcessing regime was applied to get to a specific SequencingRun we need the SampleProcessingID. Although it is possible to go from a sequence_id in an AIRR TSV file to a specific SequencingRun entity that came from a specific sample processing regime, without the SampleProcessingID in the AIRR TSV the only way to determine that link is to search the entire set of fasta/fastq files for the sequence_id (correct me if I am wrong).

  3. If we want to provide a mechanism to tell which Repertoire (the abstract organizational unit of analysis that is defined by the researcher) a single line in an AIRR TSV file comes from, then we need a RepertoireID.

I think if we don't have all three, and given our current metadata structure, we lose the ability to differentiate at one of these levels. At the same time, it appears there are arguments both that the current structure is desirable and that there is a need for identifying things from the AIRR TSV file at each of these levels.

Given the level of complexity of the metadata structure that we are capturing and the relatively small amount of added information at the rearrangement level (three identifiers, RepertoireID, SampleProcessingID, and SoftwareProcessingID), having those three fields in a rearrangement seems to make sense to me. It certainly gives us the most flexibility and power moving forward, and the lack of any one of them seems to restrict things substantially?
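
A sketch of how the three identifiers could be used together, assuming snake_case column names (repertoire_id, sample_processing_id, software_processing_id) as stand-ins for the IDs named above and a dict-based metadata layout invented for illustration. The sample identifier is treated as optional, in line with the required=false, nullable=true suggestion earlier in the thread.

```python
# Hypothetical metadata: sample and software entries keyed by their IDs.
repertoires = {
    "Repertoire1": {
        "sample": {
            "SampleProcessing1": {"cell_subset": "naive B cell"},
            "SampleProcessing2": {"cell_subset": "memory B cell"},
        },
        "software": {
            "SoftwareProcessing1": {"software_versions": "igblast 1.12.0"},
        },
    }
}


def resolve(row, repertoires):
    """Map one rearrangement row to its sample and software entries."""
    rep = repertoires[row["repertoire_id"]]
    software = rep["software"][row["software_processing_id"]]
    sample_key = row.get("sample_processing_id")
    if sample_key is None:
        # Optional field left null: only unambiguous with a single sample.
        if len(rep["sample"]) != 1:
            raise LookupError("sample_processing_id needed to pick a sample")
        (sample,) = rep["sample"].values()
    else:
        sample = rep["sample"][sample_key]
    return sample, software


row = {
    "repertoire_id": "Repertoire1",
    "software_processing_id": "SoftwareProcessing1",
    "sample_processing_id": "SampleProcessing2",
}
sample, software = resolve(row, repertoires)
```

Dropping any one of the three keys from the row removes one level of resolution: without the sample identifier, a repertoire with several sample entries can no longer be narrowed to one.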

scharch commented 5 years ago

Number 2 seems most convincing to me. As far as 1 and 3, I still prefer that 1 AIRR TSV + 1 metadata YAML = 1 Repertoire object, however the researcher defines that. It is then her responsibility to make sure the sequence_ids in that object are unique, even if they may have originally come from different sources that could theoretically have had overlapping ID sets. But since we've moved away from that simple schema for flexibility, yes, I think what you and Scott are proposing makes sense.

javh commented 5 years ago

I tend to agree with @scharch as far as "1 AIRR TSV + 1 metadata YAML = 1 Repertoire object".

If you want to deviate from that, I'm not seeing a solution that doesn't involve some sort of mapping table (eg, list of sequence ids at the Repertoire, column of repertoire_ids at the Rearrangement level, etc).

schristley commented 5 years ago

If you want to deviate from that, I'm not seeing a solution that doesn't involve some sort of mapping table (eg, list of sequence ids at the Repertoire, column of repertoire_ids at the Rearrangement level, etc).

You are correct that in the general case, we need a mapping table but that's probably too much. I'd like to see an actual study and use case that requires that before adding that complexity.

Just as with sequence_id in rearrangements, when data processing collapses sequences, we don't require a list of all the sequence_ids for the input sequences to maintain a true mapping from output -> input. We settle with just one sequence_id and users need to create their own mapping if they really need it.

I think the same applies here. Having a single sample_processing_id gives a simple 1-to-1 mapping that is sufficient for many cases but not in general.

schristley commented 5 years ago

This has been resolved in the AIRR data model.

bcorrie commented 4 years ago

We don't have a sample_processing_id in our current schema (see #246), reopening to get clarity around this...

schristley commented 4 years ago

Please open a new issue if you want to spec out a sample_processing_id. The original issue of repertoire_id and data_processing_id was resolved.