airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

sample_processing_id in Rearrangements and Repertoire spec #249

Closed bcorrie closed 5 years ago

bcorrie commented 5 years ago

In https://github.com/airr-community/airr-standards/issues/181#issuecomment-478657199

We discussed having three _id fields at the rearrangement level to disambiguate mapping rearrangements back to different combinations of repertoires, samples, and data processing.

At the AIRR TSV level, although sequence_id is necessary, I don't think it is sufficient in that:

  1. If a specific sequence_id is processed by two different SoftwareProcessing pipelines (e.g. igblast and mixcr), then we need to be able to differentiate them in an AIRR TSV file with a SoftwareProcessingID.
  2. If we want to be able to identify which SampleProcessing regime was applied to get to a specific SequencingRun, we need the SampleProcessingID. Although it is possible to go from a sequence_id in an AIRR TSV file to a specific SequencingRun entity that came from a specific sample processing regime, without the SampleProcessingID in the AIRR TSV the only way to determine that link is to search the entire set of fasta/fastq files for the sequence_id (correct me if I am wrong).
  3. If we want to provide a mechanism to tell which Repertoire (the abstract organizational unit of analysis that is defined by the researcher) a single line in an AIRR TSV file comes from, then we need a RepertoireID.

I think that, given our current metadata structure, if we don't have all three then we lose the ability to differentiate at one of these levels. At the same time, it appears there are arguments both that the current structure is desirable and that there is a need for identifying things from the AIRR TSV file at each of these levels.

Given the complexity of the metadata structure that we are capturing and the relatively small amount of added information at the rearrangement level (three identifiers: RepertoireID, SampleProcessingID, and SoftwareProcessingID), having those three fields in a rearrangement makes sense to me. It certainly gives us the most flexibility and power moving forward, and the lack of any one of them seems to restrict things substantially.

There seemed to be general consensus in #181 that this was reasonable, but we currently only have repertoire_id (RepertoireID from above) and data_processing_id (SoftwareProcessingID). SampleProcessingID was not added to the spec.

Reopening to revisit...

bcorrie commented 5 years ago

The problem that this is trying to solve is raised here: https://github.com/airr-community/airr-standards/issues/246#issuecomment-531381149

In the above issue we have the following example:

repertoire:
  repertoire_id: some-id
  sample:
    - sample_id: blood pre-treatment
    - sample_id: blood post-treatment
    - sample_id: tissue post-treatment
      disease_state_sample: cancer
  data_processing:
    - data_processing_id: 1
      primary_annotation: true

Consider the case where you have the above Repertoire file that describes some data of interest and an AIRR TSV with the accompanying data. You search the AIRR TSV file, and find a junction_aa of interest that you want to know more about. You get this information for the rearrangement that you found in the AIRR TSV:

sequence_id, repertoire_id, data_processing_id, v_call, d_call, j_call, junction_aa, ...

You can then look up the repertoire_id in the Repertoire file to find out which repertoire this rearrangement came from. Let's assume repertoire_id == some-id and data_processing_id == 1, as in the above repertoire.

In this case, you have no way of telling if the rearrangement you found is from the "blood pre-treatment", "blood post-treatment", or "tissue post-treatment" sample.

Is this not a problem?

scharch commented 5 years ago

I'd argue that it's not a problem, but a decision by the person who merged those samples into a single repertoire that there was no need to maintain the distinctions between them. Does that make the data less than infinitely reusable? Perhaps, but the whole point of MiAIRR is that the raw data is well annotated and you can always go back to that if you need to. And you could never completely avoid that necessity, anyway. And I still feel, as I said in #181, that making the rearrangement schema more complicated is a bad idea.

bcorrie commented 5 years ago

I have added a sample_processing_id field to the Rearrangement schema in branch issue-249 and added the same sample_processing_id to the Repertoire object. The sample_processing_id should be unique within the Repertoire, similar to the way data_processing_id is unique. This should be sufficient to solve this problem from a disambiguation perspective and is a relatively minor change.
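For concreteness, here is a minimal sketch of what the added field definition could look like in the Rearrangement schema (illustrative only; the exact description and placement in the issue-249 branch may differ):

sample_processing_id:
  type: string
  description: >
    Identifier of the sample processing (an element of the sample array in
    the Repertoire object) from which this rearrangement was derived. Should
    be unique within the Repertoire, analogous to data_processing_id.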

The repertoire record is now:

repertoire:
  repertoire_id: some-id
  sample:
    - sample_processing_id: 1
      sample_id: blood pre-treatment
    - sample_processing_id: 2
      sample_id: blood post-treatment
    - sample_processing_id: 3
      sample_id: tissue post-treatment
      disease_state_sample: cancer
  data_processing:
    - data_processing_id: 1
      primary_annotation: true

The rearrangement record is now:

sequence_id, repertoire_id, sample_processing_id, data_processing_id, v_call, d_call, j_call, junction_aa, ...

One can then find the specific sample within the repertoire that the rearrangement came from. Also note that this makes the treatment of data_processing and sample processing symmetric.

Going in the other direction, when performing a search for all of the rearrangements from a specific sample, it is a bit more complicated. With this _id structure, one needs to search at the rearrangement level for a repertoire_id, a data_processing_id, and a sample_processing_id to ensure that only the rearrangements from a single sample processing and a single data processing regime are returned.
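For illustration only (all values are hypothetical), a handful of rows under the repertoire above might look like this; filtering on repertoire_id == some-id, sample_processing_id == 3, and data_processing_id == 1 would return just the rearrangements derived from the "tissue post-treatment" sample under that data processing:

sequence_id, repertoire_id, sample_processing_id, data_processing_id, ...
seq-1, some-id, 1, 1, ...
seq-2, some-id, 2, 1, ...
seq-3, some-id, 3, 1, ...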

schristley commented 5 years ago

I guess the actual field should be sample_id, as it's referring to a specific sample in the repertoire. As that's already a unique identifier in the repertoire, there doesn't seem to be a need to add another one.

I'm fine with adding it but it needs to be an optional field. In general, this is an n-to-n relation (input read to output rearrangement) due to read collapsing, so it cannot be required, as that's an expensive relation to maintain.

So if it is added, then it needs to be spec'ed, with documentation about how and under what circumstances it is a valid mapping. Is it the complete n-to-n relation, or only when there is a 1-to-1 mapping between reads and rearrangements?

schristley commented 5 years ago

I'd argue that it's not a problem, but a decision by the person who merged those samples into a single repertoire that there was no need to maintain the distinctions between them.

I agree and would add one bit of philosophy as well. One major point of the repertoire was to break the one-sample/one-file/one-analysis assumption. Many tools make the assumption that a sample is the level of analysis, when in fact it isn't. The repertoire breaks that link, and thus allows biological sampling to be independent of how those samples are analyzed.

bcorrie commented 5 years ago

I'd argue that it's not a problem, but a decision by the person who merged those samples into a single repertoire that there was no need to maintain the distinctions between them. Does that make the data less than infinitely reusable? Perhaps, but the whole point of MiAIRR is that the raw data is well annotated and you can always go back to that if you need to. And you could never completely avoid that necessity, anyway. And I still feel, as I said in #181, that making the rearrangement schema more complicated is a bad idea.

For me this opens the door to researchers doing bad things and encourages people to structure their studies in a way that is hard to reuse. I would argue that we should be providing a format that encourages as much reuse as possible...

A researcher may want to intentionally group those rearrangements together, but this structure makes it trivial for a researcher to accidentally group the rearrangements together as well... The way we have structured the file format, the Repertoire is the key object. A Repertoire must be associated with a single subject but can have many samples. Therefore, we are "tempting" the researcher to put all samples from a single subject into a single repertoire. As soon as a researcher puts multiple samples in a Repertoire, they are immediately saying that they don't want to maintain the distinctions between rearrangements that are associated with the samples in that Repertoire (as there is currently no way to do that). It feels like the general case is that you will almost always want to map a specific rearrangement in a TSV file back to a biological sample and how it was processed (both sample processing and data processing).

Adding a sample_processing_id at the repertoire makes it trivially possible to differentiate sample processing within the repertoire. You don't have to differentiate them if you don't want to, but you can differentiate them if you do want to...

Adding a single new _id field at the rearrangement level also trivially solves this, and given that we have 50+ fields already, I don't think adding one more _id field (in particular given its value) is a significant issue. If it is optional, then you don't have to provide it in a given AIRR TSV file. But having it makes it possible to resolve the above conflicts...

bcorrie commented 5 years ago

I guess the actual field should be sample_id, as it's referring to a specific sample in the repertoire. As that's already a unique identifier in the repertoire, there doesn't seem to be a need to add another one.

That is what I thought as well initially, but I don't think that is sufficient. I think the correct analogy is that we are creating an id for sample_processing just like we do for data_processing. For example, if a single biological sample is cell sorted to create multiple cell subsets that are sequenced independently (or at least the researcher wants to differentiate between the rearrangements from those two cell subsets), then the sample_processing_id needs to differentiate between those two.
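As a hypothetical sketch (identifiers and values invented, with cell_subset shown as a plain string for brevity), a repertoire covering two sorted subsets from the same biological sample could look like this, with a shared sample_id but distinct sample_processing_id values:

repertoire:
  repertoire_id: another-id
  sample:
    - sample_processing_id: 1
      sample_id: PBMC-1
      cell_subset: naive B cell
    - sample_processing_id: 2
      sample_id: PBMC-1
      cell_subset: memory B cell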

schristley commented 5 years ago

Adding a sample_processing_id at the repertoire makes it trivially possible

It is not trivial. Your design does not handle the n-to-n relation between reads and rearrangements, which is necessary to know the sample that rearrangement is from. Give us a design that handles all scenarios, otherwise it is trivially possible only for trivial data processing.

schristley commented 5 years ago

That is what I thought as well initially, but I don't think that is sufficient. I think the correct analogy is that we are creating an id for sample_processing just like we do for data_processing. For example, if a single biological sample is cell sorted to create multiple cell subsets that are sequenced independently (or at least the researcher wants to differentiate between the rearrangements from those two cell subsets), then the sample_processing_id needs to differentiate between those two.

Ah yeah, that's right.

bcorrie commented 5 years ago

I'm fine with adding it but it needs to be an optional field. In general, this is an n-to-n relation (input read to output rearrangement) due to read collapsing, so it cannot be required, as that's an expensive relation to maintain.

By optional, you mean that it doesn't need to be in the file to be compliant. I would agree with that on the Rearrangement level - many cases would not have more than one sample in a repertoire, so the repertoire_id would be sufficient. I have changed the description to that effect.

I think that is true at the Repertoire level as well. This is the same "optional level" as data_processing_id. It is not required and therefore may or may not be provided for the Repertoire.

I do worry about making it easy for researchers to make mistakes if they are using the Repertoire format to describe a study. What does it mean if a researcher has a Repertoire with more than one data_processing object and no data_processing_id? Is that intentional or a mistake? Making it required and nullable reduces the chances of it being a mistake, as making it null requires an explicit action... The same holds true for sample_processing_id. Having one and making it null means that the researcher is explicitly saying that the samples are grouped together and that it is not necessary and/or desirable to differentiate the rearrangements by sample.
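As a rough sketch of what that explicit choice could look like in the metadata (hypothetical values), if sample_processing_id were required and nullable:

  sample:
    - sample_processing_id: null
      sample_id: blood pre-treatment
    - sample_processing_id: null
      sample_id: blood post-treatment

Here the null values record a deliberate decision not to differentiate rearrangements by sample, rather than an omission.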

javh commented 5 years ago

I think there's some value in having sample_id at the Rearrangement level as well, because it's quite common to merge files and annotate them with a custom column that defines the sample of origin (sample, donor, etc). Just having a reserved name for that optional, but common, field seems like a good idea.

Though... I'm starting to feel like there has to be a better mechanism for defining metadata fields that are copied down to the Rearrangement level as 1-to-many identifiers. Do we need to have all these as part of the Rearrangement spec? Or can we just set up some rules to facilitate propagating fields "downward" (e.g., id fields need unique names across the specs)?

(Seems like this also applies to fields like germline_database that we moved "up".)

bcorrie commented 5 years ago

Adding a sample_processing_id at the repertoire makes it trivially possible

It is not trivial. Your design does not handle the n-to-n relation between reads and rearrangements, which is necessary to know the sample that rearrangement is from. Give us a design that handles all scenarios, otherwise it is trivially possible only for trivial data processing.

Sorry, not following what you are asking for here... And I'm not sure I am promising to solve the general case... What I am trying to get to is an improvement on what we have now... 8-)

What I said was:

Adding a sample_processing_id at the repertoire makes it trivially possible to differentiate sample processing within the repertoire.

First, I will say it is trivially possible if the researcher does it... They don't have to, but this at least makes it possible. In our current spec, this is not possible.

What adding a sample_processing_id at the Repertoire level trivially solves is the ability for a researcher to uniquely identify a set of sample_processing regimes within a Repertoire. That is, each element in the sample array in the Repertoire object has a unique ID within that repertoire so that it can be referred to externally at the rearrangement level. If the rearrangement also has a sample_processing_id, then that, combined with the repertoire_id, maps back to a specific sample_processing regime in a specific repertoire. This solves the 1-n relationship between a sample_processing regime in a specific repertoire and the rearrangements that were produced in the sequencing run that the sample processing regime describes.

That is the improvement that I am suggesting at the moment...

schristley commented 5 years ago

By optional, you mean that it doesn't need to be in the file to be compliant.

Correct. But also optional in that researchers should not rely on being able to do that mapping. Because your spec only handles a special case, tools shouldn't assume that they can always get back to the sample processing for a rearrangement.

schristley commented 5 years ago

each element in the sample array in the Repertoire object has a unique ID within that repertoire so that it can be referred to externally...

That's reasonable and useful, so I'm good with that.

schristley commented 5 years ago

Sorry, not following what you are asking for here...

You haven't handled the case of when multiple reads (potentially from multiple samples) get collapsed into a single rearrangement object.

bcorrie commented 5 years ago

That is the improvement that I am suggesting at the moment...

I suppose I should point out for completeness that by having a repertoire_id, a sample_processing_id, and a data_processing_id we also have the ability to address the 1-N problem of going from a specific repertoire, with a specific sample_processing regime and specific data_processing object to a set of N rearrangements.

schristley commented 5 years ago

I suppose I should point out for completeness that by having a repertoire_id, a sample_processing_id, and a data_processing_id we also have the ability to address the 1-N problem of going from a specific repertoire, with a specific sample_processing regime and specific data_processing object to a set of N rearrangements.

Partially, but incompletely. It does not handle that for all cases. Unlike repertoire_id and data_processing_id, which do guarantee that you get all rearrangements that are processed together in that repertoire, adding sample_processing_id does not guarantee that you get all rearrangements from the same sample for that repertoire.

bcorrie commented 5 years ago

Sorry, not following what you are asking for here...

You haven't handled the case of when multiple reads (potentially from multiple samples) get collapsed into a single rearrangement object.

I think my suggested solution is as flexible as the current one; it just gives us more power. It does not prevent the researcher from doing what you had in your previous example.

If a single rearrangement object (your collapsed read) doesn't have a sample_processing_id, and the repertoire_id that the rearrangement does refer to has multiple samples in the sample array, then this implies that the rearrangement comes from multiple samples. This takes care of that case, does it not? This works exactly the same way as before.
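For example (hypothetical row), a collapsed rearrangement with an empty sample_processing_id would read as belonging to the repertoire as a whole, i.e. potentially derived from reads in any of its samples:

sequence_id, repertoire_id, sample_processing_id, data_processing_id, ...
seq-9, some-id, , 1, ...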

bcorrie commented 5 years ago

I suppose I should point out for completeness that by having a repertoire_id, a sample_processing_id, and a data_processing_id we also have the ability to address the 1-N problem of going from a specific repertoire, with a specific sample_processing regime and specific data_processing object to a set of N rearrangements.

Partially, but incompletely. It does not handle that for all cases. Unlike repertoire_id and data_processing_id, which do guarantee that you get all rearrangements that are processed together in that repertoire, adding sample_processing_id does not guarantee that you get all rearrangements from the same sample for that repertoire.

Yes, I agree, but for me that is not the intent... sample_processing_id is about being able to differentiate between how samples are processed in a specific repertoire (the entities in the sample array in Repertoire) in preparation for data_processing.

I think the query you are talking about is actually quite complex... The data "from a single sample" may be described across several repertoires and in many different ways, depending on how the researcher chooses to do this...

schristley commented 5 years ago

Yes, I agree, but for me that is not the intent... sample_processing_id is about being able to differentiate between how samples are processed in a specific repertoire (the entities in the sample array in Repertoire) in preparation for data_processing.

I'm okay with that. I was just concerned because all the discussion/justification was about mapping rearrangements back to samples, which is problematic.

If we are mainly talking about giving that sample processing chunk in the repertoire metadata a unique identifier then I think that's a useful thing to have.

scharch commented 5 years ago

This came up a bit on the datarep call, but I am ok adding sample_processing_id as a reserved, optional field name along the lines suggested by @javh above.

schristley commented 5 years ago

This came up a bit on the datarep call, but I am ok adding sample_processing_id as a reserved, optional field name along the lines suggested by @javh above.

If we put it in the repertoire metadata then it essentially becomes required like data_processing_id because its purpose is to differentiate those sample processing objects in the sample array. But it's optional in the rearrangement schema.

scharch commented 5 years ago

Yes, but I thought we were only talking about the rearrangement schema anyway.

bcorrie commented 5 years ago

Yes, but I thought we were only talking about the rearrangement schema anyway.

Not quite, because this is an _id that links the Rearrangement object to a Repertoire object. The sample_processing_id is "defined" in the Repertoire by the researcher, and the researcher can then link a specific Rearrangement object to how the sample was processed for a specific Repertoire by specifying the sample_processing_id and the repertoire_id.

schristley commented 5 years ago

@bcorrie I committed some changes to the branch. Do you use a swagger editor? I use this one, and it wasn't showing sample_processing_id when looking at the return types like repertoire_list. The OpenAPI v2 spec doesn't really say whether it's okay to mix allOf with other fields; you would think it would be, but that editor wasn't showing it. Though the spec does say that allOf needs to be a list of objects. I had to put SampleProcessing into its own object, like DataProcessing, in order for sample_processing_id to be recognized.
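For anyone following along, the two schema shapes being compared are roughly as follows (the $ref targets and field lists are placeholders, not the actual definitions in the spec):

# Shape 1: allOf mixed with sibling properties on the same schema object.
# Valid JSON Schema, but some renderers appear to drop the sibling
# properties when displaying the composed model.
sample_item:
  allOf:
    - $ref: '#/definitions/Sample'
    - $ref: '#/definitions/CellProcessing'
  properties:
    sample_processing_id:
      type: string

# Shape 2: everything, including the new identifier, folded into a named
# object built purely from allOf, mirroring how DataProcessing is defined.
SampleProcessing:
  allOf:
    - $ref: '#/definitions/Sample'
    - $ref: '#/definitions/CellProcessing'
    - type: object
      properties:
        sample_processing_id:
          type: string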

bcorrie commented 5 years ago

@schristley I use the same swagger editor... And I noticed that as well, but when I went to check the allOf without the sample_processing_id, I thought I noticed the same thing... So I thought it was a problem with allOf, and that the problem was always there with the swagger editor.

I just went back to check though, and sure enough, on the master branch with the allOf the swagger editor is working as expected (it shows the entire response). So I am not sure what I was looking at when I double checked...

Do you have a solution or do you want me to play with it???

schristley commented 5 years ago

@bcorrie check out my schema changes on the branch. I added a SampleProcessing object; it seems odd, but it was the only way I could get it to work.

bcorrie commented 5 years ago

@schristley in swagger I am still seeing a "sample" : [ null ], element for the sample response in /repertoire/{repertoire_id}

This is on the issue-249 branch still?

bcorrie commented 5 years ago

When I expand the object in the Models section at the bottom of the swagger editor, it has sample_processing_id in the repertoire_list.

When I look at the example response in the API entry point section, it does not have it...

schristley commented 5 years ago

Yeah, I was looking in the Models section when testing. The response object seems to be an example object with example data stuck in; maybe it doesn't know how to handle allOf for the example?

schristley commented 5 years ago

hmm, it does seem to work on the API in the master branch.

schristley commented 5 years ago

Well, that's weird. I cleared the editor, plugged in the raw URL for the master branch, and it shows the example sample; then I cleared the editor again, plugged in the raw URL for the issue-249 branch, and now it's showing an example response.

schristley commented 5 years ago

If I click browser refresh, the sample goes back to null, so I think there is some flakiness in the editor.

bcorrie commented 5 years ago

OK, very weird - as I just did the same...

bcorrie commented 5 years ago

I am unable to reproduce a set of steps in the swagger editor that will actually show the response with sample_processing_id, even though I think we are both convinced that we have actually seen it working 8-)

schristley commented 5 years ago

Yeah, it's still making me wonder whether we need that SampleProcessing object or whether your original schema was sufficient...

bcorrie commented 5 years ago

It is odd: in the swagger editor the Model for repertoire_list is correct in the Models section, but if you look in the Model view of the actual API response it is missing. So I think it is a swagger editor issue.

Maybe try the original schema and check to see if it works in the Model section at the bottom of the swagger editor??? If that works, maybe we can consider it working (unless your middleware has the same issue)?

schristley commented 5 years ago

Maybe try the original schema and check to see if it works in the Model section at the bottom of the swagger editor??? If that works, maybe we can consider it working (unless your middleware has the same issue)?

I did, and sample_processing_id wasn't showing up, which is why I changed the schema.

I haven't tried loading it up into the API service yet.