Closed bcorrie closed 5 years ago
The problem that this is trying to solve is raised here: https://github.com/airr-community/airr-standards/issues/246#issuecomment-531381149
In the above issue we have the following example:
```yaml
repertoire:
  repertoire_id: some-id
  sample:
    - sample_id: blood pre-treatment
    - sample_id: blood post-treatment
    - sample_id: tissue post-treatment
  disease_state_sample: cancer
  data_processing:
    - data_processing_id: 1
      primary_annotation: true
```
Consider the case where you have the above Repertoire file that describes some data of interest and an AIRR TSV with the accompanying data. You search the AIRR TSV file, and find a junction_aa of interest that you want to know more about. You get this information for the rearrangement that you found in the AIRR TSV:
```
sequence_id, repertoire_id, data_processing_id, v_call, d_call, j_call, junction_aa, ...
```
You can then look up the repertoire_id in the Repertoire file to find out which repertoire this rearrangement came from. Let's assume repertoire_id == some-id and data_processing_id == 1 as in the above repertoire.
In this case, you have no way of telling if the rearrangement you found is from the "blood pre-treatment", "blood post-treatment", or "tissue post-treatment" sample.
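A minimal sketch of the lookup, using plain Python dicts (hypothetical values; not an AIRR library API), shows where the ambiguity appears:

```python
# Hypothetical in-memory version of the repertoire above (plain dicts
# for illustration, not an AIRR library data structure).
repertoire = {
    "repertoire_id": "some-id",
    "sample": [
        {"sample_id": "blood pre-treatment"},
        {"sample_id": "blood post-treatment"},
        {"sample_id": "tissue post-treatment"},
    ],
    "data_processing": [
        {"data_processing_id": "1", "primary_annotation": True},
    ],
}

# A rearrangement row found in the AIRR TSV (junction_aa etc. elided;
# sequence_id value is made up).
rearrangement = {
    "sequence_id": "seq-42",
    "repertoire_id": "some-id",
    "data_processing_id": "1",
}

# The repertoire_id resolves the repertoire, but every sample in it is
# an equally valid candidate: the origin cannot be narrowed further.
candidates = []
if rearrangement["repertoire_id"] == repertoire["repertoire_id"]:
    candidates = [s["sample_id"] for s in repertoire["sample"]]

print(candidates)  # all three samples remain as candidates
```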
Is this not a problem?
I'd argue that it's not a problem, but a decision by the person who merged those samples into a single repertoire that there was no need to maintain the distinctions between them. Does that make the data less than infinitely reusable? Perhaps, but the whole point of MiAIRR is that the raw data is well annotated and you can always go back to that if you need to. And you could never completely avoid that necessity, anyway. And I still feel, as I said in #181, that making the rearrangement schema more complicated is a bad idea.
I have added a "sample_processing_id" field to the rearrangement in branch issue-249 and added the same sample_processing_id to the Repertoire object. The sample_processing_id should be unique within the Repertoire, similar to the way data_processing_id is unique. This should be sufficient to solve this problem from a disambiguation perspective and is a relatively minor change.
The repertoire record is now:
```yaml
repertoire:
  repertoire_id: some-id
  sample:
    - sample_processing_id: 1
      sample_id: blood pre-treatment
    - sample_processing_id: 2
      sample_id: blood post-treatment
    - sample_processing_id: 3
      sample_id: tissue post-treatment
  disease_state_sample: cancer
  data_processing:
    - data_processing_id: 1
      primary_annotation: true
```
The rearrangement record is now:
```
sequence_id, repertoire_id, sample_processing_id, data_processing_id, v_call, d_call, j_call, junction_aa, ...
```
One can then find the specific sample within the repertoire that the rearrangement came from. Also note that this makes the treatment of data_processing and sample processing symmetric.
Going in the other direction, when performing a search for all of the rearrangements from a specific sample, it is a bit more complicated. With this _id structure, one needs to search at the rearrangement level for a repertoire_id, data_processing_id, and a sample_processing_id to ensure that you get just the rearrangements from a single sample processing and a single data processing regime.
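Both directions can be sketched with plain Python dicts (hypothetical values and helper structures, not part of any AIRR library):

```python
# Hypothetical rows from the extended AIRR TSV, using the proposed field.
rearrangements = [
    {"sequence_id": "s1", "repertoire_id": "some-id",
     "sample_processing_id": "1", "data_processing_id": "1"},
    {"sequence_id": "s2", "repertoire_id": "some-id",
     "sample_processing_id": "2", "data_processing_id": "1"},
]

# Index built from the Repertoire file:
# (repertoire_id, sample_processing_id) -> sample_id
samples = {
    ("some-id", "1"): "blood pre-treatment",
    ("some-id", "2"): "blood post-treatment",
    ("some-id", "3"): "tissue post-treatment",
}

# Direction 1: rearrangement -> originating sample, now unambiguous.
r = rearrangements[0]
origin = samples[(r["repertoire_id"], r["sample_processing_id"])]

# Direction 2: finding all rearrangements for one sample processing
# regime and one data processing regime requires filtering on all
# three identifiers.
hits = [
    x["sequence_id"] for x in rearrangements
    if x["repertoire_id"] == "some-id"
    and x["sample_processing_id"] == "1"
    and x["data_processing_id"] == "1"
]
```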
I guess the actual field should be sample_id, as it's referring to a specific sample in the repertoire. As that's already a unique identifier in the repertoire, there doesn't seem to be a need to add another one.
I'm fine with adding it but it needs to be an optional field. In general, this is an n-to-n relation (input read to output rearrangement) due to read collapsing, so it cannot be required as that's an expensive relation to maintain.
So if it is added, then it needs to be spec'd, with documentation about how and under what circumstances it is a valid mapping. Is it the complete n-to-n relation, or only when there is a 1-to-1 mapping between reads and rearrangements?
> I'd argue that it's not a problem, but a decision by the person who merged those samples into a single repertoire that there was no need to maintain the distinctions between them.
I agree and would add one bit of philosophy as well. One major point of the repertoire was to break the one-sample/one-file/one-analysis assumption. Many tools make the assumption that a sample is the level of analysis, when in fact it isn't. The repertoire breaks that link, and thus allows biological sampling to be independent of how those samples are analyzed.
> I'd argue that it's not a problem, but a decision by the person who merged those samples into a single repertoire that there was no need to maintain the distinctions between them. Does that make the data less than infinitely reusable? Perhaps, but the whole point of MiAIRR is that the raw data is well annotated and you can always go back to that if you need to. And you could never completely avoid that necessity, anyway. And I still feel, as I said in #181, that making the rearrangement schema more complicated is a bad idea.
For me this is opening the door to, and making it easy for, researchers to do bad things, and it encourages people to structure their studies in a way that is hard to reuse. I would argue that we should be providing a format that encourages as much reuse as possible...
A researcher may want to intentionally group those rearrangements together, but this structure makes it trivial for a researcher to accidentally group the rearrangements together as well... The way we have structured the file format, the Repertoire is the key object. A Repertoire must be associated with a single subject but can have many samples. Therefore, we are "tempting" the researcher to put all samples from a single subject into a single repertoire.

As soon as a researcher puts multiple samples in a Repertoire, you are immediately saying that you don't want to maintain the distinctions between rearrangements that are associated with the samples in that Repertoire (as there is currently no way to do that). It feels like the general case is that you will almost always want to map a specific rearrangement in a TSV file back to a biological sample and how it was processed (both sample processing and data processing).
Adding a sample_processing_id at the repertoire makes it trivially possible to differentiate sample processing within the repertoire. You don't have to differentiate them if you don't want to, but you can differentiate them if you do want to...
Adding a single new _id field at the rearrangement level also trivially solves this, and given that we have 50+ fields already, I don't think adding one more _id field (in particular given its value) is a significant issue. If it is optional, then you don't have to provide it in a given AIRR TSV file. But having it makes it possible to resolve the above conflicts...
> I guess the actual field should be sample_id, as it's referring to a specific sample in the repertoire. As that's already a unique identifier in the repertoire, there doesn't seem to be a need to add another one.
That is what I thought as well initially, but I don't think that is sufficient. I think the correct analogy is that we are creating an id for sample_processing just like we do for data_processing. For example, if a single biological sample is cell sorted to create multiple cell subsets that are sequenced independently (or at least the researcher wants to differentiate between the rearrangements from those two cell subsets), then the sample_processing_id needs to differentiate between those two.
> Adding a sample_processing_id at the repertoire makes it trivially possible
It is not trivial. Your design does not handle the n-to-n relation between reads and rearrangements, which is necessary to know the sample that a rearrangement is from. Give us a design that handles all scenarios, otherwise it is trivially possible only for trivial data processing.
> That is what I thought as well initially, but I don't think that is sufficient. I think the correct analogy is that we are creating an id for sample_processing just like we do for data_processing. For example, if a single biological sample is cell sorted to create multiple cell subsets that are sequenced independently (or at least the researcher wants to differentiate between the rearrangements from those two cell subsets), then the sample_processing_id needs to differentiate between those two.
Ah yeah, that's right.
> I'm fine with adding it but it needs to be an optional field. In general, this is an n-to-n relation (input read to output rearrangement) due to read collapsing, so it cannot be required as that's an expensive relation to maintain.
By optional, you mean that it doesn't need to be in the file to be compliant. I would agree with that on the Rearrangement level - many cases would not have more than one sample in a repertoire so the repertoire_id would be sufficient. I have changed the description to that effect.
I think that is true at the Repertoire level as well. This is the same "optional level" as data_processing_id. It is not required and therefore may or may not be provided for the Repertoire.
I do worry about making it easy for researchers to make mistakes if they are using the Repertoire format to describe a study. What does it mean if a researcher has a Repertoire with more than one data_processing object and no data_processing_id? Is that intentional or a mistake? Making it required and nullable reduces the chances of it being a mistake, as making it null requires an explicit action... The same holds true for sample_processing_id. Having one and making it null means that the researcher is explicitly saying that the samples are grouped together and it is not necessary and/or desirable to differentiate the rearrangements by sample.
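As a rough illustration of the required-but-nullable idea, a hypothetical Swagger 2.0 fragment might look like this (definition and field names follow the thread; `x-nullable` is a common Swagger 2.0 extension, and the exact AIRR spec syntax may differ):

```yaml
# Hypothetical schema fragment, not the actual AIRR schema. Making the
# identifier required but nullable forces an explicit choice rather
# than a silent omission.
SampleProcessing:
  type: object
  required:
    - sample_processing_id
  properties:
    sample_processing_id:
      type: string
      x-nullable: true
      description: >
        Identifier for this sample processing within the repertoire.
        An explicit null states that the researcher chose not to
        differentiate rearrangements by sample.
```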
I think there's some value in having sample_id at the Rearrangement level as well, because it's quite common to merge files and annotate them with a custom column that defines the sample of origin (sample, donor, etc). Just having a reserved name for that optional, but common, field seems like a good idea.
Though... I'm starting to feel like there has to be a better mechanism for defining metadata fields that are copied down to the Rearrangement level as 1-to-many identifiers. Do we need to have all these as part of the Rearrangement spec? Or can we just set up some rules to facilitate propagating fields "downward" (e.g., id fields need unique names across specs)?

(Seems like this also applies to fields like germline_database that we moved "up".)
> Adding a sample_processing_id at the repertoire makes it trivially possible

> It is not trivial. Your design does not handle the n-to-n relation between reads and rearrangements, which is necessary to know the sample that a rearrangement is from. Give us a design that handles all scenarios, otherwise it is trivially possible only for trivial data processing.
Sorry, not following what you are asking for here... And I'm not sure I am promising to solve the general case... What I am trying to get to is an improvement on what we have now... 8-)
What I said was:
> Adding a sample_processing_id at the repertoire makes it trivially possible to differentiate sample processing within the repertoire.
First, I will say it is trivially possible if the researcher does it... They don't have to, but this at least makes it possible. In our current spec, this is not possible.
What adding a sample_processing_id at the Repertoire level trivially solves is the ability for a researcher to uniquely identify a set of sample_processing regimes within a Repertoire. That is, each element in the sample array in the Repertoire object has a unique ID within that repertoire so that it can be referred to externally at the rearrangement level. If the rearrangement also carries a sample_processing_id, then that, combined with the repertoire_id, maps back to a specific sample_processing regime in a specific repertoire. This solves the 1-n relationship between a sample_processing regime in a specific repertoire and the rearrangements that were produced in the sequencing run that the sample processing regime describes.
That is the improvement that I am suggesting at the moment...
> By optional, you mean that it doesn't need to be in the file to be compliant.
Correct. But also optional in that researchers should not rely on being able to do that mapping. Because your spec only handles a special case, tools shouldn't assume that they can always get back to the sample processing for a rearrangement.
> each element in the sample array in the Repertoire object has a unique ID within that repertoire so that it can be referred to externally...
That's reasonable and useful, so I'm good with that.
> Sorry, not following what you are asking for here...
You haven't handled the case of when multiple reads (potentially from multiple samples) get collapsed into a single rearrangement object.
> That is the improvement that I am suggesting at the moment...
I suppose I should point out for completeness that by having a repertoire_id, a sample_processing_id, and a data_processing_id we also have the ability to address the 1-N problem of going from a specific repertoire, with a specific sample_processing regime and specific data_processing object to a set of N rearrangements.
> I suppose I should point out for completeness that by having a repertoire_id, a sample_processing_id, and a data_processing_id we also have the ability to address the 1-N problem of going from a specific repertoire, with a specific sample_processing regime and specific data_processing object, to a set of N rearrangements.
Partially, but incompletely. It does not handle that for all cases. Unlike repertoire_id and data_processing_id, which do guarantee that you get all rearrangements that are processed together in that repertoire, adding sample_processing_id does not guarantee that you get all rearrangements from the same sample for that repertoire.
> Sorry, not following what you are asking for here...

> You haven't handled the case of when multiple reads (potentially from multiple samples) get collapsed into a single rearrangement object.
I think my suggested solution is as flexible as the current one; it just gives us more power. It does not prevent the researcher from doing what you had in your previous example.

If a single rearrangement object (your collapsed read) doesn't have a sample_processing_id and the repertoire_id that the rearrangement does refer to has multiple samples in the sample array, then this implies that the rearrangement comes from multiple samples. This takes care of that case, does it not? This works exactly the same way as before.
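A sketch of that interpretation, with a hypothetical helper (not part of the AIRR reference library):

```python
def candidate_samples(rearrangement, repertoire):
    """Return the sample_ids a rearrangement may have come from.

    Hypothetical helper for illustration only. A missing/null
    sample_processing_id on a rearrangement whose repertoire holds
    several samples means "from any of them", which covers the
    collapsed-read case and behaves exactly as before.
    """
    spid = rearrangement.get("sample_processing_id")
    if spid is None:
        return [s["sample_id"] for s in repertoire["sample"]]
    return [s["sample_id"] for s in repertoire["sample"]
            if s.get("sample_processing_id") == spid]


# Hypothetical repertoire with two sample processing entries.
repertoire = {
    "repertoire_id": "some-id",
    "sample": [
        {"sample_processing_id": "1", "sample_id": "blood pre-treatment"},
        {"sample_processing_id": "2", "sample_id": "blood post-treatment"},
    ],
}

# Collapsed read with no sample_processing_id: could be from either sample.
merged = candidate_samples({"repertoire_id": "some-id"}, repertoire)

# Rearrangement tagged with a sample_processing_id: a single sample.
single = candidate_samples(
    {"repertoire_id": "some-id", "sample_processing_id": "2"}, repertoire)
```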
> I suppose I should point out for completeness that by having a repertoire_id, a sample_processing_id, and a data_processing_id we also have the ability to address the 1-N problem of going from a specific repertoire, with a specific sample_processing regime and specific data_processing object, to a set of N rearrangements.

> Partially, but incompletely. It does not handle that for all cases. Unlike repertoire_id and data_processing_id, which do guarantee that you get all rearrangements that are processed together in that repertoire, adding sample_processing_id does not guarantee that you get all rearrangements from the same sample for that repertoire.
Yes, I agree, but for me that is not the intent... sample_processing_id is about being able to differentiate between how samples are processed in a specific repertoire (the entities in the sample array in Repertoire) in preparation for data_processing.
I think the query you are talking about is actually quite complex... The data "from a single sample" may be described across several repertoires and in many different ways, depending on how the researcher chooses to do this...
> Yes, I agree, but for me that is not the intent... sample_processing_id is about being able to differentiate between how samples are processed in a specific repertoire (the entities in the sample array in Repertoire) in preparation for data_processing.
I'm okay with that. I was just concerned because all the discussion/justification was about mapping rearrangements back to samples, which is problematic.
If we are mainly talking about giving that sample processing chunk in the repertoire metadata a unique identifier then I think that's a useful thing to have.
This came up a bit on the datarep call, but I am ok adding sample_processing_id as a reserved, optional field name along the lines suggested by @javh above.
> This came up a bit on the datarep call, but I am ok adding sample_processing_id as a reserved, optional field name along the lines suggested by @javh above.
If we put it in the repertoire metadata then it essentially becomes required, like data_processing_id, because its purpose is to differentiate those sample processing objects in the sample array. But it's optional in the rearrangement schema.
Yes, but I thought we were only talking about the rearrangement schema anyway.
> Yes, but I thought we were only talking about the rearrangement schema anyway.
Not quite, because this is an _id that is linking the Rearrangement object to a Repertoire object. The sample_processing_id is "defined" in the Repertoire by the researcher and then the researcher can link a specific Rearrangement object to how the sample was processed for a specific Repertoire by specifying the sample_processing_id and the repertoire_id.
@bcorrie I committed some changes to the branch. Do you use a swagger editor? I use this one, and it wasn't showing sample_processing_id when looking at the return types like repertoire_list. The OpenAPI v2 spec doesn't really say whether it's okay to mix allOf with other fields; you would think it would be, but that editor wasn't showing it. Though the spec does say that allOf needs to be a list of objects. I had to put SampleProcessing into its own object, like DataProcessing, in order for sample_processing_id to be recognized.
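For reference, the two schema shapes under discussion might look roughly like this (a hedged sketch; the definition names and referenced `Sample` schema are illustrative, not the actual AIRR schema):

```yaml
# Shape the editor reportedly mishandled: allOf mixed with sibling
# properties on the same schema object.
Repertoire_problematic:
  allOf:
    - $ref: '#/definitions/Sample'
  properties:
    sample_processing_id:
      type: string

# Workaround from the thread: fold the extra field into its own named
# object and compose only with allOf.
SampleProcessing:
  type: object
  properties:
    sample_processing_id:
      type: string
Repertoire_workaround:
  allOf:
    - $ref: '#/definitions/Sample'
    - $ref: '#/definitions/SampleProcessing'
```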
@schristley I use the same swagger editor... And I noticed that as well, but when I went to check the allOf without the sample_processing_id, I thought I noticed the same thing... So I thought it was a problem with allOf and the problem was always there with the swagger editor.
I just went back to check though, and sure enough, on the master branch with the allOf the swagger editor is working as expected (it shows the entire response). So I am not sure what I was looking at when I double checked...
Do you have a solution or do you want me to play with it???
@bcorrie check out my schema changes on the branch. I added a SoftwareProcessing object; seems odd, but it was the only way I could get it to work.
@schristley in swagger I am still seeing a `"sample": [ null ]` element for the sample response in /repertoire/{repertoire_id}
This is on the issue-249 branch still?
When I expand out the object in the Models section at the bottom of the swagger editor it has sample_processing_id in the repertoire_list.
When I look at the example response in the API entry point section it does not have it...
yeah, I was looking in the Models section when testing. The response object seems to be an example object with example data stuck in; maybe it doesn't know how to handle allOf for the example?
hmm, it does seem to work on the API in the master branch.
well, that's weird, I cleared the editor, plugged in raw url for master branch, and it shows example sample, then I cleared the editor again, plugged in raw url for issue-249 branch, and now its showing an example response.
if I click browser refresh, the sample goes back to null, so there is some flakiness in the editor I think.
OK, very weird - as I just did the same...
I am unable to reproduce a set of steps in the swagger editor that will actually show the response with sample_processing_id, even though I think we are both convinced that we have actually seen it working 8-)
yeah, it's still making me wonder whether we need that SampleProcessing object or whether your original schema was sufficient...
It is odd as in the swagger editor the Model for repertoire_list is correct in the models section, but if you look in the Model view of the actual API response it is missing. So I think it is a swagger editor issue.
Maybe try the original schema and check to see if it works in the Model section at the bottom of the swagger editor??? If that works, maybe we can consider it working (unless your middleware has the same issue)?
> Maybe try the original schema and check to see if it works in the Model section at the bottom of the swagger editor??? If that works, maybe we can consider it working (unless your middleware has the same issue)?
I did, and sample_processing_id wasn't showing up, which was why I changed the schema.
I haven't tried loading it up into the API service yet.
In https://github.com/airr-community/airr-standards/issues/181#issuecomment-478657199 we discussed having three _id fields at the rearrangement level to disambiguate mapping rearrangements back to different combinations of repertoires, samples, and data processing.
There seemed to be general consensus in #181 that this was reasonable, but we currently only have repertoire_id (RepertoireID from above) and data_processing_id (SoftwareProcessingID); SampleProcessingID was not added to the spec.
Reopening to revisit...