Open bcorrie opened 2 years ago
- From an ADC perspective, repertoire_ids are typically unique within a given context (in the ADC they are unique within a repository). It seems useful to add a repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit, in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that groups repertoires from multiple repositories.
No, I would not want the implicit definition, because it would be very useful (and possibly a quite common use case) to group repertoires across multiple repositories. I'm assuming this use case actually with VDJServer.
From an ADC perspective, I think all identifiers (which are meant to be FAIR) should be PIDs, except those few like subject_id and sample_id which are local to a study. I don't think that a separate repertoire_group_url field is needed though, as the repertoire_group_id can use a CURIE (or decentralized identifier).
But it is an open question on whether RepertoireGroup is a top-level object that has its own ADC entry point, and thus needs a PID? It's maybe possible that RepertoireGroup is embedded within the objects that use it, e.g. DataProcessing.
- I am not sure only having repertoire_id is sufficient to completely describe the groups of repertoires that we might want to group together. For example, if we wanted a RepertoireGroup that only contained IGH and I have a different SampleProcessing or DataProcessing for these rearrangements, I would need to provide a sample_processing_id or a data_processing_id to accurately describe the set of rearrangements that are being considered as part of this RepertoireGroup.
Yes, but that's purposefully not in the scope of RepertoireGroup. An analogy would be wanting only the productive rearrangements within a Repertoire; this is not defined with Repertoire but in the DataProcessing that acts upon that repertoire. The same would be true with RepertoireGroup. With the redesign of DataProcessing, it points to Repertoire and RepertoireGroup, instead of vice versa.
- Do we really want a TimePoint object in the RepertoireGroup? This seems to imply that RepertoireGroups are targeted at data over multiple time points. Although that might be a common use case, it is certainly not the only one that makes sense...
Yes so RepertoireGroup can be re-used for both: (1) a set of repertoires, and (2) a sequence of repertoires for a time course. It's optional, so leave it out if it doesn't apply.
- From an ADC perspective, repertoire_ids are typically unique within a given context (in the ADC they are unique within a repository). It seems useful to add a repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit, in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that groups repertoires from multiple repositories.
No, I would not want the implicit definition, because it would be very useful (and possibly a quite common use case) to group repertoires across multiple repositories. I'm assuming this use case actually with VDJServer.
Yes, I agree, same here. When you do a download on the Gateway, you are essentially creating a RepertoireGroup from multiple repositories. I suggested a URL as a mechanism to define the repository because it is easy and we can define that field today and use it. PIDs are challenging to get agreement on how and when - which means we don't have a usable RepertoireGroup until we sort out the harder problem - which is going to be when??? 8-)
- Do we really want a TimePoint object in the RepertoireGroup? This seems to imply that RepertoireGroups are targeted at data over multiple time points. Although that might be a common use case, it is certainly not the only one that makes sense...
Yes so RepertoireGroup can be re-used for both: (1) a set of repertoires, and (2) a sequence of repertoires for a time course. It's optional, so leave it out if it doesn't apply.
It just seems to be promoting time point as a "special case" when to me really it is one of many criteria/fields from the Repertoire metadata that you may use to group repertoires.
Why is grouping via time point a more important grouping criterion than disease state or tissue type?
But it is an open question on whether RepertoireGroup is a top-level object that has its own ADC entry point, and thus needs a PID? It's maybe possible that RepertoireGroup is embedded within the objects that use it, e.g. DataProcessing.
I don't think our repositories would store RepertoireGroups (and therefore you wouldn't need an API to look them up), but I could see how you might want to return a RepertoireGroup as an object. For example, the /airr/v1/repertoire end point could be asked to return a Repertoire JSON or a RepertoireGroup JSON object. The latter would provide a concise list of the Repertoire objects that your query produced rather than the JSON for all of the metadata for those Repertoires.
- From an ADC perspective, repertoire_ids are typically unique within a given context (in the ADC they are unique within a repository). It seems useful to add a repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit, in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that groups repertoires from multiple repositories.
No, I would not want the implicit definition, because it would be very useful (and possibly a quite common use case) to group repertoires across multiple repositories. I'm assuming this use case actually with VDJServer.
Yes, I agree, same here. When you do a download on the Gateway, you are essentially creating a RepertoireGroup from multiple repositories. I suggested a URL as a mechanism to define the repository because it is easy and we can define that field today and use it. PIDs are challenging to get agreement on how and when - which means we don't have a usable RepertoireGroup until we sort out the harder problem - which is going to be when??? 8-)
Well, you are the one that created this issue and wanted "resolvability" before RepertoireGroups are usable ;-D You are welcome to add a repertoire_group_url to identify the source; I just don't believe that solves the problem of "grouping repertoires from multiple repositories".
IMO, to use RepertoireGroup on the Gateway today, we don't need agreement on the details of PIDs or resolvable IDs. All we need is agreement that the repertoire_id returned by a repository is globally unique, so that tools can assume that (e.g. the Gateway can create a RepertoireGroup with repertoires from multiple repositories and not worry that the repertoire_id will conflict).
- Do we really want a TimePoint object in the RepertoireGroup? This seems to imply that RepertoireGroups are targeted at data over multiple time points. Although that might be a common use case, it is certainly not the only one that makes sense...
Yes so RepertoireGroup can be re-used for both: (1) a set of repertoires, and (2) a sequence of repertoires for a time course. It's optional, so leave it out if it doesn't apply.
It just seems to be promoting time point as a "special case" when to me really it is one of many criteria/fields from the Repertoire metadata that you may use to group repertoires.
Why is grouping via time point a more important grouping criterion than disease state or tissue type?
The difference is that I don't consider time course to be a grouping criterion like disease state or tissue type. In those examples, you are referring to a property of Repertoire (like disease_state_sample or tissue). The time course is an attribute on the group itself: it says the group is a sequence of repertoires (versus a set of repertoires) and thus the order of repertoires has importance. But whatever, that's discussing semantics, which tends to get us nowhere, because we have different viewpoints, so let's talk specific changes instead.
Do you want to eliminate the time point from RepertoireGroup? If so, then we still need a mechanism to describe a sequence of repertoires which can have user-defined labels. IMO that mechanism can be used to specify generic ordering as well as time course data. What is your suggestion for how to specify that in RepertoireGroup?
I don't think our repositories would store RepertoireGroups (and therefore you wouldn't need an API to look them up), but I could see how you might want to return a RepertoireGroup as an object. For example, the /airr/v1/repertoire end point could be asked to return a Repertoire JSON or a RepertoireGroup JSON object.
So then the results of the analysis on that RepertoireGroup couldn't be returned to the ADC?
In theory, this should be cheap, as it will contain neither Rearrangements nor Cells; a RepertoireGroup is really just a query, in some sense. On the other hand, putting it that way makes it seem like a separate endpoint might be redundant (querying for queries)...?
I don't think our repositories would store RepertoireGroups (and therefore you wouldn't need an API to look them up), but I could see how you might want to return a RepertoireGroup as an object. For example, the /airr/v1/repertoire end point could be asked to return a Repertoire JSON or a RepertoireGroup JSON object.
So then the results of the analysis on that RepertoireGroup couldn't be returned to the ADC?
In theory, this should be cheap, as it will contain neither Rearrangements nor Cells; a RepertoireGroup is really just a query, in some sense. On the other hand, putting it that way makes it seem like a separate endpoint might be redundant (querying for queries)...?
VDJServer will store RepertoireGroups just like it currently stores "groups" that users define for comparative analysis. We will just switch from our internal group format to AIRR RepertoireGroup. These are nice provenance objects about the cohorts, which can be hard to reverse-engineer from SRA records. Though, I think having a query end point for them is a bit overkill; more likely they can be returned as additional objects with Repertoire or DataProcessing. Likewise, creating a RepertoireGroup to "group" the repertoires returned from an ADC query is a convenience, but you are only saving yourself one line of code, something like this:
rg['repertoires'] = [ { 'repertoire_id': rep['repertoire_id'] } for rep in data['Repertoire'] ]
Right now, none of the studies in the ADC have cohorts defined as RepertoireGroups. That would be nice to have though.
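To make that one-liner concrete, here is a minimal sketch that wraps an ADC /repertoire response in a RepertoireGroup. The helper name make_repertoire_group, the group ID, and the trimmed-down response are made up for illustration; only the RepertoireGroup field names come from the examples in this thread.

```python
def make_repertoire_group(group_id, adc_response, name=None, description=None):
    """Build a RepertoireGroup dict listing every repertoire in an ADC response."""
    group = {
        "repertoire_group_id": group_id,
        "repertoires": [
            {"repertoire_id": rep["repertoire_id"]}
            for rep in adc_response["Repertoire"]
        ],
    }
    # Optional, human-readable fields are only added when supplied.
    if name is not None:
        group["repertoire_group_name"] = name
    if description is not None:
        group["repertoire_group_description"] = description
    return group


# Hypothetical, trimmed-down response; real Repertoire objects carry far more metadata.
data = {
    "Repertoire": [
        {"repertoire_id": "8178786653546746346-242ac113-0001-012", "study": {}},
        {"repertoire_id": "8196138321422586346-242ac113-0001-012", "study": {}},
    ]
}
rg = make_repertoire_group("example-group-1", data, name="example download")
```

The result keeps only the repertoire_id of each repertoire, which is what makes the group a concise provenance record rather than a copy of the metadata.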
I just revisited the current definition and the discussion above. Based on all of that, I think the RepertoireGroup that we have now seems fairly good for what was its original intent. For example, given the discussion in #548 I could see creating a RepertoireGroup that encompassed the entire download from the iReceptor Gateway.
This manifest:
{
"Info": {
"title": "AIRR Manifest",
"version": "3.0",
"description": "List of files for each repository",
"contact": {
"name": "iReceptor Gateway",
"url": "https://gateway.ireceptor.org",
"email": "support@ireceptor.org"
}
},
"DataSets": [
{
"dataset_name": "VDJServer",
"repository_url": "https://vdjserver.org/airr/v1/",
"dataset": [
{ "data_type" : "Repertoire", "data_files" : [ "vdjserver-metadata.json"] },
{ "data_type" : "Rearrangement", "data_files" : ["vdjserver.tsv"] }
]
},
{
"dataset_name": "VDJbase",
"repository_url": "https://airr-seq.vdjbase.org/airr/v1/",
"dataset": [
{ "data_type" : "Repertoire", "data_files" : [ "vdjbase-metadata.json"] },
{ "data_type" : "Rearrangement", "data_files" : ["vdjbase.tsv"] }
]
},
{
"dataset_name": "COVID 19-1",
"repository_url": "https://covid19-1.ireceptor.org/airr/v1/",
"dataset": [
{ "data_type" : "Repertoire", "data_files" : [ "airr-covid-19-metadata.json"] },
{ "data_type" : "Rearrangement", "data_files" : ["airr-covid-19.tsv"] }
]
}
]
}
Could have a RepertoireGroup that looked like this:
{
"repertoire_group_id" : "globally unique id 1",
# Info about the download and when it happened
"repertoire_group_name" : "iReceptor download - bcorrie - 2024-02-09",
# Human readable description of Repertoire filters used to generate the data.
# We do this for download today already
"repertoire_group_description" : "Diagnosis (Ontology ID): DOID:9952 or DOID:9119, PCR target: TRA",
"repertoires": [
{"repertoire_id": "8178786653546746346-242ac113-0001-012"},
{"repertoire_id": "8196138321422586346-242ac113-0001-012"},
{"repertoire_id": "8213318190606586346-242ac113-0001-012"}
[Stuff deleted from other repertoire (this could be quite long)]
]
}
I could also see having a RepertoireGroup per repository, so part of my manifest might be:
"DataSets": [
{
"dataset_name": "VDJServer",
"repository_url": "https://vdjserver.org/airr/v1/",
"dataset": [
{ "data_type" : "RepertoireGroup", "data_files" : [ "vdjserver-repertoire-group.json"] },
{ "data_type" : "Repertoire", "data_files" : [ "vdjserver-metadata.json"] },
{ "data_type" : "Rearrangement", "data_files" : ["vdjserver.tsv"] }
]
},
I could also see how one could create two RepertoireGroup files that essentially sliced a download Manifest across an orthogonal variable such as disease_diagnosis (e.g. DOID:9952 and DOID:9119). The download contains data from three repositories across both conditions, and the RepertoireGroup splits the repertoire_ids into two groups based on DOID. That split is the basis for a comparative analysis across disease conditions, and the downstream analysis tool can use that to extract the data appropriately (across all three repositories) to perform the analysis. Nice...
@scharch @schristley I think we are done 8-)
@scharch @schristley I think we are done 8-)
Tongue in cheek comment, I didn't really mean that I think we are done... I am re-opening this issue - sorry for the confusion.
@scharch in all seriousness, this meets my expectations/thoughts on what one might want from a RepertoireGroup.
I am not so sure on your use case and what you are thinking about this...
Brainstorming a pretty complex use case that combines Manifest and RepertoireGroup as we might use it on the iReceptor Gateway in an Analysis App.
A file that describes two RepertoireGroups for comparative analysis (COVID vs Healthy Controls), TRA locus only:
{
"Info": STUFF OMITTED,
"RepertoireGroup":
[
{
"repertoire_group_id" : "globally unique id 1",
"repertoire_group_name" : "COVID-19 cohort (TRA)",
"repertoire_group_description" : "Diagnosis (Ontology ID): DOID:0080600, PCR target: TRA",
"repertoires": [
{"repertoire_id": "8178786653546746346-242ac113-0001-012"},
[169 Repertoires in Total from ADC]
]
},
{
"repertoire_group_id" : "globally unique id 2",
"repertoire_group_name" : "Control (Healthy) cohort (TRA)",
"repertoire_group_description" : "Study Group: Control (Healthy), PCR target: TRA",
"repertoires": [
{"repertoire_id": "8178786653546746781-242ac113-0001-012"},
[9 Repertoires in Total from ADC]
]
}
] # End RepertoireGroup
} # End file
Combine that with a manifest from a download like this:
"DataSets": [
{
"dataset_name": "VDJServer",
"repository_url": "https://vdjserver.org/airr/v1/",
"dataset": [
{ "data_type" : "RepertoireGroup", "data_files" : [ "vdjserver-repertoire-group.json"] },
{ "data_type" : "Repertoire", "data_files" : [ "vdjserver-metadata.json"] },
{ "data_type" : "Rearrangement", "data_files" : ["vdjserver.tsv"] }
]
},
{
"dataset_name": "IPA1",
"repository_url": "https://ipa1.ireceptor.org/airr/v1/",
"dataset": [
{ "data_type" : "RepertoireGroup", "data_files" : [ "repertoire-group.json"] },
{ "data_type" : "Repertoire", "data_files" : [ "ipa1-metadata.json"] },
{ "data_type" : "Rearrangement", "data_files" : ["ipa1.tsv"] }
]
},
Assume an analysis tool designed to do comparative analyses on N RepertoireGroups. It expects a RepertoireGroup.json, a Manifest.json file, and all of the relevant files described in the Manifest.json.
Processing would:
Here is another interesting one - real study, with real data - Go to the Gateway and search for TRA and PRJCA002413:
{
"Info": STUFF OMITTED,
"RepertoireGroup":
[
{
"repertoire_group_id" : "globally unique id 1",
"repertoire_group_name" : "Early Recovery Cohort",
"repertoire_group_description" : "study_id: PRJCA002413, subject_id: ERS, disease_diagnosis: DOID:0080600, ",
"repertoires": [
{"repertoire_id": "PRJCA002413-ERS1-TRA"},
{"repertoire_id": "PRJCA002413-ERS2-TRA"},
{"repertoire_id": "PRJCA002413-ERS3-TRA"},
{"repertoire_id": "PRJCA002413-ERS4-TRA"},
{"repertoire_id": "PRJCA002413-ERS5-TRA"}
]
},
{
"repertoire_group_id" : "globally unique id 2",
"repertoire_group_name" : "Late Recovery Cohort",
"repertoire_group_description" : "study_id: PRJCA002413, subject_id: LRS, disease_diagnosis: DOID:0080600, ",
"repertoires": [
{"repertoire_id": "PRJCA002413-LRS1-TRA"},
{"repertoire_id": "PRJCA002413-LRS2-TRA"},
{"repertoire_id": "PRJCA002413-LRS3-TRA"},
{"repertoire_id": "PRJCA002413-LRS4-TRA"},
{"repertoire_id": "PRJCA002413-LRS5-TRA"}
]
},
{
"repertoire_group_id" : "globally unique id 3",
"repertoire_group_name" : "Control (Healthy) Cohort",
"repertoire_group_description" : "study_id: PRJCA002413, subject_id: HC, disease_diagnosis: DOID:0080600, ",
"repertoires": [
{"repertoire_id": "PRJCA002413-Healthy_Control_1-TRA"},
{"repertoire_id": "PRJCA002413-Healthy_Control_2-TRA"},
{"repertoire_id": "PRJCA002413-Healthy_Control_3-TRA"},
{"repertoire_id": "PRJCA002413-Healthy_Control_4-TRA"},
{"repertoire_id": "PRJCA002413-Healthy_Control_5-TRA"}
]
}
] # End RepertoireGroup
} # End file
With manifest as below (this is essentially what you get from the Gateway if you were to download this data):
"DataSets": [
{
"dataset_name": "VDJServer",
"repository_url": "https://covid19-1.ireceptor.org/airr/v1/",
"dataset": [
{ "data_type" : "RepertoireGroup", "data_files" : [ "PRJCA002413-repertoire-group.json"] },
{ "data_type" : "Repertoire", "data_files" : [ "covid19-1-metadata.json"] },
{ "data_type" : "Rearrangement", "data_files" : ["covid19-1.tsv"] }
]
}
The RepertoireGroup file says there are three cohorts being considered.
The Manifest file says that there is one source DataSet to consider.
Assuming the analysis is the same as the previous scenario, the analysis then splits covid19-1.tsv into three separate files, one each for ERS, LRS, and HC, and then runs a comparative analysis across those three RepertoireGroups.
Exactly the same data but a RepertoireGroup file that only contains LRS and ERS gives you a late recovery and early recovery comparison (ignores healthy controls).
Exactly the same data, but with five RepertoireGroups in the file, one repertoire per group for each of ERS1, ERS2, ERS3, ERS4, ERS5, gives you a comparative analysis across the five different early recovery subjects (ignores all of the LRS and HC repertoires).
This seems pretty powerful to me... We can basically use the RepertoireGroup file to slice a given data set in any way that we want (at the repertoire_id level at least).
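The slicing described here can be sketched roughly as follows. This is only an illustration, not how the Gateway or any analysis app actually does it: the function name split_rearrangements is hypothetical, the sample rows are invented, and it assumes the rearrangement TSV carries a repertoire_id column.

```python
import csv
import io


def split_rearrangements(tsv_text, groups):
    """Return {repertoire_group_id: [rows]} for rows whose repertoire_id is in that group."""
    members = {
        g["repertoire_group_id"]: {r["repertoire_id"] for r in g["repertoires"]}
        for g in groups
    }
    sliced = {group_id: [] for group_id in members}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for group_id, ids in members.items():
            # A repertoire may belong to several groups, so check every group.
            if row["repertoire_id"] in ids:
                sliced[group_id].append(row)
    return sliced


# Invented rows for illustration only.
tsv = (
    "sequence_id\trepertoire_id\n"
    "s1\tPRJCA002413-ERS1-TRA\n"
    "s2\tPRJCA002413-LRS1-TRA\n"
    "s3\tPRJCA002413-ERS1-TRA\n"
    "s4\tPRJCA002413-Healthy_Control_1-TRA\n"
)
groups = [
    {"repertoire_group_id": "ERS", "repertoires": [{"repertoire_id": "PRJCA002413-ERS1-TRA"}]},
    {"repertoire_group_id": "LRS", "repertoires": [{"repertoire_id": "PRJCA002413-LRS1-TRA"}]},
]
sliced = split_rearrangements(tsv, groups)
```

Rows for repertoires that are in no group (the healthy control here) are simply ignored, which matches the "LRS vs ERS only" scenario above.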
The big thing that's missing is the ability to extract and combine parts of Repertoires. What if I want to analyze only IgG? Or only CD27+CD21- activated B cells from a CITEseq data set or only RBD-binding B cells from a LIBRAseq data set?
I think this may be related to differences in our conceptions of what RepertoireGroup is for. In your examples, each cohort is a RepertoireGroup and the analysis is done between them. @schristley has expressed a similar view point here. But to me, one of the main points of RepertoireGroup is to be the fundamental unit of analysis, as I've advocated for a while.
Should RepertoireGroup include a study_id? It wouldn't be relevant to an ADC download, but a meta-analysis would be a new Study, right?
The big thing that's missing is the ability to extract and combine parts of Repertoires. What if I want to analyze only IgG? Or only CD27+CD21- activated B cells from a CITEseq data set or only RBD-binding B cells from a LIBRAseq data set?
I think this may be related to differences in our conceptions of what RepertoireGroup is for. In your examples, each cohort is a RepertoireGroup and the analysis is done between them. @schristley has expressed a similar view point here. But to me, one of the main points of RepertoireGroup is to be the fundamental unit of analysis, as I've advocated for a while.
To do that, would it suffice to have a repertoire_id, data_processing_id, sample_processing_id to specify a part of a Repertoire? Or are you thinking of an even lower level of granularity than that? Do the different data processing and sample processing of the samples/data give you what you are looking for?
{
"repertoire_group_id" : "globally unique id 1",
"repertoire_group_name" : "Early Recovery Cohort",
"repertoire_group_description" : "study_id: PRJCA002413, subject_id: ERS, disease_diagnosis: DOID:0080600, ",
"repertoires": [
{"repertoire_id": "PRJCA002413-ERS1", "sample_processing_id":"XXX", "data_processing_id":"YYY"},
{"repertoire_id": "PRJCA002413-ERS2", "sample_processing_id":"XXX", "data_processing_id":"YYY"}
]
}
No, because I might want arbitrary filter criteria that don't align with the original researcher's sample/data processing. E.g.:
- Rearrangements
- Cells identified as TEMRA by CellTypist
- Cells with non-zero CD11c in CellExpression
OK, now that is complicated... and I was afraid that was going to be your answer 8-)
Any ideas how to describe that???
And are we "simply" describing what is in the data set?
Or are we describing what one needs to extract from the data set? That is, the TSV file contains all rearrangements and we say v_family = IGHV-4 to describe what needs to be extracted?
We could probably use an ADC query to describe what is in the data set???
We could probably use an ADC query to describe what is in the data set???
I was thinking along those lines, yes, but ultimately it probably needs to be free text.
And are we "simply" describing what is in the data set? Or are we describing what one needs to extract from the data set?
Again, ideally the latter. But practically, probably the former.
The big thing that's missing is the ability to extract and combine parts of Repertoires. What if I want to analyze only IgG? Or only CD27+CD21- activated B cells from a CITEseq data set or only RBD-binding B cells from a LIBRAseq data set?
I think this may be related to differences in our conceptions of what RepertoireGroup is for. In your examples, each cohort is a RepertoireGroup and the analysis is done between them. @schristley has expressed a similar view point here. But to me, one of the main points of RepertoireGroup is to be the fundamental unit of analysis, as I've advocated for a while.
@scharch Yes, and I agree with your advocacy! The first pass of RepertoireGroup was purposefully simple: group repertoires together (as a whole) so you can do computations both intra- and inter-group. This works great for its purpose, and I got it integrated into the VDJServer pipelines.
Now I say, "as a whole", but I'm lying. I'm already doing some filtering because I almost always want just productive rearrangements, so I filter out the nonproductives. I've always had it in my mind to extend that filtering capability, and have delayed doing anything, but I increasingly need it.
My anecdote? In a collaboration, I'm asked, can you give them those stats (gene usage, etc) for just IGHV4? No problem, I'll write a one-off script to pull out the rearrangements and voilà. Oh, if it isn't hard, give it to me for all VH families. Then two days later, can you give me IGHV4 but also separated by J family? You see where this is going... Before you know it, my one-off script is multiple scripts with multiple parameters. I have a multi-TB analysis tree with dozens of directories where the rearrangements have been duplicated numerous times... Exactly what I didn't want!
We could probably use an ADC query to describe what is in the data set???
That's exactly what I was thinking. When I started thinking about how to generalize filtering, what makes sense is to use the ADC query language to describe the filter instead of inventing something new. This isn't too hard to implement so long as you are not that concerned about efficiency: you evaluate the expression tree implied by the ADC query on each rearrangement record, and it comes out true or false.
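A rough sketch of that per-record evaluation, under some loud assumptions: the function name matches is hypothetical, records are flat dicts (no dotted field paths), and only a subset of the ADC operators (and, or, =, !=, <, <=, >, >=, in, exclude) is handled. It is not how any existing tool implements it.

```python
import operator

# Comparison operators from the ADC query language handled by this sketch.
_COMPARE = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(node, record):
    """Recursively evaluate an ADC-style filter tree against one flat record."""
    op = node["op"]
    content = node["content"]
    if op == "and":
        return all(matches(child, record) for child in content)
    if op == "or":
        return any(matches(child, record) for child in content)
    value = record.get(content["field"])
    if op in _COMPARE:
        # Assumption: a missing field never satisfies a comparison.
        return value is not None and _COMPARE[op](value, content["value"])
    if op == "in":
        return value in content["value"]
    if op == "exclude":
        return value not in content["value"]
    raise ValueError(f"unsupported operator: {op}")


productive_vh4 = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "productive", "value": True}},
        {"op": "=", "content": {"field": "v_subgroup", "value": "IGHV4"}},
    ],
}
keep = matches(productive_vh4, {"productive": True, "v_subgroup": "IGHV4"})
drop = matches(productive_vh4, {"productive": False, "v_subgroup": "IGHV4"})
```

Because the tree is walked once per record, the cost is linear in the number of rearrangements, which is the "not that concerned about efficiency" trade-off mentioned above.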
But, and this is a big butt (sorry ;-), I've only been thinking in the context of filtering Rearrangements. If we want a more general filtering which works on Clones and Cells and whatever, that starts getting more complicated. Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that crosses types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated.
We could probably use an ADC query to describe what is in the data set???
I was thinking along those lines, yes, but ultimately it probably needs to be free text.
That's almost worthless honestly. If that is all you want, stick it in repertoire_group_description and be done. The useful thing is a precise definition so that an analysis tool can do it for you.
One idea is to try not to do everything in one chunk. We have the basic RepertoireGroup now; what's a simple but useful way to add filtering, even if it doesn't give us everything?
To do that, would it suffice to have a repertoire_id, data_processing_id, sample_processing_id to specify a part of a Repertoire?
No, but actually we need this anyways. The simple repertoire_id isn't sufficient. It works now because we mostly limit ourselves to a single data processing.
That's almost worthless honestly. If that is all you want, stick it in repertoire_group_description and be done. The useful thing is a precise definition so that an analysis tool can do it for you.
I agree with the principle, of course. But I was thinking more about this:
Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that cross types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated.
I think that the "right" answer is probably a DataProcessing-type object, although I'm not sure that gets all the way to the rigor you are looking for.
Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that cross types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated.
I think that the "right" answer is probably a DataProcessing-type object, although I'm not sure that gets all the way to the rigor you are looking for.
If we don't try to do everything but think that allowing filters on object types would be useful, then a simple enhancement to RepertoireGroup should be able to provide this. Like with the manifest, we could add a type-keyed filters field which provides the filter, in ADC query format. For example, here is a filter on Rearrangement indicating to include productive records for the VH4 family:
RepertoireGroup:
  - repertoire_group_id: tumor
    repertoires:
      - repertoire_id: 33983-3T-30_S14
      - repertoire_id: 32277-3T-36_S20
      - repertoire_id: 30174-2T-29_S13
    filters:
      Rearrangement:
        op: and
        content:
          - op: =
            content:
              field: productive
              value: true
          - op: =
            content:
              field: v_subgroup
              value: IGHV4
And with the type key, that allows us to specify more, so here's the same example but also providing a filter on clones.
RepertoireGroup:
  - repertoire_group_id: tumor
    repertoires:
      - repertoire_id: 33983-3T-30_S14
      - repertoire_id: 32277-3T-36_S20
      - repertoire_id: 30174-2T-29_S13
    filters:
      Rearrangement:
        op: and
        content:
          - op: =
            content:
              field: productive
              value: true
          - op: =
            content:
              field: v_subgroup
              value: IGHV4
      Clone:
        op: =
        content:
          field: v_subgroup
          value: IGHV4
At least for my test case, this would be highly useful. I could create groups for each of the combinations of V and J families that I need, and yes it would be a lot, but then I could run repcalc once, it would process all of the groups, generate output files for all of those groups, and I would never have to duplicate the rearrangements files.
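Generating those groups could look something like the sketch below. Everything here is illustrative: the filters key is the proposal from this thread rather than part of the current schema, the helper names eq and vj_groups are made up, and j_subgroup is assumed by analogy with v_subgroup.

```python
from itertools import product


def eq(field, value):
    """Build one ADC-style equality clause."""
    return {"op": "=", "content": {"field": field, "value": value}}


def vj_groups(base_id, repertoire_ids, v_families, j_families):
    """One RepertoireGroup per V/J family combination, all over the same repertoires."""
    groups = []
    for v, j in product(v_families, j_families):
        groups.append({
            "repertoire_group_id": f"{base_id}-{v}-{j}",
            "repertoires": [{"repertoire_id": rid} for rid in repertoire_ids],
            # Proposed (not yet standardized) type-keyed filters field.
            "filters": {
                "Rearrangement": {
                    "op": "and",
                    "content": [
                        eq("productive", True),
                        eq("v_subgroup", v),
                        eq("j_subgroup", j),  # assumed field name
                    ],
                }
            },
        })
    return groups


groups = vj_groups(
    "tumor",
    ["33983-3T-30_S14", "32277-3T-36_S20"],
    ["IGHV3", "IGHV4"],
    ["IGHJ4", "IGHJ6"],
)
```

The rearrangement files are referenced by every group but never copied, which is the point of doing the filtering in the group definition.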
Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that crosses types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated.
I think that the "right" answer is probably a DataProcessing-type object, although I'm not sure that gets all the way to the rigor you are looking for.
Doing something like, give me the Rearrangements that are in a Cell where the CellExpression of the gene ABC > 5, would require a query language with power like SQL. I don't see that happening. And then yes, you'd likely write a custom script, describe the use of that script in DataProcessing, and like with my use case above, I would still use RepertoireGroup, but my tool would not use the raw rearrangement files as input but instead the post-filtered rearrangements output by the custom script.
Doing something like, give me the Rearrangements that are in a Cell where the CellExpression of the gene ABC > 5, would require a query language with power like SQL. I don't see that happening.
Yeah, this is where I was coming from, but there's definitely plenty of utility in simpler cases. I want to sleep on it a bit, but what you've proposed above seems pretty good to me at first blush.
If we don't try to do everything but think that allowing filters on object types would be useful, then a simple enhancement to RepertoireGroup should be able to provide this. Like with the manifest, we could add a type-keyed filters which provides the filter, in ADC query format.
@schristley I'm thinking maybe the filters should be per Repertoire, not necessarily over the whole RepertoireGroup. What do you think about that?
More generally, how should the filters be described in the schema? I guess we'll need to provide an explicit list of type keys like we have for DataFile, but in theory op and content are arbitrarily recursive, so ...???
@schristley I'm thinking maybe the filters should be per Repertoire, not necessarily over the whole RepertoireGroup. What do you think about that?
Sure, just put the filters with each repertoire entry. I wouldn't want to get rid of the filters at the RepertoireGroup level though, because that's convenient (and less prone to error) if you want the same filter applied to all repertoires.
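If both levels exist, a tool needs some rule for combining them. One possible rule, sketched below purely as an assumption (nothing in this thread settles it), is to AND the group-level filter with the per-repertoire filter. The function name effective_filter is hypothetical.

```python
def effective_filter(group, entry, object_type="Rearrangement"):
    """Combine group-level and per-repertoire filters for one object type.

    Assumption: when both are present they are AND-ed together.
    Returns None when neither level defines a filter.
    """
    group_filter = group.get("filters", {}).get(object_type)
    entry_filter = entry.get("filters", {}).get(object_type)
    if group_filter is None:
        return entry_filter
    if entry_filter is None:
        return group_filter
    return {"op": "and", "content": [group_filter, entry_filter]}


productive = {"op": "=", "content": {"field": "productive", "value": True}}
vh4 = {"op": "=", "content": {"field": "v_subgroup", "value": "IGHV4"}}

group = {
    "repertoire_group_id": "tumor",
    "filters": {"Rearrangement": productive},
    "repertoires": [
        {"repertoire_id": "33983-3T-30_S14", "filters": {"Rearrangement": vh4}},
        {"repertoire_id": "32277-3T-36_S20"},
    ],
}
combined = [effective_filter(group, entry) for entry in group["repertoires"]]
```

An alternative rule would be for the per-repertoire filter to override the group-level one; which is preferable is a design decision for the schema, not something this sketch decides.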
More generally, how should the filters be described in the schema? I guess we'll need to provide an explicit list of type keys like we have for DataFile, but in theory op and content are arbitrarily recursive, so ...???
Not easily. We don't try with the ADC API and instead just describe in the docs.
Doing something like, give me the Rearrangements that are in a Cell where the CellExpression of the gene ABC > 5, would require a query language with power like SQL. I don't see that happening.
Yeah, this is where I was coming from, but there's definitely plenty of utility in simpler cases. I want to sleep on it a bit, but what you've proposed above seems pretty good to me at first blush.
As @schristley says, doing this as a single query in the ADC is probably unlikely. With that said this is very possible with multiple queries across the ADC endpoints. In fact, the above query is exactly what the iReceptor Gateway Cell page does currently.
This is a search for Expression of TRBV4-1 > 10 across all T-cell repertoires in a specific study. If you click Download you get all the Rearrangements, Cells, and Expression data.
So the ADC queries are powerful, you "just" can't do complicated joins across the collections. The iReceptor Gateway does 1 repertoire, 12 expression, 10 cell, and 10 rearrangement queries to gather the data presented in the page below 8-)
With that said, finding all the Cells with that expression level is one query:
curl -s -d @query2.json https://covid19-1.ireceptor.org/airr/v1/expression
where query2.json is:
{
"filters": {
"op": "and",
"content": [
{
"op": "=",
"content": {
"field": "property.label",
"value": "TRBV4-1"
}
},
{
"op": ">",
"content": {
"field": "value",
"value": 10
}
},
{
"op": "in",
"content": {
"field": "repertoire_id",
"value": [
"PRJCA002413-ERS1-TR-CELL",
"PRJCA002413-ERS2-TR-CELL",
"PRJCA002413-ERS3-TR-CELL",
"PRJCA002413-ERS4-TR-CELL",
"PRJCA002413-ERS5-TR-CELL",
"PRJCA002413-LRS1-TR-CELL",
"PRJCA002413-LRS2-TR-CELL",
"PRJCA002413-LRS3-TR-CELL",
"PRJCA002413-LRS4-TR-CELL",
"PRJCA002413-LRS5-TR-CELL",
"PRJCA002413-Healthy_Control_1-TR-CELL",
"PRJCA002413-Healthy_Control_2-TR-CELL",
"PRJCA002413-Healthy_Control_3-TR-CELL",
"PRJCA002413-Healthy_Control_4-TR-CELL",
"PRJCA002413-Healthy_Control_5-TR-CELL"
]
}
}
]
}
}
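For completeness, the same query body can be assembled programmatically rather than written by hand. This is only a sketch: the function name expression_query is made up, and it only builds the JSON string that would be saved as query2.json for the curl call above; it does not contact the repository.

```python
import json


def expression_query(label, min_value, repertoire_ids):
    """Build an ADC filter: expression of `label` > `min_value` within the given repertoires."""
    return {
        "filters": {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "property.label", "value": label}},
                {"op": ">", "content": {"field": "value", "value": min_value}},
                {"op": "in", "content": {"field": "repertoire_id", "value": list(repertoire_ids)}},
            ],
        }
    }


repertoire_ids = [f"PRJCA002413-ERS{i}-TR-CELL" for i in range(1, 6)]
query = expression_query("TRBV4-1", 10, repertoire_ids)
body = json.dumps(query, indent=2)  # contents to write to query2.json
```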
A couple of questions that came up in an iReceptor Plus meeting today:
- From an ADC perspective, repertoire_ids are typically unique within a given context (in the ADC they are unique within a repository). It seems useful to add a repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit, in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that groups repertoires from multiple repositories.
- I am not sure only having repertoire_id is sufficient to completely describe the groups of repertoires that we might want to group together. For example, if we wanted a RepertoireGroup that only contained IGH and I have a different SampleProcessing or DataProcessing for these rearrangements, I would need to provide a sample_processing_id or a data_processing_id to accurately describe the set of rearrangements that are being considered as part of this RepertoireGroup.