RepertoireGroup refinements

bcorrie commented 2 years ago

A couple of questions that came up in an iReceptor Plus meeting today:

From an ADC perspective, repertoire_ids are typically unique within a given context (in the ADC they are unique within a repository). It seems useful to add a repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit, in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that is grouping repertoires from multiple repositories.
I am not sure only having repertoire_id is sufficient to completely describe the groups of repertoires that we might want to group together. For example, if we wanted a RepertoireGroup that only contained IGH and I have a different SampleProcessing or DataProcessing for these rearrangements, I would need to provide a sample_processing_id or a data_processing_id to accurately describe the set of rearrangements that are being considered as part of this RepertoireGroup.
Do we really want a TimePoint object in the RepertoireGroup? This seems to imply that RepertoireGroups are targeted at data over multiple time points. Although that might be a common use case, it is certainly not the only one that makes sense...

schristley commented 2 years ago

From an ADC perspective, repertoire_ids are typically unique within a given context (in the ADC they are unique within a repository). It seems useful to add a repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit, in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that is grouping repertoires from multiple repositories.

No, I would not want the implicit definition, because it would be very useful (and possibly a quite common use case) to group repertoires across multiple repositories. I'm actually assuming this use case with VDJServer.

From an ADC perspective, I think all identifiers (which are meant to be FAIR) should be PIDs, except those few like subject_id and sample_id which are local to a study. I don't think that a separate repertoire_group_url field is needed though, as the repertoire_group_id can use a CURIE (or decentralized identifier).

But it is an open question on whether RepertoireGroup is a top-level object that has its own ADC entry point, and thus needs a PID? It's maybe possible that RepertoireGroup is embedded within the objects that use it, e.g. DataProcessing.

schristley commented 2 years ago

I am not sure only having repertoire_id is sufficient to completely describe the groups of repertoires that we might want to group together. For example, if we wanted a RepertoireGroup that only contained IGH and I have a different SampleProcessing or DataProcessing for these rearrangements, I would need to provide a sample_processing_id or a data_processing_id to accurately describe the set of rearrangements that are being considered as part of this RepertoireGroup.

Yes, but that's purposefully not in the scope of RepertoireGroup. An analogy would be wanting only the productive rearrangements within a Repertoire; this is not defined with Repertoire but in the DataProcessing that acts upon that repertoire. The same would be true with RepertoireGroup. With the redesign of DataProcessing, it points to Repertoire and RepertoireGroup, instead of vice versa.

schristley commented 2 years ago

Do we really want a TimePoint object in the RepertoireGroup? This seems to imply that RepertoireGroups are targeted at data over multiple time points. Although that might be a common use case, it is certainly not the only one that makes sense...

Yes so RepertoireGroup can be re-used for both: (1) a set of repertoires, and (2) a sequence of repertoires for a time course. It's optional, so leave it out if it doesn't apply.

bcorrie commented 2 years ago

From an ADC perspective, repertoire_ids are typically unique within a given context (in the ADC they are unique within a repository). It seems useful to add a repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit, in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that is grouping repertoires from multiple repositories.

No, I would not want the implicit definition, because it would be very useful (and possibly a quite common use case) to group repertoires across multiple repositories. I'm actually assuming this use case with VDJServer.

Yes, I agree, same here. When you do a download on the Gateway, you are essentially creating a RepertoireGroup from multiple repositories. I suggested a URL as a mechanism to define the repository because it is easy and we can define that field today and use it. PIDs are challenging to get agreement on how and when - which means we don't have a usable RepertoireGroup until we sort out the harder problem - which is going to be when??? 8-)

bcorrie commented 2 years ago

Do we really want a TimePoint object in the RepertoireGroup? This seems to imply that RepertoireGroups are targeted at data over multiple time points. Although that might be a common use case, it is certainly not the only one that makes sense...

Yes so RepertoireGroup can be re-used for both: (1) a set of repertoires, and (2) a sequence of repertoires for a time course. It's optional, so leave it out if it doesn't apply.

It just seems to be promoting time point as a "special case" when to me really it is one of many criteria/fields from the Repertoire metadata that you may use to group repertoires.

Why is grouping via time point a more important grouping criterion than disease state or tissue type?

bcorrie commented 2 years ago

But it is an open question on whether RepertoireGroup is a top-level object that has its own ADC entry point, and thus needs a PID? It's maybe possible that RepertoireGroup is embedded within the objects that use it, e.g. DataProcessing.

I don't think our repositories would store RepertoireGroups (and therefore you wouldn't need an API to look them up), but I could see how you might want to return a RepertoireGroup as an object. For example, the /airr/v1/repertoire end point could be asked to return a Repertoire JSON or a RepertoireGroup JSON object. The latter would provide a concise list of the Repertoire objects that your query produced rather than the JSON for all of the metadata for those Repertoires.

schristley commented 2 years ago

From an ADC perspective, repertoire_ids are typically unique within a given context (in the ADC they are unique within a repository). It seems useful to add a repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit, in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that is grouping repertoires from multiple repositories.

No, I would not want the implicit definition, because it would be very useful (and possibly a quite common use case) to group repertoires across multiple repositories. I'm actually assuming this use case with VDJServer.

Yes, I agree, same here. When you do a download on the Gateway, you are essentially creating a RepertoireGroup from multiple repositories. I suggested a URL as a mechanism to define the repository because it is easy and we can define that field today and use it. PIDs are challenging to get agreement on how and when - which means we don't have a usable RepertoireGroup until we sort out the harder problem - which is going to be when??? 8-)

Well, you are the one that created this issue and wanted "resolvability" before RepertoireGroups are usable ;-D You are welcome to add a repertoire_group_url to identify the source; I just don't believe that solves the problem of "grouping repertoires from multiple repositories".

IMO, to use RepertoireGroup on the Gateway today, we don't need agreement on the details of PIDs or resolvable IDs. All we need is agreement that the repertoire_id returned by a repository is globally unique, so that tools can assume that (e.g. the Gateway can create a RepertoireGroup with repertoires from multiple repositories and not worry that the repertoire_id will conflict).
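To illustrate what that assumption buys a tool like the Gateway, here is a minimal sketch that merges repertoires from several repositories into one RepertoireGroup and fails loudly if two repositories ever return the same repertoire_id. The function name build_group and the shape of its per-repository input are hypothetical; only repertoire_group_id, repertoires, and repertoire_id come from the discussion.

```python
def build_group(group_id, repertoires_by_repository):
    """Merge repertoires from several repositories into one RepertoireGroup dict.

    repertoires_by_repository maps a repository URL to the list of Repertoire
    dicts that repository returned (hypothetical input shape).
    """
    seen = {}      # repertoire_id -> repository URL where it was first seen
    members = []
    for repo_url, repertoires in repertoires_by_repository.items():
        for rep in repertoires:
            rid = rep['repertoire_id']
            if rid in seen and seen[rid] != repo_url:
                # The global-uniqueness assumption is violated.
                raise ValueError(
                    f"repertoire_id {rid!r} returned by both {seen[rid]} and {repo_url}"
                )
            if rid not in seen:
                seen[rid] = repo_url
                members.append({'repertoire_id': rid})
    return {'repertoire_group_id': group_id, 'repertoires': members}
```

If repositories do guarantee global uniqueness, the check never fires and the group can be used without recording where each repertoire came from.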

schristley commented 2 years ago

Do we really want a TimePoint object in the RepertoireGroup? This seems to imply that RepertoireGroups are targeted at data over multiple time points. Although that might be a common use case, it is certainly not the only one that makes sense...

Yes so RepertoireGroup can be re-used for both: (1) a set of repertoires, and (2) a sequence of repertoires for a time course. It's optional, so leave it out if it doesn't apply.

It just seems to be promoting time point as a "special case" when to me really it is one of many criteria/fields from the Repertoire metadata that you may use to group repertoires.

Why is grouping via time point a more important grouping criterion than disease state or tissue type?

The difference is that I don't consider time course to be a grouping criterion like disease state or tissue type. In those examples, you are referring to a property of Repertoire (like disease_state_sample or tissue). The time course is an attribute on the group itself: it says the group is a sequence of repertoires (versus a set of repertoires) and thus the order of repertoires has importance. But whatever, that's discussing semantics, which tends to get us nowhere, because we have different viewpoints, so let's talk specific changes instead.

Do you want to eliminate the time point from RepertoireGroup? If so, then we still need a mechanism to describe a sequence of repertoires which can have user-defined labels; IMO that can be used to specify generic ordering but also time course data. What is your suggestion for how to specify that in RepertoireGroup?

scharch commented 2 years ago

I don't think our repositories would store RepertoireGroups (and therefore you wouldn't need an API to look them up), but I could see how you might want to return a RepertoireGroup as an object. For example, the /airr/v1/repertoire end point could be asked to return a Repertoire JSON or a RepertoireGroup JSON object.

So then the results of the analysis on that RepertoireGroup couldn't be returned to the ADC?

In theory, this should be cheap, as it will contain neither Rearrangements nor Cells --a RepertoireGroup is really just a query, in some sense. On the other hand, putting it that way makes it seem like a separate endpoint might be redundant (querying for queries)...?

schristley commented 2 years ago

I don't think our repositories would store RepertoireGroups (and therefore you wouldn't need an API to look them up), but I could see how you might want to return a RepertoireGroup as an object. For example, the /airr/v1/repertoire end point could be asked to return a Repertoire JSON or a RepertoireGroup JSON object.

So then the results of the analysis on that RepertoireGroup couldn't be returned to the ADC?

In theory, this should be cheap, as it will contain neither Rearrangements nor Cells --a RepertoireGroup is really just a query, in some sense. On the other hand, putting it that way makes it seem like a separate endpoint might be redundant (querying for queries)...?

VDJServer will store RepertoireGroups just like it currently stores "groups" that users define for comparative analysis. We will just switch from our internal group format to AIRR RepertoireGroup. These are nice provenance objects about the cohorts, which can be hard to reverse-engineer from SRA records. Though, I think having a query end point for them is a bit overkill, more likely they can be returned as additional objects with Repertoire or DataProcessing. Likewise, creating a RepertoireGroup to "group" the repertoires returned from an ADC query is a convenience, but you are only saving yourself one line of code, something like this:

rg['repertoires'] = [ { 'repertoire_id': rep['repertoire_id'] } for rep in data['Repertoire'] ]

Right now, none of the studies in the ADC have cohorts defined as RepertoireGroups. That would be nice to have though.

bcorrie commented 7 months ago

I just revisited the current definition and the discussion above. Based on all of that, I think the RepertoireGroup that we have now seems fairly good for what was its original intent. For example, given the discussion in #548 I could see creating a RepertoireGroup that encompassed the entire download from the iReceptor Gateway.

This manifest:

{
    "Info": {
        "title": "AIRR Manifest",
        "version": "3.0",
        "description": "List of files for each repository",
        "contact": {
            "name": "iReceptor Gateway",
            "url": "https://gateway.ireceptor.org",
            "email": "support@ireceptor.org"
        }
    },
    "DataSets": [
        {
            "dataset_name": "VDJServer",
            "repository_url": "https://vdjserver.org/airr/v1/",
            "dataset": [
              { "data_type" : "Repertoire", "data_files" : [ "vdjserver-metadata.json"] },
              { "data_type" : "Rearrangement", "data_files" : ["vdjserver.tsv"] }
            ]
        },
        {
            "dataset_name": "VDJbase",
            "repository_url": "https://airr-seq.vdjbase.org/airr/v1/",
            "dataset": [
              { "data_type" : "Repertoire", "data_files" : [ "vdjbase-metadata.json"] },
              { "data_type" : "Rearrangement", "data_files" : ["vdjbase.tsv"] }
            ]
        },
        {
            "dataset_name": "COVID 19-1",
            "repository_url": "https://covid19-1.ireceptor.org/airr/v1/",
            "dataset": [
              { "data_type" : "Repertoire", "data_files" : [ "airr-covid-19-metadata.json"] },
              { "data_type" : "Rearrangement", "data_files" : ["airr-covid-19.tsv"] }
            ]
        }
    ]
}

Could have a RepertoireGroup that looked like this:

{
  "repertoire_group_id" : "globally unique id 1",
  # Info about the download and when it happened
  "repertoire_group_name" : "iReceptor download - bcorrie - 2024-02-09",
  # Human readable description of Repertoire filters used to generate the data.
  # We do this for download today already
  "repertoire_group_description" : "Diagnosis (Ontology ID): DOID:9952 or DOID:9119, PCR target: TRA",
  "repertoires": [
    {"repertoire_id": "8178786653546746346-242ac113-0001-012"},
    {"repertoire_id": "8196138321422586346-242ac113-0001-012"},
    {"repertoire_id": "8213318190606586346-242ac113-0001-012"}

[Stuff deleted from other repertoire (this could be quite long)]

  ]
}

I could also see having a RepertoireGroup per repository, so part of my manifest might be:

    "DataSets": [
        {
            "dataset_name": "VDJServer",
            "repository_url": "https://vdjserver.org/airr/v1/",
            "dataset": [
              { "data_type" : "RepertoireGroup", "data_files" : [ "vdjserver-repertoire-group.json"] },
              { "data_type" : "Repertoire", "data_files" : [ "vdjserver-metadata.json"] },
              { "data_type" : "Rearrangement", "data_files" : ["vdjserver.tsv"] }
            ]
        },

bcorrie commented 7 months ago

I could also see how one could create two RepertoireGroup files that essentially sliced a download Manifest across an orthogonal variable such as disease_diagnosis (e.g. DOID:9952 and DOID:9119). The download contains data from three repositories across both conditions, the RepertoireGroup splits the repertoire_ids into two groups based on DOID. That split is the basis for a comparative analysis across disease conditions and the downstream analysis tool can use that to extract the data appropriately (across all three repositories) to perform the analysis. Nice...
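A minimal sketch of that split, assuming the downloaded Repertoire metadata carries the diagnosis under subject.diagnosis[].disease_diagnosis.id (an assumption about the metadata layout, not something stated in this thread). The function name and the placeholder group IDs are hypothetical.

```python
from collections import defaultdict

def split_by_diagnosis(repertoires, wanted_ids):
    """Return one RepertoireGroup dict per disease ontology ID in wanted_ids."""
    members = defaultdict(list)
    for rep in repertoires:
        # Assumed layout: subject.diagnosis[].disease_diagnosis.id
        diagnoses = rep.get('subject', {}).get('diagnosis', [])
        ids = {(d.get('disease_diagnosis') or {}).get('id') for d in diagnoses}
        for doid in wanted_ids:
            if doid in ids:
                members[doid].append({'repertoire_id': rep['repertoire_id']})
    return [
        {
            'repertoire_group_id': f'group-{doid}',  # placeholder, not a real PID
            'repertoire_group_description': f'Diagnosis (Ontology ID): {doid}',
            'repertoires': members[doid],
        }
        for doid in wanted_ids
    ]
```

Calling it with the merged metadata from all three repositories and ['DOID:9952', 'DOID:9119'] would give the two groups, regardless of which repository each repertoire came from.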

bcorrie commented 7 months ago

@scharch @schristley I think we are done 8-)

bcorrie commented 7 months ago

@scharch @schristley I think we are done 8-)

Tongue in cheek comment, I didn't really mean that I think we are done... I am re-opening this issue - sorry for the confusion.

bcorrie commented 7 months ago

@scharch in all seriousness, this meets my expectations/thoughts on what one might want from a RepertoireGroup.

I am not so sure on your use case and what you are thinking about this...

bcorrie commented 7 months ago

Brainstorming a pretty complex use case that combines Manifest and RepertoireGroup as we might use it on the iReceptor Gateway in an Analysis App.

A file that describes two RepertoireGroups for comparative analysis (COVID vs Healthy Controls), TRA locus only:

{
"Info": STUFF OMITTED,
"RepertoireGroup":
[
  {
    "repertoire_group_id" : "globally unique id 1",
    "repertoire_group_name" : "COVID-19 cohort (TRA)",
    "repertoire_group_description" : "Diagnosis (Ontology ID): DOID:0080600, PCR target: TRA",
    "repertoires": [
       {"repertoire_id": "8178786653546746346-242ac113-0001-012"},
       [169 Repertoires in Total from ADC]
    ]
  },
  {
    "repertoire_group_id" : "globally unique id 2",
    "repertoire_group_name" : "Control (Healthy) cohort (TRA)",
    "repertoire_group_description" : "Study Group: Control (Healthy), PCR target: TRA",
    "repertoires": [
       {"repertoire_id": "8178786653546746781-242ac113-0001-012"},
       [9 Repertoires in Total from ADC]
    ]
  }
] # End RepertoireGroup
} # End file

Combine that with a manifest from a download like this:

    "DataSets": [
        {
            "dataset_name": "VDJServer",
            "repository_url": "https://vdjserver.org/airr/v1/",
            "dataset": [
              { "data_type" : "RepertoireGroup", "data_files" : [ "vdjserver-repertoire-group.json"] },
              { "data_type" : "Repertoire", "data_files" : [ "vdjserver-metadata.json"] },
              { "data_type" : "Rearrangement", "data_files" : ["vdjserver.tsv"] }
            ]
        },
        {
            "dataset_name": "IPA1",
            "repository_url": "https://ipa1.ireceptor.org/airr/v1/",
            "dataset": [
              { "data_type" : "RepertoireGroup", "data_files" : [ "repertoire-group.json"] },
              { "data_type" : "Repertoire", "data_files" : [ "ipa1-metadata.json"] },
              { "data_type" : "Rearrangement", "data_files" : ["ipa1.tsv"] }
            ]
        },

Assume an analysis tool designed to do comparative analyses on N RepertoireGroups. It expects a RepertoireGroup.json file, a Manifest.json file, and all of the relevant files described in the Manifest.json.

Processing would:

Extract all of the data that has a repertoire_id in RepertoireGroup "globally unique id 1" from vdjserver.tsv and ipa1.tsv into a covid19.tsv file
Extract all of the data that has a repertoire_id in RepertoireGroup "globally unique id 2" from vdjserver.tsv and ipa1.tsv into a healthy.tsv file
Run the comparative analysis tool to compare covid19.tsv and healthy.tsv
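The two extraction steps above can be sketched as follows. This is only a sketch: it assumes every input TSV has the same columns, including a repertoire_id column, and the function name extract_group is hypothetical.

```python
import csv

def extract_group(group, tsv_paths, out_path):
    """Copy every row whose repertoire_id is in the group into out_path.

    Assumes all input TSVs share the same header and have a repertoire_id
    column. Returns the number of rows written.
    """
    wanted = {r['repertoire_id'] for r in group['repertoires']}
    written = 0
    writer = None
    with open(out_path, 'w', newline='') as out:
        for path in tsv_paths:
            with open(path, newline='') as src:
                reader = csv.DictReader(src, delimiter='\t')
                if writer is None:
                    # Reuse the header of the first input file for the output.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                                            delimiter='\t', lineterminator='\n')
                    writer.writeheader()
                for row in reader:
                    if row['repertoire_id'] in wanted:
                        writer.writerow(row)
                        written += 1
    return written
```

Calling it once per group, e.g. extract_group(covid_group, ['vdjserver.tsv', 'ipa1.tsv'], 'covid19.tsv') and again with the healthy group and 'healthy.tsv', would produce the two inputs for the comparative analysis tool.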

bcorrie commented 7 months ago

Here is another interesting one - real study, with real data - Go to the Gateway and search for TRA and PRJCA002413:

{
"Info": STUFF OMITTED,
"RepertoireGroup":
[
  {
    "repertoire_group_id" : "globally unique id 1",
    "repertoire_group_name" : "Early Recovery Cohort",
    "repertoire_group_description" : "study_id: PRJCA002413, subject_id: ERS, disease_diagnosis: DOID:0080600, ",
    "repertoires": [
       {"repertoire_id": "PRJCA002413-ERS1-TRA"},
       {"repertoire_id": "PRJCA002413-ERS2-TRA"},
       {"repertoire_id": "PRJCA002413-ERS3-TRA"},
       {"repertoire_id": "PRJCA002413-ERS4-TRA"},
       {"repertoire_id": "PRJCA002413-ERS5-TRA"}
    ]
  },
  {
    "repertoire_group_id" : "globally unique id 2",
    "repertoire_group_name" : "Late Recovery Cohort",
    "repertoire_group_description" : "study_id: PRJCA002413, subject_id: LRS, disease_diagnosis: DOID:0080600, ",
    "repertoires": [
       {"repertoire_id": "PRJCA002413-LRS1-TRA"},
       {"repertoire_id": "PRJCA002413-LRS2-TRA"},
       {"repertoire_id": "PRJCA002413-LRS3-TRA"},
       {"repertoire_id": "PRJCA002413-LRS4-TRA"},
       {"repertoire_id": "PRJCA002413-LRS5-TRA"}
    ]
  },
  {
    "repertoire_group_id" : "globally unique id 3",
    "repertoire_group_name" : "Control (Healthy) Cohort",
    "repertoire_group_description" : "study_id: PRJCA002413, subject_id: HC, disease_diagnosis: DOID:0080600, ",
    "repertoires": [
       {"repertoire_id": "PRJCA002413-Healthy_Control_1-TRA"},
       {"repertoire_id": "PRJCA002413-Healthy_Control_2-TRA"},
       {"repertoire_id": "PRJCA002413-Healthy_Control_3-TRA"},
       {"repertoire_id": "PRJCA002413-Healthy_Control_4-TRA"},
       {"repertoire_id": "PRJCA002413-Healthy_Control_5-TRA"}
    ]
  }
] # End RepertoireGroup
} # End file

With manifest as below (this is essentially what you get from the Gateway if you were to download this data):

    "DataSets": [
        {
            "dataset_name": "VDJServer",
            "repository_url": "https://covid19-1.ireceptor.org/airr/v1/",
            "dataset": [
              { "data_type" : "RepertoireGroup", "data_files" : [ "PRJCA002413-repertoire-group.json"] },
              { "data_type" : "Repertoire", "data_files" : [ "covid19-1-metadata.json"] },
              { "data_type" : "Rearrangement", "data_files" : ["covid19-1.tsv"] }
            ]
        }

The RepertoireGroup file says there are three cohorts being considered.

The Manifest file says that there is one source DataSet to consider.

Assuming the analysis is the same as the previous scenario, the analysis then splits covid19-1.tsv into three separate files, one each for ERS, LRS, and HC, and then runs a comparative analysis across those three RepertoireGroups.

Exactly the same data but a RepertoireGroup file that only contains LRS and ERS gives you a late recovery and early recovery comparison (ignores healthy controls).

Exactly the same data, but with five RepertoireGroups in the file, each with one repertoire per group for each of ERS1, ERS2, ERS3, ERS4, ERS5, gives you a comparative analysis across the five different early recovery subjects (ignores all of the LRS and HC repertoires).

This seems pretty powerful to me... We can basically use the RepertoireGroup file to slice a given data set in any way that we want (at the repertoire_id level at least).

scharch commented 7 months ago

The big thing that's missing is the ability to extract and combine parts of Repertoires. What if I want to analyze only IgG? Or only CD27+CD21- activated B cells from a CITEseq data set or only RBD-binding B cells from a LIBRAseq data set?

I think this may be related to differences in our conceptions of what RepertoireGroup is for. In your examples, each cohort is a RepertoireGroup and the analysis is done between them. @schristley has expressed a similar view point here. But to me, one of the main points of RepertoireGroup is to be the fundamental unit of analysis, as I've advocated for a while.

scharch commented 7 months ago

Should RepertoireGroup include a study_id? It wouldn't be relevant to an ADC download, but a meta-analysis would be a new Study, right?

bcorrie commented 7 months ago

The big thing that's missing is the ability to extract and combine parts of Repertoires. What if I want to analyze only IgG? Or only CD27+CD21- activated B cells from a CITEseq data set or only RBD-binding B cells from a LIBRAseq data set?

I think this may be related to differences in our conceptions of what RepertoireGroup is for. In your examples, each cohort is a RepertoireGroup and the analysis is done between them. @schristley has expressed a similar view point here. But to me, one of the main points of RepertoireGroup is to be the fundamental unit of analysis, as I've advocated for a while.

To do that, would it suffice to have a repertoire_id, data_processing_id, sample_processing_id to specify a part of a Repertoire? Or are you thinking of an even lower level of granularity than that? Do the different data processing and sample processing of the samples/data give you what you are looking for?

  {
    "repertoire_group_id" : "globally unique id 1",
    "repertoire_group_name" : "Early Recovery Cohort",
    "repertoire_group_description" : "study_id: PRJCA002413, subject_id: ERS, disease_diagnosis: DOID:0080600, ",
    "repertoires": [
       {"repertoire_id": "PRJCA002413-ERS1", "sample_processing_id":"XXX", "data_processing_id":"YYY"},
       {"repertoire_id": "PRJCA002413-ERS2", "sample_processing_id":"XXX", "data_processing_id":"YYY"}
    ]
  }

scharch commented 7 months ago

No, because I might want arbitrary filter criteria that don't align with the original researcher's sample/data processing. E.g.:

Only IGHV4-family derived Rearrangements
Only T Cells identified as TEMRA by CellTypist
Only B Cells with non-zero CD11c in CellExpression 8-)

bcorrie commented 7 months ago

OK, now that is complicated... and I was afraid that was going to be your answer 8-)

Any ideas how to describe that???

bcorrie commented 7 months ago

And are we "simply" describing what is in the data set?

Or are we describing what one needs to extract from the data set? That is the TSV file contains all rearrangements and we say:

v_family = IGHV-4

to describe what needs to be extracted?

bcorrie commented 7 months ago

We could probably use an ADC query to describe what is in the data set???

scharch commented 7 months ago

We could probably use an ADC query to describe what is in the data set???

I was thinking along those lines, yes, but ultimately it probably needs to be free text.

And are we "simply" describing what is in the data set? Or are we describing what one needs to extract from the data set?

Again, ideally the latter. But practically, probably the former.

schristley commented 7 months ago

The big thing that's missing is the ability to extract and combine parts of Repertoires. What if I want to analyze only IgG? Or only CD27+CD21- activated B cells from a CITEseq data set or only RBD-binding B cells from a LIBRAseq data set?

I think this may be related to differences in our conceptions of what RepertoireGroup is for. In your examples, each cohort is a RepertoireGroup and the analysis is done between them. @schristley has expressed a similar view point here. But to me, one of the main points of RepertoireGroup is to be the fundamental unit of analysis, as I've advocated for a while.

@scharch Yes, and I agree with your advocacy! The first pass of RepertoireGroup was purposefully simple: group repertoires together (as a whole) so you can do computations both intra- and inter-group. This works great for its purpose, and I got it integrated into the VDJServer pipelines.

Now I say, "as a whole", but I'm lying. I'm already doing some filtering because I almost always want just productive rearrangements, so I filter out the nonproductives. I've always had it in my mind to extend that filtering capability, and have delayed doing anything, but I increasingly need it.

My anecdote? In a collaboration, I'm asked, can you give them those stats (gene usage, etc) for just IGHV4? No problem, I'll write a one-off script to pull out the rearrangements and voilà. Oh, if it isn't hard, give it to me for all VH families. Then two days later, can you give me IGHV4 but also separated by J family? You see where this is going... Before you know it, my one-off script is multiple scripts with multiple parameters. I have a multi-TB analysis tree with dozens of directories where the rearrangements have been duplicated numerous times... Exactly what I didn't want!

We could probably use an ADC query to describe what is in the data set???

That's exactly what I was thinking. When I started thinking about how to generalize filtering, what makes sense is to use the ADC query language to describe the filter instead of inventing something new. This isn't too hard to implement so long as you are not that concerned about efficiency, you compute the expression tree implied by the ADC query on each rearrangement record, and it comes out true or false.
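A minimal sketch of that evaluation, assuming the filter uses the ADC query shape of nested op/content objects and that record values have already been parsed into native Python types (e.g. productive as a bool rather than the TSV string). Only a subset of operators is handled here; the ADC query language defines more. The function name matches is hypothetical.

```python
import operator

# Subset of ADC query comparison operators; others would need their own handling.
_COMPARE = {
    '=': operator.eq, '!=': operator.ne,
    '<': operator.lt, '<=': operator.le,
    '>': operator.gt, '>=': operator.ge,
}

def matches(query, record):
    """Return True if record (a dict of field -> value) satisfies the query."""
    op, content = query['op'], query['content']
    if op == 'and':
        return all(matches(q, record) for q in content)
    if op == 'or':
        return any(matches(q, record) for q in content)
    value = record.get(content['field'])
    if op == 'in':
        return value in content['value']
    if op in _COMPARE:
        # In this sketch a missing field never matches a comparison.
        return value is not None and _COMPARE[op](value, content['value'])
    raise ValueError(f'unsupported operator: {op}')
```

Filtering a rearrangement file then reduces to keeping the rows for which matches(query, row) is true, which is simple but evaluates the whole tree once per record.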

But, and this is a big butt (sorry ;-), I've only been thinking in the context of filtering Rearrangements. If we want a more general filtering which works on Clones and Cells and whatever, that starts getting more complicated. Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that cross types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated.

schristley commented 7 months ago

We could probably use an ADC query to describe what is in the data set???

I was thinking along those lines, yes, but ultimately it probably needs to be free text.

That's almost worthless honestly. If that is all you want, stick it in repertoire_group_description and be done. The useful thing is a precise definition so that an analysis tool can do it for you.

One idea is try not to do everything in one chunk. We have the basic RepertoireGroup now, what's a simple but useful way to add filtering, even if it doesn't give us everything?

schristley commented 7 months ago

To do that, would it suffice to have a repertoire_id, data_processing_id, sample_processing_id to specify a part of a Repertoire?

No, but actually we need this anyways. The simple repertoire_id isn't sufficient. It works now because we mostly limit ourselves to a single data processing.

scharch commented 7 months ago

That's almost worthless honestly. If that is all you want, stick it in repertoire_group_description and be done. The useful thing is a precise definition so that an analysis tool can do it for you.

I agree with the principle, of course. But I was thinking more about:

Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that cross types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated.

I think that the "right" answer is probably a DataProcessing-type object, although I'm not sure that gets all the way to the rigor you are looking for.

schristley commented 7 months ago

Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that cross types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated.

I think that the "right" answer is probably a DataProcessing-type object, although I'm not sure that gets all the way to the rigor you are looking for.

If we don't try to do everything but think that allowing filters on object types would be useful, then a simple enhancement to RepertoireGroup should be able to provide this. Like with the manifest, we could add a type-keyed filters field, which provides the filter for each object type in ADC query format. For example, here is a filter on Rearrangement indicating to include productive records for the VH4 family:

RepertoireGroup:
  - repertoire_group_id: tumor
    repertoires:
      - repertoire_id: 33983-3T-30_S14
      - repertoire_id: 32277-3T-36_S20
      - repertoire_id: 30174-2T-29_S13
    filters:
      Rearrangement:
        op: and
        content:
          - op: =
            content:
              field: productive
              value: true
          - op: =
            content:
              field: v_subgroup
              value: IGHV4

And the type key allows us to specify more, so here's the same example but also providing a filter on clones.

RepertoireGroup:
  - repertoire_group_id: tumor
    repertoires:
      - repertoire_id: 33983-3T-30_S14
      - repertoire_id: 32277-3T-36_S20
      - repertoire_id: 30174-2T-29_S13
    filters:
      Rearrangement:
        op: and
        content:
          - op: "="
            content:
              field: productive
              value: true
          - op: "="
            content:
              field: v_subgroup
              value: IGHV4
      Clone:
        op: "="
        content:
          field: v_subgroup
          value: IGHV4

At least for my test case, this would be highly useful. I could create a group for each combination of V and J families that I need. Yes, that would be a lot of groups, but I could then run repcalc once: it would process all of the groups and generate output files for each of them, and I would never have to duplicate the rearrangement files.
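To make the proposal concrete, here is a minimal Python sketch of how a tool might apply the type-keyed filters to records it has already loaded. This is not how repcalc or any ADC repository implements filtering; the function names `matches` and `apply_group_filters` are hypothetical, only a subset of the ADC operators is handled, and the field names follow the example above.

```python
import operator

# Comparison operators from the ADC query format that this sketch supports.
_COMPARE = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(record, flt):
    """Return True if a record (a dict) satisfies an ADC-style filter."""
    op, content = flt["op"], flt["content"]
    if op == "and":
        return all(matches(record, sub) for sub in content)
    if op == "or":
        return any(matches(record, sub) for sub in content)
    value = record.get(content["field"])
    if op == "in":
        return value in content["value"]
    if op in _COMPARE:
        # Assumption of this sketch: a missing field never matches.
        return value is not None and _COMPARE[op](value, content["value"])
    raise ValueError(f"unsupported op: {op}")


def apply_group_filters(group, records_by_type):
    """Filter each object type's records with the group's type-keyed filters.

    Types without a filter in the group are passed through unchanged.
    """
    filters = group.get("filters", {})
    return {
        type_name: [
            rec for rec in records
            if type_name not in filters or matches(rec, filters[type_name])
        ]
        for type_name, records in records_by_type.items()
    }


group = {
    "repertoire_group_id": "tumor",
    "filters": {
        "Rearrangement": {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "productive", "value": True}},
                {"op": "=", "content": {"field": "v_subgroup", "value": "IGHV4"}},
            ],
        }
    },
}

# Made-up records, for illustration only.
rearrangements = [
    {"sequence_id": "a", "productive": True, "v_subgroup": "IGHV4"},
    {"sequence_id": "b", "productive": False, "v_subgroup": "IGHV4"},
    {"sequence_id": "c", "productive": True, "v_subgroup": "IGHV3"},
]

kept = apply_group_filters(group, {"Rearrangement": rearrangements})
print([rec["sequence_id"] for rec in kept["Rearrangement"]])  # prints ['a']
```

Because the filter lives in the group definition, the same rearrangement file can serve many groups without being copied.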

schristley commented 7 months ago

Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that cross types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated.

I think that the "right" answer is probably a DataProcessing-type object, although I'm not sure that gets all the way to the rigor you are looking for.

Doing something like "give me the Rearrangements that are in a Cell where the CellExpression of gene ABC > 5" would require a query language with power like SQL. I don't see that happening. In that case, yes, you'd likely write a custom script and describe its use in DataProcessing. As with my use case above, I would still use RepertoireGroup, but my tool would take as input the post-filtered rearrangements output by the custom script instead of the raw rearrangement files.

scharch commented 7 months ago

Doing something like "give me the Rearrangements that are in a Cell where the CellExpression of gene ABC > 5" would require a query language with power like SQL. I don't see that happening.

Yeah, this is where I was coming from, but there's definitely plenty of utility in simpler cases. I want to sleep on it a bit, but what you've proposed above seems pretty good to me at first blush.

scharch commented 4 months ago

If we don't try to do everything but think that allowing filters on object types would be useful, then a simple enhancement to RepertoireGroup should be able to provide this. Like with the manifest, we could add a type-keyed filters field, which provides the filter in ADC query format.

@schristley I'm thinking maybe the filters should be per Repertoire, not necessarily over the whole RepertoireGroup. What do you think about that? More generally, how should the filters be described in the schema? I guess we'll need to provide an explicit list of type keys like we have for DataFile, but in theory op and content are arbitrarily recursive, so ...???

schristley commented 4 months ago

@schristley I'm thinking maybe the filters should be per Repertoire, not necessarily over the whole RepertoireGroup. What do you think about that?

Sure, just put the filters with each repertoire entry. I wouldn't want to get rid of the filters at the RepertoireGroup level though because that's convenient (and less prone to error) if you want the same filter applied to all repertoires.
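The thread does not settle how a group-level filter and a per-repertoire filter should interact when both are present. The sketch below shows one possible rule, under the assumption (mine, not something agreed in this discussion) that a record must satisfy both, so the two filters are joined with an ADC-style "and". The function name `effective_filters` is hypothetical.

```python
def effective_filters(group, repertoire_entry):
    """Combine group-level and per-repertoire filters for one repertoire.

    Assumption: when both levels define a filter for the same object type,
    a record must satisfy both, so they are wrapped in an "and".
    """
    group_filters = group.get("filters", {})
    rep_filters = repertoire_entry.get("filters", {})
    combined = {}
    for type_name in sorted(set(group_filters) | set(rep_filters)):
        parts = [
            level[type_name]
            for level in (group_filters, rep_filters)
            if type_name in level
        ]
        if len(parts) == 1:
            combined[type_name] = parts[0]
        else:
            combined[type_name] = {"op": "and", "content": parts}
    return combined


productive = {"op": "=", "content": {"field": "productive", "value": True}}
ighv4 = {"op": "=", "content": {"field": "v_subgroup", "value": "IGHV4"}}

group = {
    "repertoire_group_id": "tumor",
    "filters": {"Rearrangement": productive},
    "repertoires": [
        {"repertoire_id": "33983-3T-30_S14", "filters": {"Rearrangement": ighv4}},
        {"repertoire_id": "32277-3T-36_S20"},
    ],
}

for entry in group["repertoires"]:
    print(entry["repertoire_id"], effective_filters(group, entry))
```

An alternative rule would let the per-repertoire filter replace the group-level one; whichever is chosen, the schema documentation would need to state it.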

More generally, how should the filters be described in the schema? I guess we'll need to provide an explicit list of type keys like we have for DataFile, but in theory op and content are arbitrarily recursive, so ...???

Not easily. We don't try to do that with the ADC API and instead just describe it in the docs.
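Even if the recursion stays documented rather than fully captured in the schema, a tool can still check filters programmatically. Here is a small Python sketch of a recursive check; `validate_filter` is a hypothetical helper, and the operator sets below cover only a subset of the ADC query operators.

```python
# Subset of ADC query operators handled by this sketch.
LOGICAL_OPS = {"and", "or"}
FIELD_OPS = {"=", "!=", "<", "<=", ">", ">=", "in"}


def validate_filter(flt, path="filters"):
    """Return a list of problems found in an ADC-style filter (empty if none)."""
    if not isinstance(flt, dict) or "op" not in flt or "content" not in flt:
        return [f"{path}: expected an object with 'op' and 'content'"]
    op, content = flt["op"], flt["content"]
    if op in LOGICAL_OPS:
        # Logical operators hold a list of sub-filters, checked recursively.
        if not isinstance(content, list) or not content:
            return [f"{path}: '{op}' needs a non-empty list of sub-filters"]
        problems = []
        for i, sub in enumerate(content):
            problems.extend(validate_filter(sub, f"{path}.content[{i}]"))
        return problems
    if op in FIELD_OPS:
        # Field operators are the leaves of the recursion.
        if not isinstance(content, dict) or "field" not in content or "value" not in content:
            return [f"{path}: '{op}' needs 'field' and 'value'"]
        if op == "in" and not isinstance(content["value"], list):
            return [f"{path}: 'in' needs a list value"]
        return []
    return [f"{path}: unknown op {op!r}"]


good = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "productive", "value": True}},
        {"op": "in", "content": {"field": "v_subgroup", "value": ["IGHV4"]}},
    ],
}
bad = {"op": "and", "content": [{"op": "~", "content": {}}]}

print(validate_filter(good))
print(validate_filter(bad))
```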

bcorrie commented 4 months ago

Doing something like "give me the Rearrangements that are in a Cell where the CellExpression of gene ABC > 5" would require a query language with power like SQL. I don't see that happening.

Yeah, this is where I was coming from, but there's definitely plenty of utility in simpler cases. I want to sleep on it a bit, but what you've proposed above seems pretty good to me at first blush.

As @schristley says, doing this as a single query in the ADC is unlikely. That said, it is very possible with multiple queries across the ADC endpoints. In fact, the above query is exactly what the iReceptor Gateway Cell page currently does.

This is a search for Expression of TRBV4-1 > 10 across all T-cell repertoires in a specific study. If you click Download you get all the Rearrangements, Cells, and Expression data.

So the ADC queries are powerful, you "just" can't do complicated joins across the collections. The iReceptor Gateway does 1 repertoire, 12 expression, 10 cell, and 10 rearrangement queries to gather the data presented in the page below 8-)

bcorrie commented 4 months ago

With that said, finding all the Cells with that expression level is one query:

curl -s -d @query2.json https://covid19-1.ireceptor.org/airr/v1/expression

where query2.json is:

{
    "filters": {
        "op": "and",
        "content": [
            {
                "op": "=",
                "content": {
                    "field": "property.label",
                    "value": "TRBV4-1"
                }
            },
            {
                "op": ">",
                "content": {
                    "field": "value",
                    "value": 10
                }
            },
            {
                "op": "in",
                "content": {
                    "field": "repertoire_id",
                    "value": [
                        "PRJCA002413-ERS1-TR-CELL",
                        "PRJCA002413-ERS2-TR-CELL",
                        "PRJCA002413-ERS3-TR-CELL",
                        "PRJCA002413-ERS4-TR-CELL",
                        "PRJCA002413-ERS5-TR-CELL",
                        "PRJCA002413-LRS1-TR-CELL",
                        "PRJCA002413-LRS2-TR-CELL",
                        "PRJCA002413-LRS3-TR-CELL",
                        "PRJCA002413-LRS4-TR-CELL",
                        "PRJCA002413-LRS5-TR-CELL",
                        "PRJCA002413-Healthy_Control_1-TR-CELL",
                        "PRJCA002413-Healthy_Control_2-TR-CELL",
                        "PRJCA002413-Healthy_Control_3-TR-CELL",
                        "PRJCA002413-Healthy_Control_4-TR-CELL",
                        "PRJCA002413-Healthy_Control_5-TR-CELL"
                    ]
                }
            }
        ]
    }
}
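As a rough illustration of the multi-query approach, the Python sketch below takes the records returned by an expression query like the one above and builds one follow-up Rearrangement query per repertoire. It is not the iReceptor Gateway's code. It assumes each expression record carries `repertoire_id` and `cell_id`, the function name `rearrangement_queries` is hypothetical, and the sample records are made up.

```python
import json
from collections import defaultdict


def rearrangement_queries(expression_records):
    """Build one ADC-style Rearrangement query per repertoire.

    Cell IDs are grouped by repertoire because this sketch assumes a
    cell_id is only guaranteed to be unique within its repertoire.
    """
    cells = defaultdict(set)
    for rec in expression_records:
        cells[rec["repertoire_id"]].add(rec["cell_id"])
    return {
        rep_id: {
            "filters": {
                "op": "and",
                "content": [
                    {"op": "=", "content": {"field": "repertoire_id", "value": rep_id}},
                    {"op": "in", "content": {"field": "cell_id", "value": sorted(ids)}},
                ],
            }
        }
        for rep_id, ids in cells.items()
    }


# Made-up expression records, standing in for the response to the query above.
expression_records = [
    {"repertoire_id": "PRJCA002413-ERS1-TR-CELL", "cell_id": "cell-2", "value": 12},
    {"repertoire_id": "PRJCA002413-ERS1-TR-CELL", "cell_id": "cell-1", "value": 15},
    {"repertoire_id": "PRJCA002413-LRS1-TR-CELL", "cell_id": "cell-1", "value": 11},
]

queries = rearrangement_queries(expression_records)
for rep_id, query in queries.items():
    print(rep_id)
    print(json.dumps(query, indent=4))
```

Each generated query could then be saved to a file and sent to the repository's rearrangement endpoint in the same way as the curl command above.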

airr-community / airr-standards

RepertoireGroup refinements #578