Generalization of DataProcessing

javh commented 4 years ago

Originally posted by @schristley in https://github.com/airr-community/airr-standards/pull/294#issuecomment-571331941

I've been thinking a bit about DataProcessing, specifically how we might generalize it. One solution I was exploring has been suggested a few times: use the multiplicity (array) of DataProcessing to describe the different steps. The simplest approach would then be to attach tags to them, like clone assignment, vdj assignment, etc. With a well-established set of tags, you could retrieve the specific processing you want.

However, there is a problem with this design, and it arises downstream from the policy it encourages. Specifically, it will encourage users to use multiple DataProcessing objects to describe different parts/steps of a single processing pipeline/workflow. As soon as they start doing that, people will need to answer questions like: which steps go together? What is the ordering of the steps? And so on. Before you know it, we have to design a workflow language that describes how all of those different steps go together. This is not what we (well, me at least) want to do...

Therefore, I feel we should stick with the original design of DataProcessing which envisions a single object that covers the whole process.

Here we have two choices. Within that single DataProcessing object, we can add additional fields for annotating clonal assignment, lineage trees, and other stuff.

The other choice is to not overload DataProcessing too much, but add directly to Repertoire. That is, we add a separate clone_processing array as a sibling to data_processing. This seems okay to me because it is an immune repertoire specific concept, but we might get into the situation where we annotate more and more types of processing, and this becomes unwieldy. Then, it may be better to encapsulate everything in DataProcessing so Repertoire doesn't become polluted.
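
To make the two choices concrete, here is a rough Python sketch of what a Repertoire might look like under each one. Everything in it is an assumption for illustration: the `vdj_assignment` and `clone_assignment` sub-objects, the `clone_processing_id` key, the tool names, and the parameter strings are hypothetical and not part of the AIRR schema; `clone_processing` is the sibling array proposed above.

```python
# Illustrative only: the nested field names and tool names are hypothetical,
# not taken from the AIRR schema.

# Choice 1: a single DataProcessing object carries extra per-stage fields.
repertoire_single = {
    "repertoire_id": "rep-1",
    "data_processing": [
        {
            "data_processing_id": "dp-1",
            "vdj_assignment": {"tool": "example-aligner", "parameters": "--default"},
            "clone_assignment": {"tool": "example-cloner", "parameters": "--threshold 0.1"},
        }
    ],
}

# Choice 2: clonal processing lives in a sibling array on Repertoire.
repertoire_sibling = {
    "repertoire_id": "rep-1",
    "data_processing": [
        {
            "data_processing_id": "dp-1",
            "vdj_assignment": {"tool": "example-aligner", "parameters": "--default"},
        }
    ],
    "clone_processing": [
        {
            "clone_processing_id": "cp-1",
            "clone_assignment": {"tool": "example-cloner", "parameters": "--threshold 0.1"},
        }
    ],
}


def clone_tools(repertoire):
    """Collect clone-assignment tool names under either layout."""
    tools = []
    # Layout 1: look inside each DataProcessing object.
    for dp in repertoire.get("data_processing", []):
        if "clone_assignment" in dp:
            tools.append(dp["clone_assignment"]["tool"])
    # Layout 2: look inside the sibling array, if present.
    for cp in repertoire.get("clone_processing", []):
        tools.append(cp["clone_assignment"]["tool"])
    return tools
```

Note that a reader of the data has to check both places unless the spec commits to one layout, which is part of why the choice matters.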

schristley commented 4 years ago

I feel we should be pragmatic in our design. Instead of generalizing too much, let's have a list of specific questions (use cases) that we want to be able to answer, then minimally extend DataProcessing to support them. I feel it's okay to be fairly specific to immune repertoire analysis, versus any general bioinformatic workflow. Example questions might be:

What tool(s) and parameter(s) were used to pre-process the raw sequencing data?
What tool(s) and parameter(s) were used to perform V(D)J assignment?
What tool(s) and parameter(s) were used to perform clonal assignment?

bcorrie commented 3 years ago

We are struggling with this as we are starting to load Clone data as well as Rearrangement data. And of course we will also have Cell data soon.

Two simple suggestions:

We could have a simple controlled vocabulary that describes the type of objects in the AIRR spec created from a DataProcessing. So we have one DataProcessing object, but can easily tell which DataProcessing in a Repertoire generated which type of object - Rearrangements, Clones, or Cells. Maybe have a simple processing_produced field that can take on values that represent the thing in the AIRR spec (rearrangements, clones, cells). If we are talking about DataProcessing objects that describe the processing that was used to create a set of objects (e.g. a file) that exist in the AIRR spec (e.g. Rearrangements or Clones), then this captures a large percentage of our use cases I think, no?
We could have a simple "relationship" field between DataProcessing objects within a Repertoire - possibly a parent-child relationship - that would allow us to tell that the Clones created from one DataProcessing are related to the Rearrangements from another DataProcessing. This would in fact allow us to create simple pipelines.
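
A minimal Python sketch of how the two suggestions could work together, under stated assumptions: `processing_produced` is the field name floated above, while `parent_data_processing_id`, the ID values, and both helper functions are hypothetical names made up for this example.

```python
# Hypothetical controlled vocabulary for processing_produced.
ALLOWED_PRODUCED = {"rearrangements", "clones", "cells"}

repertoire = {
    "repertoire_id": "rep-1",
    "data_processing": [
        {
            "data_processing_id": "dp-rearr",
            "processing_produced": "rearrangements",
            "parent_data_processing_id": None,  # hypothetical relationship field
        },
        {
            "data_processing_id": "dp-clone",
            "processing_produced": "clones",
            "parent_data_processing_id": "dp-rearr",
        },
    ],
}


def find_by_produced(repertoire, produced):
    """Return the DataProcessing objects that generated a given object type."""
    if produced not in ALLOWED_PRODUCED:
        raise ValueError(f"unknown processing_produced value: {produced!r}")
    return [
        dp
        for dp in repertoire["data_processing"]
        if dp.get("processing_produced") == produced
    ]


def processing_chain(repertoire, data_processing_id):
    """Follow parent links back to the root; return IDs ordered root first."""
    by_id = {dp["data_processing_id"]: dp for dp in repertoire["data_processing"]}
    chain = []
    seen = set()
    current = data_processing_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parent-child relationships")
        seen.add(current)
        chain.append(current)
        current = by_id[current].get("parent_data_processing_id")
    return list(reversed(chain))
```

With this layout, `find_by_produced(repertoire, "clones")` picks out the clone processing, and `processing_chain(repertoire, "dp-clone")` recovers the simple pipeline that led to it.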

Thoughts?

scharch commented 3 years ago

@bcorrie to make sure I understand: would that be meaningfully different from devolving DataProcessing into RearrangementProcessing, CloneProcessing, etc, as described above?

bcorrie commented 3 years ago

> @bcorrie to make sure I understand: would that be meaningfully different from devolving DataProcessing into RearrangementProcessing, CloneProcessing, etc, as described above?

Not much... The main reason to have two different things would be if the internal fields were substantially different. If the fields were the same, it doesn't make much sense and would probably be a bad idea. In general, it makes sense to have one object if we can - so the question for me would be are the fields general enough across the types of processing?

The downside of having two different objects is that if we want to add a third type of processing, we have to add a third object to the spec - which is very heavy. With one object we just add a third keyword... For example, what do I do if I did some data processing of the single cell data (I want to capture that I used the Cell Ranger pipeline) - how do I capture that? Or how do I capture the data processing for building a clone tree? Or the next cool thing that comes down the pipe 8-)

bcorrie commented 3 years ago

> What tool(s) and parameter(s) were used to pre-process the raw sequencing data?
>
> What tool(s) and parameter(s) were used to perform V(D)J assignment?
>
> What tool(s) and parameter(s) were used to perform clonal assignment?

I would add:

What tool(s) and parameter(s) were used to build clonal lineages?
What tool(s) and parameter(s) were used to produce single cell data?

Both are draft objects in the AIRR spec.

bcorrie commented 3 years ago

In #515 we talked about having keywords for a study (keywords_study) that specifically indicate what types of data are produced in a study. We discussed having a keywords_data field with a controlled vocabulary such as "generated_rearrangements", "generated_clones", "generated_cells", "generated_lineage", etc. Just wanted to capture the discussion we had at the Standards meeting here so we don't lose track: https://github.com/airr-community/airr-standards/wiki/Minutes_Standards_2021-05

BTW, I just invented the above keywords - we did not discuss the specifics, we left that to this issue...
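
For illustration, a short Python sketch of how such a controlled vocabulary might be checked and queried. The keyword values are the placeholder ones from the comment above, not agreed values; the `keywords_data` key on a study, the study IDs, and both helper functions are assumptions made for this example.

```python
# Placeholder vocabulary from the discussion; not an agreed AIRR enum.
KEYWORDS_DATA_VOCABULARY = frozenset(
    {
        "generated_rearrangements",
        "generated_clones",
        "generated_cells",
        "generated_lineage",
    }
)


def invalid_keywords(study):
    """Return any keywords_data values that fall outside the vocabulary, sorted."""
    return sorted(set(study.get("keywords_data", [])) - KEYWORDS_DATA_VOCABULARY)


def studies_with(studies, keyword):
    """Return the IDs of studies that declare a given keyword."""
    if keyword not in KEYWORDS_DATA_VOCABULARY:
        raise ValueError(f"unknown keyword: {keyword!r}")
    return [s["study_id"] for s in studies if keyword in s.get("keywords_data", [])]


# Hypothetical study records.
studies = [
    {"study_id": "S1", "keywords_data": ["generated_rearrangements", "generated_clones"]},
    {"study_id": "S2", "keywords_data": ["generated_cells"]},
    {"study_id": "S3", "keywords_data": ["generated_trees"]},  # not in the vocabulary
]
```

A repository could then answer "which studies have clone data?" with `studies_with(studies, "generated_clones")` without inspecting any DataProcessing objects.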

bcorrie commented 9 months ago

I think we have agreed that DataProcessing refactoring will not be done for v2.0, moving to v2.1

javh commented 9 months ago

It'd be nice to decide on this for the next call. The intent for v2.1 was just updates/corrections/etc., so if this isn't v2.0 then I suspect it's just off the list.

bussec commented 9 months ago

@javh I added it to the agenda for the next call.

Generalization of DataProcessing #313