Open javh opened 4 years ago
I feel we should be pragmatic in our design. Instead of generalizing too much, let's have a list of specific questions (use cases) that we want to be able to answer, then minimally extend DataProcessing to support them. I feel it's okay to be fairly specific to immune repertoire analysis, versus any general bioinformatic workflow. Example questions might be:
We are struggling with this as we are starting to load Clone data as well as Rearrangement data. And of course we will also have Cell data soon.
Two simple suggestions:
- DataProcessing. So we have one DataProcessing object, but can easily tell which DataProcessing in a Repertoire generated which type of object - Rearrangements, Clones, or Cells. Maybe have a simple processing_produced field that can take on values that represent the thing in the AIRR spec (rearrangements, clones, cells). If we are talking about DataProcessing objects that describe the processing that was used to create a set of objects (e.g. a file) that exist in the AIRR spec (e.g. Rearrangements or Clones) then this captures a large percentage of our use cases I think, no?
- DataProcessing objects within a Repertoire - possibly a parent-child relationship, that would allow us to tell that the Clones created from a DataProcessing are related to the Rearrangements from another DataProcessing. This would in fact allow us to create simple pipelines.
Thoughts?
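To make the two suggestions concrete, here is a minimal Python sketch, not a proposal for the actual schema. It assumes a processing_produced field restricted to the three values mentioned above, plus a parent link between DataProcessing objects. The name parent_id, the identifiers dp-vdj and dp-clone, and the helper pipeline_for are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed vocabulary for the proposed processing_produced field.
PRODUCED_TYPES = {"rearrangements", "clones", "cells"}


@dataclass
class DataProcessing:
    data_processing_id: str
    processing_produced: str          # proposed field, not in the current spec
    parent_id: Optional[str] = None   # hypothetical parent-child link

    def __post_init__(self) -> None:
        if self.processing_produced not in PRODUCED_TYPES:
            raise ValueError(
                f"unknown processing_produced: {self.processing_produced!r}"
            )


def pipeline_for(target_id: str, steps: List[DataProcessing]) -> List[DataProcessing]:
    """Follow parent links from the target step back to its root.

    Returns the chain ordered from the earliest step to the target.
    """
    by_id: Dict[str, DataProcessing] = {s.data_processing_id: s for s in steps}
    chain: List[DataProcessing] = []
    seen = set()
    current: Optional[str] = target_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parent links")
        seen.add(current)
        step = by_id[current]
        chain.append(step)
        current = step.parent_id
    return list(reversed(chain))


# One Repertoire with two processing steps: V(D)J assignment, then clones.
repertoire_steps = [
    DataProcessing("dp-vdj", "rearrangements"),
    DataProcessing("dp-clone", "clones", parent_id="dp-vdj"),
]
chain = pipeline_for("dp-clone", repertoire_steps)
print([(s.data_processing_id, s.processing_produced) for s in chain])
```

With a single object type, supporting a new kind of output only means adding a value to the vocabulary, while the parent link is enough to recover which Rearrangements a set of Clones came from.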
@bcorrie to make sure I understand: would that be meaningfully different from devolving DataProcessing into RearrangementProcessing, CloneProcessing, etc, as described above?
Not much... The main reason to have two different things would be if the internal fields were substantially different. If the fields were the same, it doesn't make much sense and would probably be a bad idea. In general, it makes sense to have one object if we can - so the question for me would be are the fields general enough across the types of processing?
The downside of having two different objects is that if we want to add a third type of processing, we have to add a third object to the spec - which is very heavy. With one object we just add a third keyword... For example, what do I do if I did some data processing of the single cell data (I want to capture that I used the Cell Ranger pipeline) - how do I capture that? Or how do I capture the data processing for building a clone tree? Or the next cool thing that comes down the pipe 8-)
- What tool(s) and parameter(s) were used to pre-process the raw sequencing data?
- What tool(s) and parameter(s) were used to perform V(D)J assignment?
- What tool(s) and parameter(s) were used to perform clonal assignment?
I would add:
Both are draft objects in the AIRR spec.
In #515 we talked about having keywords for a study (keywords_study) that specifically indicate what types of data are produced in a study. We discussed having a keywords_data field with a controlled vocabulary such as "generated_rearrangements", "generated_clones", "generated_cells", "generated_lineage", etc. Just wanted to capture the discussion we had at the Standards meeting here so we didn't lose track: https://github.com/airr-community/airr-standards/wiki/Minutes_Standards_2021-05
BTW, I just invented the above keywords - we did not discuss the specifics, we left that to this issue...
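As a rough illustration of how such a controlled vocabulary might be checked, here is a small Python sketch. The vocabulary is just the placeholder values listed above, which were not agreed on, and the helper invalid_keywords and the study dictionary are hypothetical.

```python
from typing import Iterable, List

# Placeholder vocabulary copied from the examples above; none of these
# values were agreed on.
KEYWORDS_DATA_VOCAB = frozenset({
    "generated_rearrangements",
    "generated_clones",
    "generated_cells",
    "generated_lineage",
})


def invalid_keywords(keywords_data: Iterable[str]) -> List[str]:
    """Return the entries that are not in the controlled vocabulary, sorted."""
    return sorted(set(keywords_data) - KEYWORDS_DATA_VOCAB)


# A hypothetical study record carrying the proposed keywords_data field.
study = {"keywords_data": ["generated_rearrangements", "generated_clones"]}
print(invalid_keywords(study["keywords_data"]))
print(invalid_keywords(["generated_clones", "generated_trees"]))
```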
I think we have agreed that DataProcessing refactoring will not be done for v2.0, moving to v2.1
It'd be nice to decide on this for the next call. The intent for v2.1 was just updates, corrections, etc., so if this isn't v2.0 then I suspect it's just off the list.
@javh I added it to the agenda for the next call.
Originally posted by @schristley in https://github.com/airr-community/airr-standards/pull/294#issuecomment-571331941