Closed javh closed 1 year ago
My main point here (https://github.com/airr-community/airr-standards/issues/347#issuecomment-1047211943) is that as a data curator, I need to preserve information in my repository so that I can figure out why my rearrangements/clones/cells that are loaded are looking suspicious. When things seem odd in my repository, I need to be able to figure out why. If I lose the information from the annotation tool that links info together in a specific analysis (such as cell_id) and how things are linked in the original data files, I am in trouble.
Personally, I think that if we do this as custom fields we are NOT doing the data curator any favors - everyone then does it differently or perhaps doesn't do it at all (because they don't know they should).
I think the standard should make provenance like this simple.
I am intrigued as to what a provenance object would look like and how it might help - thoughts?
@bcorrie For my clarification, are you only referring to ID fields that conflict with AIRR id fields? Or are you also talking about additional non-AIRR id fields that tools might assign (e.g., clonotype_id for 10x)? And/or are you also talking about custom non-id fields that don't map to AIRR (e.g., is_cell for 10x)?
- @javh thinks provenance should go in a provenance object.
A provenance object is an interesting idea; essentially formalizing the custom approach. A simple solution may be a mapping table between identifiers. However, before we go along that path, what's the scope? Will it be a defined AIRR object, will there be an ADC endpoint to query and download? If it's purely a private/internal table for a data repository, then I don't see spending AIRR-C resources on defining. Or do we consider this fundamental AIRR provenance?
Fight!!!
Rule 1 of Fight Club is don't talk about Fight Club! ;-D
@bcorrie For my clarification, are you only referring to ID fields that conflict with AIRR id fields? Or are you also talking about additional non-AIRR id fields that tools might assign (e.g., clonotype_id for 10x)? And/or are you also talking about custom non-id fields that don't map to AIRR (e.g., is_cell for 10x)?
All, none, both... 8-)
I am pointing out that there is a tension between information that annotation tools generate (e.g. identifiers for specific cells and clones) and the AIRR Standards definition of fields that play the same or similar roles (e.g. identifiers for specific cells and clones).
Let's face it, the AIRR Standard's fields for something like an identifier for a specific cell (cell_id) are driven by the use of the same concept by the annotation tools. The reason the AIRR Standard has them is because they are conceptually needed - and that is the same reason the annotation tools produce them. So I don't think it is correct to say that _id fields from an annotation tool like cellranger's cell_id "conflict" with the AIRR cell_id field. They are capturing the same concept.
I think what we want to make sure we don't do is lose something that is valuable. So if the 10X concepts are valuable (e.g. is_cell), they presumably should be mapped into some sort of AIRR standard concept, no?
A simple solution may be a mapping table between identifiers.
As far as I am concerned, at least at a very simple level, all the AIRR Standard is is a giant mapping of fields - and "we've already got one..." - https://www.youtube.com/watch?v=Ea8GyscSFaQ
https://github.com/sfu-ireceptor/config
Consider cell_id from 10X as an example use case - do we mean:
- cell_id == 10x cell_id
- cell_id == new uniqueID(), 10x cell_id > /dev/null
- cell_id == new uniqueID(), AIRR custom cell_id_annotation_tool = 10X cell_id
- cell_uid == new uniqueID(), AIRR cell_id = 10X cell_id
I like the last one the best - with a nominal mapping of an annotation tool's concept of an ID for a Cell == cell_id and the notion of an ADC concept that has unique/PID constraints being an ID by itself.
I get sad when I see data generated by a tool being marked, by the AIRR Standard no less, as not useful and being sent to /dev/null. It just seems like a bad idea... 8-)
- AIRR cell_id == 10x cell_id
- AIRR cell_id == new uniqueID(), 10x cell_id > /dev/null
- AIRR cell_id == new uniqueID(), AIRR custom cell_id_annotation_tool = 10X cell_id
- AIRR cell_uid == new uniqueID(), AIRR cell_id = 10X cell_id
None of the above, I like:
- cell_id, 10x cellular_id for linking to external 10x files
The fields are independent, so no conflict. AIRR cell_id follows all of the AIRR (and ADC) requirements, while 10x has a custom field for linking to non-AIRR external 10x files.
So my question is, since almost every single cell tool produces a cellular_id, why don't we have such a field in the AIRR Standard to represent this? This is not a tool specific field - unless you can point out a tool that doesn't provide an ID for a Cell object that it identifies. I am pretty sure all single-cell tools have such a field to link Cell objects within their data representations.
I do not believe we should be throwing this away or relying on custom fields (and therefore not interoperable and reusable) for such information.
Fight!!!
I am pointing out that there is a tension between information that annotation tools generate (e.g. identifiers for specific cells and clones)
AN _id IS NOT INFORMATION! AN _id IS NOT INFORMATION! AN _id IS NOT INFORMATION!
😂😂😂
More seriously, I am very much against
- AIRR cell_uid == new uniqueID(), AIRR cell_id = 10X cell_id
for the reasons discussed ad nauseam in the other threads. Beyond that, I care relatively little whether the original cell_id gets discarded, goes in a custom field, goes in a new AIRR-reserved field, or goes in a new AIRR-defined Provenance object. I think @schristley is right that
If it's purely a private/internal table for a data repository, then I don't see spending AIRR-C resources on defining.
But I guess I don't see the harm in ratifying whatever @bcorrie is already using for this anyway 😉
I think attributing inherent value, that should be preserved, to user data is a distraction. VDJServer and the iReceptor gateway are also tools. Cellranger munges input fastq data, iReceptor munges input AIRR data... it's the circle of life.
I like this comment by @bcorrie:
I need to preserve information in my repository so that I can figure out why my rearrangements/clones/cells that are loaded are looking suspicious. When things seem odd in my repository, I need to be able to figure out why.
This is something meaty to chew on. I read this need as a changelog - not denoting input data as immutable. I don't know what a provenance object would look like. I think we need to nail down the use case, scope, etc. But, maybe something like:
    Provenance:
        properties:
            sequence_id:
                type: string
                description: sequence_id of the current record (linker to live version of the record).
            dates:
                type: array
                description: List of date/time for each historical change.
            data_processing_ids:
                type: array
                description: List of data_processing_id values associated with each historical change.
            sequence_provenance:
                type: array
                description: List of historical sequence_id values.
            cell_provenance:
                type: array
                description: List of historical cell_id values.
Where all arrays are required to be equal length, filling in null values when there is no change to a field at the given date.
Would something along these lines answer the "something went wrong, why?" question?
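As an illustration of the equal-length rule, here is a minimal Python sketch. The field names come from the Provenance sketch above; the helper functions, the example dates, and the example identifiers (other than the 10X barcode quoted elsewhere in this thread) are made up for illustration and are not part of any AIRR schema.

```python
# Field names follow the Provenance sketch above; the helpers are hypothetical.
ARRAY_FIELDS = ("dates", "data_processing_ids", "sequence_provenance", "cell_provenance")


def validate_provenance(record):
    """Return the number of historical changes; raise if array lengths differ."""
    lengths = {name: len(record[name]) for name in ARRAY_FIELDS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Provenance arrays differ in length: {lengths}")
    return lengths["dates"]


def changes_for(record, field):
    """List (date, data_processing_id, value) for entries where `field` changed."""
    validate_provenance(record)
    return [
        (date, dp_id, value)
        for date, dp_id, value in zip(
            record["dates"], record["data_processing_ids"], record[field]
        )
        if value is not None  # null means "no change to this field at this date"
    ]


# Made-up example record: two historical changes, the second one only touched cell_id.
example = {
    "sequence_id": "seq-0001",
    "dates": ["2022-02-01T10:00:00Z", "2022-02-03T09:30:00Z"],
    "data_processing_ids": ["dp-1", "dp-2"],
    "sequence_provenance": ["contig_1", None],
    "cell_provenance": ["TACGGATGTACACCGC-1", "cell-42"],
}
```

With a structure like this, "what happened to this cell_id, and when?" becomes a lookup via changes_for(example, "cell_provenance"), which is roughly the changelog reading of the use case.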
I mentioned this on the previous thread, but it would seem that the Provenance object @javh sketches would also need to include the original file names, since the whole point is that the "historical cell_id values" are not unique. But a file name in turn implies a path/repository where said file names can be accessed. And that path/repository would seem, by definition, to be outside the ADC ...and that's where this breaks down for me in terms of making this part of the schema, as opposed to "purely a private/internal table."
@scharch Yeah, good point. I think that's covered by DataProcessing:data_processing_files.
Personally, I'd approach this problem using a human readable log output by whatever loads the data, but... I'm not opposed to trying to formalize this for the ADC context.
Responding to @bcorrie from https://github.com/airr-community/airr-standards/issues/347#issuecomment-1048056738:
So when I process some 10X studies (N samples from one study and M samples from another study) generating AIRR compliant files in preparation for analysis, I replace the source 10X cell_id with a unique AIRR cell_id to make sure cell_id is unique across my analysis of interest.
Now I want to confirm that the data I just processed for a certain 10X cell_id (TACGGATGTACACCGC-1) from a single subject in my source data is correct across the data I am going to use for my analysis. I can't...
Similarly, if I want to look at an AIRR unique cell_id in my processed data and then find the source information in the original 10X produced data files. Again, I can't...
So we have broken the link between the data in the AIRR compliant files to the original source data - data/"information" can no longer be mapped between the two...
Now if you truly trust the tools that do all of that processing, then maybe you don't want to do any provenance or reproducibility checks... But that is not how I would do things 8-)
Here is an example of what you get from a repository with our current implementation. If I maintain the annotation tool cell_id in some form, I can cross check the validity of the data I loaded with the original 10X files. If I don't, I can't... If you are a data steward maintaining an ADC repository, this is an important step...
Basically I want to be able to ensure that cell_id_annotation_tool = TACGGATGTACACCGC-1 links the correct data in the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json) that I as the data curator have maintained...
I think you are making the point for me that this is an inherently local function. No matter how we implement cell_id_annotation_tool, you will always need access to "the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json)." So immediately we've ruled out this being a useful field for anything I download from the ADC (or even get from a collaborator, frankly).
I'm not ruling out adding Provenance to the schema for local use. But we've never really figured out how we want to include/describe links between files (as opposed to data) in the schema. What if the path becomes stale? How do you ensure that the original data hasn't gotten corrupted? We could of course decide to ignore these issues and just report the links, but it certainly limits the applicability of your use case.
And then we still have to circle back to @schristley's point, which is that since the AIRR schema is designed to foster data sharing, how appropriate/useful is it to spend effort on what would be an inherently local/private schema object? I think there's a good case to be made that it can/should be included under the reproducibility part of the rubric, even if not the sharing part, but it's far from a foregone conclusion...
I need to preserve information in my repository so that I can figure out why my rearrangements/clones/cells that are loaded are looking suspicious. When things seem odd in my repository, I need to be able to figure out why.
This is something meaty to chew on. I read this need as a changelog - not denoting input data as immutable. I don't know what a provenance object would look like. I think we need to nail down the use case, scope, etc. But, maybe something like:
I would include an object that records the actual change. In IT systems, these are sometimes called audit tables, i.e., recording when a data field is changed so you can audit those changes at a later time. Our use might be more specific so the structure might be simpler.
    Audit:
        properties:
            field_name:
                type: string
                description: field that was changed
            old_value:
                type: string
                description: old value for field
            new_value:
                type: string
                description: new value for field
            data_type:
                type: string
                description: data type of field
                enum:
                    - string
                    - number
Then you record a list of these objects. You could also add a date-time for when the change occurred and a username for who made the change, but I'm not sure we need those.
Now a basic script can walk through each data record, and for each field_name change new_value back to old_value, and save the data. That should give you the "original" file contents.
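The "basic script" could look something like the following Python sketch. It assumes each data record is a plain dict and that the audit list for a record is stored in the order the changes were made; the function name and the consistency check are illustrative additions, not something the Audit sketch specifies.

```python
def revert(record, audits):
    """Return a copy of `record` with every audited field set back to its old value.

    `audits` is a list of dicts with the field_name / old_value / new_value /
    data_type keys from the Audit sketch, oldest change first (an assumption).
    """
    original = dict(record)  # leave the caller's record untouched
    # Undo the most recent change first so chained edits unwind correctly.
    for audit in reversed(audits):
        name = audit["field_name"]
        is_number = audit.get("data_type") == "number"
        # Sanity check (string fields only): the record should currently hold new_value.
        if not is_number and str(original.get(name)) != audit["new_value"]:
            raise ValueError(f"{name}: expected {audit['new_value']!r}, found {original.get(name)!r}")
        old = audit["old_value"]
        original[name] = float(old) if is_number else old
    return original


# Made-up record whose cell_id was rewritten twice after the tool produced it.
record = {"sequence_id": "seq-0001", "cell_id": "cell-42"}
audits = [
    {"field_name": "cell_id", "old_value": "TACGGATGTACACCGC-1",
     "new_value": "2-TACGGATGTACACCGC-1", "data_type": "string"},
    {"field_name": "cell_id", "old_value": "2-TACGGATGTACACCGC-1",
     "new_value": "cell-42", "data_type": "string"},
]
restored = revert(record, audits)
```

Here restored carries the tool-assigned cell_id again while record itself is left unchanged.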
Now you need to define a scope that these changes apply to. This can be done in various ways. One way is if you want all of these audit records to be in a single table, then you need additional fields (foreign keys) that specify what data record was changed. Another approach is to have an audit table for each object type, like repertoire_audit, rearrangement_audit and so on. That then implicitly defines the scope for the changes.
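The two scoping options can be sketched side by side in Python. The object_type and record_id keys stand in for the "additional fields (foreign keys)" and are hypothetical names; only the repertoire_audit / rearrangement_audit table names come from the comment above.

```python
from collections import defaultdict


def group_by_record(audits):
    """Single-table option: look up changes by (object_type, record_id) foreign keys."""
    scoped = defaultdict(list)
    for audit in audits:
        scoped[(audit["object_type"], audit["record_id"])].append(audit)
    return dict(scoped)


def split_by_object_type(audits):
    """Per-object option: one table per object type, so the type is implicit."""
    tables = defaultdict(list)
    for audit in audits:
        row = {key: value for key, value in audit.items() if key != "object_type"}
        tables[f"{audit['object_type']}_audit"].append(row)
    return dict(tables)


# Made-up audit rows in the single-table layout.
audits = [
    {"object_type": "rearrangement", "record_id": "seq-0001", "field_name": "cell_id",
     "old_value": "TACGGATGTACACCGC-1", "new_value": "cell-42", "data_type": "string"},
    {"object_type": "repertoire", "record_id": "rep-7", "field_name": "repertoire_id",
     "old_value": "ERS1", "new_value": "rep-7", "data_type": "string"},
]
by_record = group_by_record(audits)
tables = split_by_object_type(audits)
```

Either layout carries the same information; the single table needs the extra keys on every row, while the per-object tables encode the object type in the table name.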
I don't think file names and data processing are needed. If we stick to the limited scenario where _id values are changed when data is loaded into ADC, then "reversion" is downloading the data from the ADC and rewriting the _id fields with their old values. You can put data in whatever file you want.
This is something meaty to chew on. I read this need as a changelog - not denoting input data as immutable. I don't know what a provenance object would look like. I think we need to nail down the use case, scope, etc. But, maybe something like:
And a provenance use case moving between versions of an AIRR Standard release:
This is not new, but please don't take away fields that we need for provenance and reproducibility...
I think you are making the point for me that this is an inherently local function. No matter how we implement cell_id_annotation_tool, you will always need access to "the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json)." So immediately we've ruled out this being a useful field for anything I download from the ADC (or even get from a collaborator, frankly).
That is the whole point - this isn't meant to be useful to you as a consumer of my data, it is meant to be useful to me as a curator so I can make sure the data you reuse is accurate...
It is a mistake to look at the AIRR Standard from just the perspective of someone who uses the data. If you don't want to use this field, then don't use it, but don't prevent others (a data curator) from having the ability to use the field if there is a use case for it!
If you don't want to use this field, then don't use it, but don't prevent others (a data curator) from having the ability to use the field if there is a use case for it!
No one has suggested preventing you from using the field, the question is whether or not to design and implement it as part of the AIRR schema.
Since there is general openness to the idea, my questions are:
- Would a Provenance or Audit object make sense without explicitly linking to/encapsulating the original data files?
- If not, are we able to preserve (enough of) that linkage to make the cost/benefit of adding one to the schema worthwhile?
My own sense is that the answers to both are negative, but it's your use case, so I guess you tell me. But again, even if I am correct and we don't implement this as part of the schema, it doesn't stop you from doing it on your own in whatever way works best for you...
No one has suggested preventing you from using the field, the question is whether or not to design and implement it as part of the AIRR schema.
If there is a use case from the AIRR Community for it (e.g. those that curate data for reuse), then it should be part of the standard... If it isn't part of the standard then it is next to useless for the AIRR Community...
@bcorrie reserving my remaining philosophical disagreements, let's do the practical side - please answer my questions above
If there is a use case from the AIRR Community for it (e.g. those that curate data for reuse), then it should be part of the standard... If it isn't part of the standard then it is next to useless for the AIRR Community...
The use case should be generally applicable though. We don't, as a pseudo-rule, define fields/objects for use cases specific to 1-2 tools. We encourage custom fields in that case. Though, we've usually made exception for the ADC or things that are really obviously necessary.
So... I think we should focus on finding a solution that covers the needs of VDJServer and iReceptor specifically, under the assumption that it'll be applicable to the ADC as a whole, then that should be good enough.
I don't think file names and data processing are needed. If we stick to the limited scenario where _id values are changed when data is loaded into ADC, then "reversion" is downloading the data from the ADC and rewriting the _id fields with their old values. You can put data in whatever file you want.
I rather like the json audit table... It seems like @bcorrie wants the data file links though. What about input and output references to files / records in the ADC? In the spirit of WDL task definitions:
https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#task-definition
The use case should be generally applicable though. We don't, as a pseudo-rule, define fields/objects for use cases specific to 1-2 tools. We encourage custom fields in that case. Though, we've usually made exception for the ADC or things that are really obviously necessary.
Are there any single-cell tools that don't have a cell_id concept as a link between files/objects in its internal data model? So I would argue it isn't specific to 1-2 tools, it is specific to the concept of a Cell and how single cell tools process such data.
So... I think we should focus on finding a solution that covers the needs of VDJServer and iReceptor specifically, under the assumption that it'll be applicable to the ADC as a whole, then that should be good enough.
As I say, this argument isn't specific to the ADC. It applies to files on disk as well. If you transform data from the 10X file format for two samples into AIRR repertoire, rearrangement, clone, and cell files for processing by an AIRR compliant tool (replacing the 10X cell_id with a unique ID), would you be able to tell me if the data in the AIRR files are a correct transformation of the source data in the 10X files? I don't think you can. If not, you have a non-reproducible data transformation.
Since there is general openness to the idea, my questions are:
- Would a Provenance or Audit object make sense without explicitly linking to/encapsulating the original data files?
I can't answer that question, I don't know what these are or what they look like. I do know that the main reasons for our sequence_file and data_processing_files are for provenance and reproducibility - so to date we have been unable to provide these features without linking to files.
- If not, are we able to preserve (enough of) that linkage to make the cost/benefit of adding one to the schema worthwhile?
I am not sure what you are asking here. What linkage are we preserving and what costs/benefits are we talking about...
Are there any single-cell tools that don't have a cell_id concept as a link between files/objects in its internal data model? So I would argue it isn't specific to 1-2 tools, it is specific to the concept of a Cell and how single cell tools process such data.
The need to preserve original identifiers is what is tool-specific. Modifying the cell identifiers is common and I can't think of a single-cell tool that cares about tracking the changes to cell_id they may have made. If you want reproducibility, then you provide it via code describing the steps and tool version numbers, not via some sort of schema-defined diff on the files.
I don't see this being applicable outside VDJServer and iReceptor. At least, initially. Which is perfectly fine. It'll give us a clear picture of the problem that needs to be solved. If we try to generalize this to hypothetical tools/problems, then I think we'll get nowhere.
- @bcorrie wants a mechanism to preserve original identifier fields uploaded by users into the ADC.
Just to be clear about @javh's initial statement. I DO NOT want a mechanism to preserve original fields uploaded by users into the ADC. I DO WANT to preserve fields that we think are important to the use cases we have in the AIRR Community by making them part of the AIRR Standard.
Again, to be clear, in regards to this issue, I don't want provenance on individual AIRR fields (to be able to track changes to fields). I want to be able to have data provenance and reproducibility on data that is converted from common tool chains (e.g. 10X) into the AIRR Standard data formats for repertoires, rearrangements, clones, and cells.
In my opinion the AIRR Standard should facilitate this, not hinder it.
@bcorrie, All right, I think we need a more precise definition of your problem then.
How is what you want not accomplished via, say, providing a WDL file and docker container for the conversion steps? Or maintaining a copy of the files the user originally uploaded? How is the AIRR Standard even relevant to this problem?
Again, to be clear, in regards to this issue, I don't want provenance on individual AIRR fields (to be able to track changes to fields). I want to be able to have data provenance and reproducibility on data that is converted from common tool chains (e.g. 10X) into the AIRR Standard data formats for repertoires, rearrangements, clones, and cells.
In my opinion the AIRR Standard should facilitate this, not hinder it.
@bcorrie Expanding this to a broader context makes it sound like #313 , if this is your intent then I don't think we need a separate duplicate issue. Close this issue and use the original one. You already have a team working on reproducibility so you should be able to provide specific suggestions on how the AIRR Standards should be updated. Maybe the preliminary data processing design is sound, and you just need to start specifying it in more detail.
If somehow you mean something independent from data processing reproducibility then you need to clearly state how this issue is different, and update the issue title and main comment to reflect that.
If somehow you mean something independent from data processing reproducibility then you need to clearly state how this issue is different, and update the issue title and main comment to reflect that.
I didn't create this issue... 8-)
All I want is a cell_id (as per the tool definition) and a cell_uid or cell_pid (that can be used broadly and changed at will)
See here: https://github.com/airr-community/airr-standards/issues/347#issuecomment-1028447479
Change the names as you see fit, but this would solve my problem... 8-)
All I want is a cell_id (as per the tool definition) and a cell_uid or cell_pid (that can be used broadly and changed at will)
These are just arbitrary names, what if we call them cell_ref (as per the tool definition) and cell_id (the AIRR identifier)?
All I want is a cell_id (as per the tool definition) and a cell_uid or cell_pid (that can be used broadly and changed at will)
These are just arbitrary names, what if we call them cell_ref (as per the tool definition) and cell_id (the AIRR identifier)?
I believe that would work - as long as they are both part of the AIRR standard and cell_ref isn't a "custom" field.
I believe that would work - as long as they are both part of the AIRR standard and cell_ref isn't a "custom" field.
Why can't it be a custom field? cell_ref would just be a comment then, with no meaning absent the raw files. Seems very custom to me.
Let's take a concrete example:
cell_id goes through the following edits:
1. ATGC observed in both sample A and B. Sample A has two instances of ATGC.
2. ATGC-1 observed in both sample A and B. Sample A also has ATGC-2. No mapping between FASTQ header and -1 and -2.
3. 2-ATGC-1 and 2-ATGC-2 in sample A, 1-ATGC-1 in sample B. User didn't specify 2- = A, but you figured it out somehow.
4. 2-ATGC-1 -> 1, 2-ATGC-2 -> 400, 1-ATGC-1 -> 20. No record of how it made these changes - assumed to be row number of input data.
None are globally unique. User uploaded final analysis results (4) to the ADC and FASTQ (1) to SRA. How is cell_ref = {1, 400, 20} valuable?
All I want is a cell_id (as per the tool definition) and a cell_uid or cell_pid (that can be used broadly and changed at will)
These are just arbitrary names, what if we call them cell_ref (as per the tool definition) and cell_id (the AIRR identifier)?
I believe that would work - as long as they are both part of the AIRR standard and cell_ref isn't a "custom" field.
To be more precise, cell_id is the AIRR identifier field for linking AIRR objects in the AIRR Data Model. It should be unique in the local file context and will be overwritten with a globally unique and persistent identifier when the data is loaded into the ADC.
We need a better name than cell_ref, maybe tool_cell_id. It is assigned the original value of cell_id (as assigned by the tool) when the data is loaded into the ADC. Anybody that downloads data from the ADC can either use tool_cell_id directly, or they can copy tool_cell_id to cell_id in the data to "reproduce" the original tool file.
This is the same solution I presented earlier, I just suggested *_original_id names, but really we can call them anything.
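The load and copy-back steps described here could be sketched in Python as follows. The field name tool_cell_id is only the tentative name floated in this comment, and the uuid-based identifier is a placeholder assumption, since the uniqueness and persistence rules for ADC identifiers are still being discussed in #347.

```python
import uuid


def load_into_repository(rows):
    """On load: keep the tool's cell_id in tool_cell_id, assign a new repository cell_id."""
    new_ids = {}  # one new identifier per distinct tool cell_id within this load
    loaded = []
    for row in rows:
        tool_id = row["cell_id"]
        if tool_id not in new_ids:
            new_ids[tool_id] = uuid.uuid4().hex  # placeholder for a real gid/pid scheme
        loaded.append({**row, "tool_cell_id": tool_id, "cell_id": new_ids[tool_id]})
    return loaded


def restore_tool_ids(rows):
    """After download: copy tool_cell_id back into cell_id to mimic the tool's file."""
    return [
        {key: value for key, value in {**row, "cell_id": row["tool_cell_id"]}.items()
         if key != "tool_cell_id"}
        for row in rows
    ]


# Two chains from the same cell, as a tool might report them.
tool_rows = [
    {"sequence_id": "contig_1", "cell_id": "TACGGATGTACACCGC-1"},
    {"sequence_id": "contig_2", "cell_id": "TACGGATGTACACCGC-1"},
]
loaded = load_into_repository(tool_rows)
restored = restore_tool_ids(loaded)
```

Note that the restored cell_id values are only as unique as the tool made them, so rows restored from several samples may collide again, which is the situation the repository identifier was introduced to avoid.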
I believe that would work - as long as they are both part of the AIRR standard and cell_ref isn't a "custom" field.
Why can't it be a custom field? cell_ref would just be a comment then, with no meaning absent the raw files. Seems very custom to me.
It is pretty custom, which was why I was okay with doing something custom for VDJServer, but when it comes to implementation, I don't think these fields need to be put into the AIRR schema file. Instead, they can be specified as rearrangement extensions for the ADC API. This makes sense to me because these fields are unneeded until data is loaded into the ADC. Thus they are technically a provenance feature for the ADC versus for AIRR Standards as a whole.
Instead, they can be specified as rearrangement extensions for the ADC API.
Seems fine to me as long as cell_id stores the gid/pid and doesn't get redefined to be the "original" identifier (whatever "original" means).
Though, I like your audit table idea better. It's really the same thing. Just in a fit-to-purpose object instead of Rearrangement.
Though, I like your audit table idea better. It's really the same thing. Just in a fit-to-purpose object instead of Rearrangement.
Me too in the sense that it's general and could be integrated throughout data processing. This other solution is really just a specific hack and doesn't handle the complexities that you point out in your example, but I feel Brian is fixated on just his one use case, 10x cells, and he hasn't thought through the larger context, i.e., what about clones, what about tools that generate their own repertoire_id and/or repertoire_group_id and so on. So we take this as an initial solution, but as the context becomes greater, it might evolve into something like our original ideas.
To be more precise, cell_id is the AIRR identifier field for linking AIRR objects in the AIRR Data Model. It should be unique in the local file context and will be overwritten with a globally unique and persistent identifier when the data is loaded into the ADC.
I want to emphasize that there is a strong coalition around this point: @bussec @javh @scharch and @schristley have all said/agreed with some version of it.
Anybody that downloads data from the ADC can either use tool_cell_id directly, or they can copy tool_cell_id to cell_id in the data to "reproduce" the original tool file.
The above-named also all seem to be onboard with this, possibly with varying degrees of reluctance/eye-rolling about a reserved field vs a custom one. The problem seems to be that @bcorrie (correct me if I'm wrong) is unhappy with the prospect of this copy/overwrite operation and wants cell_id to always and only reflect the pseudorandom string generated by the most recent (per @javh) processing tool to touch the data before upload to the ADC.
But unless I'm wrong about that math, I think we're all pretty clear on where we all stand and we don't need to continue rehearsing the same arguments...
To be more precise, cell_id is the AIRR identifier field for linking AIRR objects in the AIRR Data Model. It should be unique in the local file context and will be overwritten with a globally unique and persistent identifier when the data is loaded into the ADC.
I want to emphasize that there is a strong coalition around this point: @bussec @javh @scharch and @schristley have all said/agreed with some version of it.
@bcorrie agrees as well - and has never stated such a field was not needed. We need an identifier at the AIRR Standard level that is unique in the context that it is being considered (local file, etc) and is controlled (i.e. can be over written) by the tools using it within that context. The ADC will use it in exactly the same way (in the ADC context), with the uniqueness and persistence criteria of the identifier in the ADC yet to be finalized.
The above-named also all seem to be onboard with this, possibly with varying degrees of reluctance/eye-rolling about a reserved field vs a custom one. The problem seems to be that @bcorrie (correct me if I'm wrong) is unhappy with the prospect of this copy/overwrite operation and wants cell_id to always and only reflect the pseudorandom string generated by the most recent (per @javh) processing tool to touch the data before upload to the ADC.
What @bcorrie has been saying for some time is that we also need the equivalent of tool_cell_id, which has a uniqueness criteria within the DataProcessing context that it was created. That is, it is the ID of the Cell object as defined by the tool described in the DataProcessing. IMNSHO this is critical to data curation, data provenance, and reproducibility. I would prefer that this be in the AIRR Standard, but I am OK with it being an ADC API extension if others insist that it is not "AIRR Standard worthy" (with the appropriate reluctance, eye roll, and heavy sigh). It looks like @schristley and I agree that this is probably OK so we can move forward with that...
Wait @bcorrie are you ok with this, too, then?
Seems fine to me as long as cell_id stores the gid/pid and doesn't get redefined to be the "original" identifier (whatever "original" means).
If so, does that mean:
- You are not interested in an Audit object (etc) otherwise?
Wait @bcorrie are you ok with this, too, then?
Seems fine to me as long as cell_id stores the gid/pid and doesn't get redefined to be the "original" identifier (whatever "original" means).
I think so - we have a different field to store the adc_tool_cell_id ("the original id") or whatever we call it as an ADC extension, so cell_id is your AIRR cell_id, a field whose uniqueness is AIRR context (file, analysis, ADC) specific.
If so, does that mean:
Not so sure, as this is just one identifier that we need to consider, but similar principles might apply
- You are not interested in an Audit object (etc) otherwise?
I am interested, I just don't think it is required to resolve the cell_id issue...
I am interested, I just don't think it is required to resolve the cell_id issue...
What is the plan to solve this issue with other _id fields? Eg, clone_id. Same thing? More extension fields?
Almost @scharch, we need to decide on some specifics. I suggested these two solutions. I think 1 is the way to go, but wanted to offer 2 in case somebody had compelling reasons why that was better. In particular, IEDB does 2 and the current GermlineSet suggests also, but neither feels "better" to me. We should pick one solution for all AIRR objects to follow.
Then we actually have to decide on the structure of the identifier itself. While we can say CURIE-like, we should be specific about what we actually mean, because we have to specify how the resolvers work.
DARN!! Looks like I missed the season finale! Did we already discover that @bcorrie and @schristley are actually the same person? And get to the point where that person -- after being implicated in the demolition of a good chunk of SRA's infrastructure -- needs to go into hiding and starts setting up Rogue AIRR Repos on Raspberry Pis along Interstate 84? I wonder what kind of global IDs you would need for that! What a cliffhanger!
DARN!! Looks like I missed the season finale!
A doozy for sure. Like every good season finale, it left us with a cliff hanger with Scott pondering life as a blockchain developer instead.
And as always they show us snippets of the next season, including:
Meanwhile:
Coming to you all on the AIRR+ streaming network, for only $29.95 a month!
@javh edit: content.
The good news is it looks like we've arrived at a solution! Which is to add some sort of ID history fields to the ADC Rearrangement extension for at least sequence_id and cell_id.
Because this is essentially a custom field solution, I don't have an opinion on implementation. Just some loose suggestions on future-proofing:
- clone_id.
- _id to avoid confusion. Eg, _ref, _rid, _fd, _comment, etc. Dependent upon how we resolve persistence (#347) and whether you want this to be primarily an identifier in a file.
I assume we can table the Provenance and Audit ideas, yes?
@javh edit: content.
I've decided to completely disengage from the AIRR Community until further notice.
@javh edit: content.
@schristley, understood. I'll follow up separately about whatever work needs to be done.
@javh edit: content.
closed for AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUGGGGGGGGGGGGGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Separating the question of identifier provenance from #347:
Initiating comment by @schristley is here: https://github.com/airr-community/airr-standards/issues/347#issuecomment-1018745723
Fight!!!