How to approach identifier provenance

javh commented 2 years ago

Separating the question of identifier provenance from #347:

Initiating comment by @schristley is here: https://github.com/airr-community/airr-standards/issues/347#issuecomment-1018745723

@schristley seems content to implement custom Rearrangement fields (non-schema fields) to track original identifiers in VDJServer.
@bcorrie wants a mechanism to preserve original identifier fields uploaded by users into the ADC.
@javh thinks provenance should go in a provenance object.

Fight!!!

bcorrie commented 2 years ago

My main point here (https://github.com/airr-community/airr-standards/issues/347#issuecomment-1047211943) is that as a data curator, I need to preserve information in my repository so that I can figure out why my rearrangements/clones/cells that are loaded are looking suspicious. When things seem odd in my repository, I need to be able to figure out why. If I lose the information from the annotation tool that links info together in a specific analysis (such as cell_id ) and how things are linked in the original data files I am in trouble.

Personally, I think if we do this as custom fields I think we are NOT doing the data curator any favors - everyone then does it differently or perhaps doesn't do it at all (because they don't know they should).

I think the standard should make provenance like this simple.

I am intrigued as to what a provenance object would look like and how it might help - thoughts?

schristley commented 2 years ago

@bcorrie For my clarification, are you only referring to ID fields that conflict with AIRR id fields? Or are you also talking about additional non-AIRR id fields that tools might assign (e.g., clonotype_id for 10x)? And/or are you also talking about custom non-id fields that don't map to AIRR (e.g., is_cell for 10x)?

schristley commented 2 years ago

@javh thinks provenance should go in a provenance object.

A provenance object is an interesting idea; essentially formalizing the custom approach. A simple solution may be a mapping table between identifiers. However, before we go along that path, what's the scope? Will it be a defined AIRR object, will there be an ADC endpoint to query and download? If it's purely a private/internal table for a data repository, then I don't see spending AIRR-C resources on defining. Or do we consider this fundamental AIRR provenance?

Fight!!!

Rule 1 of Fight Club is don't talk about Fight Club! ;-D

bcorrie commented 2 years ago

@bcorrie For my clarification, are you only referring to ID fields that conflict with AIRR id fields? Or are you also talking about additional non-AIRR id fields that tools might assign (e.g., clonotype_id for 10x)? And/or are you also talking about custom non-id fields that don't map to AIRR (e.g., is_cell for 10x)?

All, none, both... 8-)

I am pointing out that there is a tension between information that annotation tools generate (e.g. identifiers for specific cells and clones) and the AIRR Standards definition of fields that play the same or similar roles (e.g. identifiers for specific cells and clones).

Let's face it, the AIRR Standard's fields for something like an identifier for a specific cell (cell_id) are driven by the use of the same concept by the annotation tools. The reason the AIRR Standard has them is because they are conceptually needed - and that is the same reason the annotation tools produce them. So I don't think it is correct to say that _id fields from an annotation tool like cellranger's cell_id "conflict" with the AIRR cell_id field. They are capturing the same concept.

I think what we want to make sure we don't do is lose something that is valuable. So if the 10X concepts are valuable (e.g. is_cell), they presumably should be mapped into some sort of AIRR standard concept, no?

bcorrie commented 2 years ago

A simple solution may be a mapping table between identifiers.

As far as I am concerned, at least at a very simple level, all the AIRR Standard is is a giant mapping of fields - and "we've already got one..." - https://www.youtube.com/watch?v=Ea8GyscSFaQ

https://github.com/sfu-ireceptor/config

Consider cell_id from 10X as an example use case - do we mean:

AIRR cell_id == 10x cell_id
AIRR cell_id == new uniqueID(), 10x cell_id > /dev/null
AIRR cell_id == new uniqueID(), AIRR custom cell_id_annotation_tool = 10X cell_id
AIRR cell_uid == new uniqueID(), AIRR cell_id = 10X cell_id

I like the last one the best - with a nominal mapping of an annotation tool's concept of an ID for a Cell == cell_id and the notion of an ADC concept that has unique/PID constraints being an ID by itself.

I get sad when I see data generated by a tool being marked, by the AIRR Standard no less, as not useful and being sent to /dev/null. It just seems like a bad idea... 8-)

schristley commented 2 years ago

AIRR cell_id == 10x cell_id

AIRR cell_id == new uniqueID(), 10x cell_id > /dev/null

AIRR cell_id == new uniqueID(), AIRR custom cell_id_annotation_tool = 10X cell_id

AIRR cell_uid == new uniqueID(), AIRR cell_id = 10X cell_id

None of the above, I like:

AIRR cell_id, 10x cellular_id for linking to external 10x files

The fields are independent, so no conflict. AIRR cell_id follows all of the AIRR (and ADC) requirements, while 10x has a custom field for linking to non-AIRR external 10x files.

bcorrie commented 2 years ago

So my question is, since almost every single-cell tool produces a cellular_id, why don't we have such a field in the AIRR Standard to represent this? This is not a tool-specific field - unless you can point out a tool that doesn't provide an ID for a Cell object that it identifies. I am pretty sure all single-cell tools have such a field to link Cell objects within their data representations.

I do not believe we should be throwing this away or relying on custom fields (and therefore not interoperable and reusable) for such information.

scharch commented 2 years ago

Fight!!!

I am pointing out that there is a tension between information that annotation tools generate (e.g. identifiers for specific cells and clones)

AN _id IS NOT INFORMATION! AN _id IS NOT INFORMATION! AN _id IS NOT INFORMATION!

😂😂😂

More seriously, I am very much against

AIRR cell_uid == new uniqueID(), AIRR cell_id = 10X cell_id

for the reasons discussed ad nauseam in the other threads. Beyond that, I care relatively little whether the original cell_id gets discarded, goes in a custom field, goes in a new AIRR-reserved field, or goes in a new AIRR-defined Provenance object. I think @schristley is right that

If it's purely a private/internal table for a data repository, then I don't see spending AIRR-C resources on defining.

But I guess I don't see the harm in ratifying whatever @bcorrie is already using for this anyway 😉

javh commented 2 years ago

I think attributing inherent value, that should be preserved, to user data is a distraction. VDJServer and the iReceptor gateway are also tools. Cellranger munges input fastq data, iReceptor munges input AIRR data... it's the circle of life.

I like this comment by @bcorrie:

I need to preserve information in my repository so that I can figure out why my rearrangements/clones/cells that are loaded are looking suspicious. When things seem odd in my repository, I need to be able to figure out why.

This is something meaty to chew on. I read this need as a changelog - not denoting input data as immutable. I don't know what a provenance object would look like. I think we need to nail down the use case, scope, etc. But, maybe something like:

Provenance:
    properties:
        sequence_id:
            type: string
            description: sequence_id of the current record (linker to live version of the record).
        dates:
            type: array
            description: List of date/time for each historical change.
        data_processing_ids:
            type: array
            description: List of data_processing_id values associated with each historical change.
        sequence_provenance:
            type: array
            description: List of historical sequence_id values.
        cell_provenance:
            type: array
            description: List of historical cell_id values.

Where all arrays are required to be equal length, filling in null values when there is no change to a field at the given date.

Would something along these lines answer the "something went wrong, why?" question?
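
As a rough illustration of the equal-length-array layout sketched above, here is a minimal Python version. The helper names, the `dp-...` identifiers, and the timestamps are made up for the example and are not part of the AIRR schema; unchanged fields are stored as `None`, standing in for null.

```python
from typing import Optional

# Hypothetical in-memory form of the Provenance sketch; not an AIRR schema object.
ARRAY_FIELDS = ("dates", "data_processing_ids", "sequence_provenance", "cell_provenance")


def new_provenance(sequence_id: str) -> dict:
    """Create an empty provenance record for the live sequence_id."""
    record = {"sequence_id": sequence_id}
    for name in ARRAY_FIELDS:
        record[name] = []
    return record


def add_change(record: dict, date: str, data_processing_id: str,
               old_sequence_id: Optional[str] = None,
               old_cell_id: Optional[str] = None) -> None:
    """Append one historical change; fields that did not change stay None."""
    record["dates"].append(date)
    record["data_processing_ids"].append(data_processing_id)
    record["sequence_provenance"].append(old_sequence_id)
    record["cell_provenance"].append(old_cell_id)


def is_valid(record: dict) -> bool:
    """Check the requirement that all arrays have the same length."""
    return len({len(record[name]) for name in ARRAY_FIELDS}) == 1


# Example values are invented for illustration.
prov = new_provenance("seq-0001")
add_change(prov, "2022-02-01T10:00:00Z", "dp-cellranger", old_cell_id="TACGGATGTACACCGC-1")
add_change(prov, "2022-02-02T09:30:00Z", "dp-adc-load", old_sequence_id="contig_1")
```

Because every change appends to all four arrays at once, index `i` always describes the same historical event across them.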

scharch commented 2 years ago

I mentioned this on the previous thread, but it would seem that the Provenance object @javh sketches would also need to include the original file names, since the whole point is that the "historical cell_id values" are not unique. But a file name in turn implies a path/repository where said file names can be accessed. And that path/repository would seem, by definition, to be outside the ADC ...and that's where this breaks down for me in terms of making this part of the schema, as opposed to "purely a private/internal table."
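
The non-uniqueness point can be made concrete with a small Python sketch: a mapping table has to be keyed by the source file as well as the tool's identifier, otherwise the same barcode from two files collapses into one entry. The function name and the second file name (`ERS2-TRA.tsv`) are hypothetical, and `uuid4` is only a stand-in for whatever identifier scheme a repository would actually use.

```python
import uuid
from typing import Dict, Tuple

# Hypothetical mapping table: (source file, tool-assigned cell_id) -> new identifier.
mapping: Dict[Tuple[str, str], str] = {}


def assign_id(source_file: str, tool_cell_id: str) -> str:
    """Return a stable new identifier for (file, tool id), creating it on first use."""
    key = (source_file, tool_cell_id)
    if key not in mapping:
        mapping[key] = uuid.uuid4().hex  # stand-in for a real unique ID scheme
    return mapping[key]


# The same barcode appears in two different files and must not be merged.
a = assign_id("ERS1-TRA.tsv", "TACGGATGTACACCGC-1")
b = assign_id("ERS2-TRA.tsv", "TACGGATGTACACCGC-1")  # second file name is invented
```

Note that the key is only meaningful to someone who can still resolve the file name, which is the concern raised above.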

javh commented 2 years ago

@scharch Yeah, good point. I think that's covered by DataProcessing:data_processing_files.

Personally, I'd approach this problem using a human readable log output by whatever loads the data, but... I'm not opposed to trying to formalize this for the ADC context.

scharch commented 2 years ago

Responding to @bcorrie from https://github.com/airr-community/airr-standards/issues/347#issuecomment-1048056738:

So when I process some 10X studies (N samples from one study and M samples from another study) generating AIRR compliant files in preparation for analysis, I replace the source 10X cell_id with a unique AIRR cell_id to make sure cell_id is unique across my analysis of interest.

Now I want to confirm that the data I just processed for a certain 10X cell_id (TACGGATGTACACCGC-1) from a single subject in my source data is correct across the data I am going to use for my analysis. I can't...

Similarly, if I want to look at an AIRR unique cell_id in my processed data and then find the source information in the original 10X produced data files. Again, I can't...

So we have broken the link between the data in the AIRR compliant files to the original source data - data/"information" can no longer be mapped between the two...

Now if you truly trust the tools that do all of that processing, then maybe you don't want to do any provenance or reproducibility checks... But that is not how I would do things 8-)

Here is an example of what you get from a repository with our current implementation. If I maintain the annotation tool cell_id in some form, I can cross check the validity of the data I loaded with the original 10X files. If I don't, I can't... If you are a data steward maintaining an ADC repository, this is an important step...

Basically I want to be able to ensure that cell_id_annotation_tool = TACGGATGTACACCGC-1 links the correct data in the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json) that I as the data curator have maintained...

I think you are making the point for me that this is an inherently local function. No matter how we implement cell_id_annotation_tool, you will always need access to "the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json)." So immediately we've ruled out this being a useful field for anything I download from the ADC (or even get from a collaborator, frankly).

I'm not ruling out adding Provenance to the schema for local use. But we've never really figured out how we want to include/describe links between files (as opposed to data) in the schema. What if the path becomes stale? How do you ensure that the original data hasn't gotten corrupted? We could of course decide to ignore these issues and just report the links, but it certainly limits the applicability of your use case.

And then we still have to circle back to @schristley's point, which is that since the AIRR schema is designed to foster data sharing, how appropriate/useful is it to spend effort on what would be an inherently local/private schema object? I think there's a good case to be made that it can/should be included under the reproducibility part of the rubric, even if not the sharing part, but it's far from a foregone conclusion...

schristley commented 2 years ago

I need to preserve information in my repository so that I can figure out why my rearrangements/clones/cells that are loaded are looking suspicious. When things seem odd in my repository, I need to be able to figure out why.

This is something meaty to chew on. I read this need as a changelog - not denoting input data as immutable. I don't know what a provenance object would look like. I think we need to nail down the use case, scope, etc. But, maybe something like:

I would include an object that records the actual change. In IT systems, these are sometimes called audit tables, i.e., recording when a data field is changed so you can audit those changes at a later time. Our use might be more specific so the structure might be simpler.

Audit:
  properties:
    field_name:
      type: string
      description: field that was changed
    old_value:
      type: string
      description: old value for field
    new_value:
      type: string
      description: new value for field
    data_type:
      type: string
      description: data type of field
      enum:
        - string
        - number

Then you record a list of these objects. You could also add a date-time for when the change occurred and a username for who made the change, but I'm not sure we need those.

Now a basic script can walk through each data record, and for each field_name change new_value back to old_value, and save the data. That should give you the "original" file contents.
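
A minimal Python sketch of such a revert script, assuming flat records with string-valued fields and assuming each `new_value` is unique per field (which should hold if the new identifiers are globally unique). The `Audit` class and the `adc-...` identifiers are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Audit:
    field_name: str
    old_value: str
    new_value: str
    data_type: str = "string"


def revert(records: List[Dict[str, str]], audits: List[Audit]) -> List[Dict[str, str]]:
    """Return copies of records with each audited field set back to old_value."""
    # Index audits by (field, new value) so each field needs a single lookup.
    index = {(a.field_name, a.new_value): a.old_value for a in audits}
    audited_fields = {a.field_name for a in audits}
    restored = []
    for record in records:
        row = dict(record)  # leave the input untouched
        for name in row.keys() & audited_fields:
            key = (name, row[name])
            if key in index:
                row[name] = index[key]
        restored.append(row)
    return restored


# Invented example data: one record was renamed on load, one was not audited.
records = [
    {"sequence_id": "s1", "cell_id": "adc-0001"},
    {"sequence_id": "s2", "cell_id": "adc-0002"},
]
audits = [Audit("cell_id", "TACGGATGTACACCGC-1", "adc-0001")]
original = revert(records, audits)
```

Only string fields are handled here; a `number` data_type would need the stored string converted back before it is written.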

Now you need to define a scope that these changes apply to. This can be done in various ways. One way is if you want all of these audit records to be in a single table, then you need additional fields (foreign keys) that specify what data record was changed. Another approach is that there is an audit table for each object type, like repertoire_audit, rearrangement_audit and so on. That then implicitly defines the scope for the changes.
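
The two scoping approaches carry the same information, which a short Python sketch can show: a single table with foreign keys can be split mechanically into one table per object type. The `object_type` and `record_id` fields and the example identifiers are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScopedAudit:
    object_type: str  # hypothetical foreign key, e.g. "rearrangement" or "cell"
    record_id: str    # hypothetical foreign key: current ID of the changed record
    field_name: str
    old_value: str
    new_value: str


def split_by_object(audits: List[ScopedAudit]) -> Dict[str, List[ScopedAudit]]:
    """Turn one combined audit table into one table per object type."""
    tables = defaultdict(list)
    for audit in audits:
        tables[audit.object_type + "_audit"].append(audit)
    return dict(tables)


# Invented example: the same cell_id change recorded for two object types.
combined = [
    ScopedAudit("rearrangement", "seq-0001", "cell_id", "TACGGATGTACACCGC-1", "adc-0001"),
    ScopedAudit("cell", "adc-0001", "cell_id", "TACGGATGTACACCGC-1", "adc-0001"),
]
per_object = split_by_object(combined)
```

In the per-object form the table name replaces `object_type`, so that column becomes redundant, while `record_id` is still needed to say which record changed.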

I don't think file names and data processing are needed. If we stick to the limited scenario where _id values are changed when data is loaded into ADC, then "reversion" is downloading the data from the ADC and rewriting the _id fields with their old values. You can put data in whatever file you want.

bcorrie commented 2 years ago

This is something meaty to chew on. I read this need as a changelog - not denoting input data as immutable. I don't know what a provenance object would look like. I think we need to nail down the use case, scope, etc. But, maybe something like:

Use case: http://www.ireceptor.org/repositories/provenance

bcorrie commented 2 years ago

And a provenance use case moving between versions of an AIRR Standard release:

http://www.ireceptor.org/repositories/v3-0

bcorrie commented 2 years ago

This is not new, but please don't take away fields that we need for provenance and reproducibility...

bcorrie commented 2 years ago

I think you are making the point for me that this is an inherently local function. No matter how we implement cell_id_annotation_tool, you will always need access to "the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json)." So immediately we've ruled out this being a useful field for anything I download from the ADC (or even get from a collaborator, frankly).

That is the whole point - this isn't meant to be useful to you as a consumer of my data, it is meant to be useful to me as a curator so I can make sure the data you reuse is accurate...

It is a mistake to look at the AIRR Standard from just the perspective of someone who uses the data. If you don't want to use this field, then don't use it, but don't prevent others (a data curator) from having the ability to use the field if there is a use case for it!

scharch commented 2 years ago

If you don't want to use this field, then don't use it, but don't prevent others (a data curator) from having the ability to use the field if there is a use case for it!

No one has suggested preventing you from using the field, the question is whether or not to design and implement it as part of the AIRR schema.

Since there is general openness to the idea, my questions are:

Would a Provenance or Audit object make sense without explicitly linking to/encapsulating the original data files?
If not, are we able to preserve (enough of) that linkage to make the cost/benefit of adding one to the schema worthwhile?

My own sense is that the answers to both are negative, but it's your use case, so I guess you tell me. But again, even if I am correct and we don't implement this as part of the schema, it doesn't stop you from doing it on your own in whatever way works best for you...

bcorrie commented 2 years ago

No one has suggested preventing you from using the field, the question is whether or not to design and implement it as part of the AIRR schema.

If there is a use case from the AIRR Community for it (e.g. those that curate data for reuse), then it should be part of the standard... If it isn't part of the standard then it is next to useless for the AIRR Community...

scharch commented 2 years ago

@bcorrie reserving my remaining philosophical disagreements, let's do the practical side - please answer my questions above

javh commented 2 years ago

If there is a use case from the AIRR Community for it (e.g. those that curate data for reuse), then it should be part of the standard... If it isn't part of the standard then it is next to useless for the AIRR Community...

The use case should be generally applicable though. We don't, as a pseudo-rule, define fields/objects for use cases specific to 1-2 tools. We encourage custom fields in that case. Though, we've usually made exception for the ADC or things that are really obviously necessary.

So... I think if we focus on finding a solution that covers the needs of VDJServer and iReceptor specifically, under the assumption that it'll be applicable to the ADC as a whole, then that should be good enough.

I don't think file names and data processing are needed. If we stick to the limited scenario where _id values are changed when data is loaded into ADC, then "reversion" is downloading the data from the ADC and rewriting the _id fields with their old values. You can put data in whatever file you want.

I rather like the json audit table... It seems like @bcorrie wants the data file links though. What about input and output references to files / records in the ADC? In the spirit of WDL task definitions: https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#task-definition

bcorrie commented 2 years ago

The use case should be generally applicable though. We don't, as a pseudo-rule, define fields/objects for use cases specific to 1-2 tools. We encourage custom fields in that case. Though, we've usually made exception for the ADC or things that are really obviously necessary.

Are there any single-cell tools that don't have a cell_id concept as a link between files/objects in its internal data model? So I would argue it isn't specific to 1-2 tools, it is specific to the concept of a Cell and how single cell tools process such data.

So... I think if we focus on finding a solution that covers the needs of VDJServer and iReceptor specifically, under the assumption that it'll be applicable to the ADC as a whole, then that should be good enough.

As I say, this argument isn't specific to the ADC. It applies to files on disk as well. If you transform data from the 10X file format for two samples into AIRR repertoire, rearrangement, clone, and cell files for processing by an AIRR compliant tool (replacing the 10X cell_id with a unique ID), would you be able to tell me if the data in the AIRR files are a correct transformation of the source data in the 10X files? I don't think you can. If not, you have a non-reproducible data transformation.
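
To illustrate what such a check could look like when the tool's identifier is retained, here is a hedged Python sketch. It assumes each converted row still carries a `sample` label and a `tool_cell_id` field; both names are hypothetical, as are the example identifiers.

```python
from collections import Counter
from typing import Dict, List


def check_conversion(source_rows: List[Dict[str, str]],
                     airr_rows: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found when comparing AIRR rows to their source rows."""
    problems = []
    # Every source cell should contribute the same number of rows after conversion.
    source_counts = Counter((r["sample"], r["cell_id"]) for r in source_rows)
    airr_counts = Counter((r["sample"], r["tool_cell_id"]) for r in airr_rows)
    if source_counts != airr_counts:
        problems.append("rows per source cell differ")
    # Each new cell_id must pair with exactly one (sample, tool id), and vice versa.
    pairs = {(r["cell_id"], (r["sample"], r["tool_cell_id"])) for r in airr_rows}
    new_ids = {new for new, _ in pairs}
    old_ids = {old for _, old in pairs}
    if len(new_ids) != len(pairs) or len(old_ids) != len(pairs):
        problems.append("cell_id mapping is not one-to-one")
    return problems


# Invented example: the same barcode occurs in samples A and B.
source = [
    {"sample": "A", "cell_id": "ATGC-1"},
    {"sample": "A", "cell_id": "ATGC-1"},
    {"sample": "B", "cell_id": "ATGC-1"},
]
good = [
    {"sample": "A", "cell_id": "c1", "tool_cell_id": "ATGC-1"},
    {"sample": "A", "cell_id": "c1", "tool_cell_id": "ATGC-1"},
    {"sample": "B", "cell_id": "c2", "tool_cell_id": "ATGC-1"},
]
merged = [dict(row, cell_id="c1") for row in good]  # two cells wrongly merged
```

With the tool identifier dropped from the converted rows, neither comparison in this particular sketch could be made.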

bcorrie commented 2 years ago

Since there is general openness to the idea, my questions are:

Would a Provenance or Audit object make sense without explicitly linking to/encapsulating the original data files?

I can't answer that question, I don't know what these are or what they look like. I do know that the main reasons for our sequence_file and data_processing_files are for provenance and reproducibility - so to date we have been unable to provide these features without linking to files.

If not, are we able to preserve (enough of) that linkage to make the cost/benefit of adding one to the schema worthwhile?

I am not sure what you are asking here. What linkage are we preserving and what costs/benefits are we talking about...

javh commented 2 years ago

Are there any single-cell tools that don't have a cell_id concept as a link between files/objects in its internal data model? So I would argue it isn't specific to 1-2 tools, it is specific to the concept of a Cell and how single cell tools process such data.

The need to preserve original identifiers is what is tool-specific. Modifying the cell identifiers is common and I can't think of a single-cell tool that cares about tracking the changes to cell_id they may have made. If you want reproducibility, then you provide it via code describing the steps and tool version numbers, not via some sort of schema-defined diff on the files.

I don't see this being applicable outside VDJServer and iReceptor. At least, initially. Which is perfectly fine. It'll give us a clear picture of the problem that needs to be solved. If we try to generalize this to hypothetical tools/problems, then I think we'll get nowhere.

bcorrie commented 2 years ago

@bcorrie wants a mechanism to preserve original identifier fields uploaded by users into the ADC.

Just to be clear about @javh's initial statement. I DO NOT want a mechanism to preserve original fields uploaded by users into the ADC. I DO WANT to preserve fields that we think are important to the use cases we have in the AIRR Community by making them part of the AIRR Standard.

bcorrie commented 2 years ago

Again, to be clear, in regards to this issue, I don't want provenance on individual AIRR fields (to be able to track changes to fields). I want to be able to have data provenance and reproducibility on data that is converted from common tool chains (e.g. 10X) into the AIRR Standard data formats for repertoires, rearrangements, clones, and cells.

In my opinion the AIRR Standard should facilitate this, not hinder it.

javh commented 2 years ago

@bcorrie, All right, I think we need a more precise definition of your problem then.

How is what you want not accomplished via, say, providing a WDL file and docker container for the conversion steps? Or maintaining a copy of the files the user originally uploaded? How is the AIRR Standard even relevant to this problem?

schristley commented 2 years ago

Again, to be clear, in regards to this issue, I don't want provenance on individual AIRR fields (to be able to track changes to fields). I want to be able to have data provenance and reproducibility on data that is converted from common tool chains (e.g. 10X) into the AIRR Standard data formats for repertoires, rearrangements, clones, and cells.

In my opinion the AIRR Standard should facilitate this, not hinder it.

@bcorrie Expanding this to a broader context makes it sound like #313 , if this is your intent then I don't think we need a separate duplicate issue. Close this issue and use the original one. You already have a team working on reproducibility so you should be able to provide specific suggestions on how the AIRR Standards should be updated. Maybe the preliminary data processing design is sound, and you just need to start specifying it in more detail.

If somehow you mean something independent from data processing reproducibility then you need to clearly state how this issue is different, and update the issue title and main comment to reflect that.

bcorrie commented 2 years ago

If somehow you mean something independent from data processing reproducibility then you need to clearly state how this issue is different, and update the issue title and main comment to reflect that.

I didn't create this issue... 8-)

bcorrie commented 2 years ago

All I want is a cell_id (as per the tool definition) and a cell_uid or cell_pid (that can be used broadly and changed at will)

See here: https://github.com/airr-community/airr-standards/issues/347#issuecomment-1028447479

Change the names as you see fit, but this would solve my problem... 8-)

schristley commented 2 years ago

All I want is a cell_id (as per the tool definition) and a cell_uid or cell_pid (that can be used broadly and changed at will)

These are just arbitrary names, what if we call them cell_ref (as per the tool definition) and cell_id (the AIRR identifier)?

bcorrie commented 2 years ago

All I want is a cell_id (as per the tool definition) and a cell_uid or cell_pid (that can be used broadly and changed at will)

These are just arbitrary names, what if we call them cell_ref (as per the tool definition) and cell_id (the AIRR identifier)?

I believe that would work - as long as they are both part of the AIRR standard and cell_ref isn't a "custom" field.

javh commented 2 years ago

I believe that would work - as long as they are both part of the AIRR standard and cell_ref isn't a "custom" field.

Why can't it be a custom field? cell_ref would just be a comment then, with no meaning absent the raw files. Seems very custom to me.

Let's take a concrete example:

Two sets of FASTQ files, one per sample (A and B), processed with cellranger.
Data loaded into R using the standard BioC tools.
Custom downstream analysis.

cell_id goes through the following edits:

FASTQ ~ ATGC observed in both sample A and B. Sample A has two instances of ATGC.
cellranger output ~ ATGC-1 observed in both sample A and B. Sample A also has ATGC-2. No mapping between FASTQ header and -1 and -2.
DropletUtils::read10xCounts ~ Results in 2-ATGC-1 and 2-ATGC-2 in sample A, 1-ATGC-1 in sample B. User didn't specify 2- = A, but you figured it out somehow.
Some downstream tool (I'm looking at you scVI), reassigned the cell identifiers to 2-ATGC-1 -> 1, 2-ATGC-2 -> 400, 1-ATGC-1 -> 20. No record of how it made these changes - assumed to be row number of input data.

None are globally unique. User uploaded final analysis results (4) to the ADC and FASTQ (1) to SRA. How is cell_ref = {1, 400, 20} valuable?
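
The example can be written out as a Python sketch of what tracing back would require: a recorded mapping for every step. The per-step logs below are hypothetical, since the example says the tools do not record them, and the identifiers are written as strings.

```python
from typing import Dict, Hashable, List, Optional

# Hypothetical per-step logs for the example; the tools described do NOT keep these.
steps: List[Dict[Hashable, str]] = [
    # cellranger output (sample, barcode) -> DropletUtils::read10xCounts name
    {("A", "ATGC-1"): "2-ATGC-1", ("A", "ATGC-2"): "2-ATGC-2", ("B", "ATGC-1"): "1-ATGC-1"},
    # read10xCounts name -> identifier assigned by the downstream tool
    {"2-ATGC-1": "1", "2-ATGC-2": "400", "1-ATGC-1": "20"},
]


def trace_back(final_id: Hashable, logs: List[Dict[Hashable, str]]) -> Optional[Hashable]:
    """Follow a final identifier back through each step's inverted mapping."""
    current = final_id
    for step in reversed(logs):
        inverse = {new: old for old, new in step.items()}
        if current not in inverse:
            return None  # the chain is broken at this step
        current = inverse[current]
    return current


found = trace_back("400", steps)
broken = trace_back("400", steps[:1])  # the last step's log is missing
```

Once any single log is missing, the final identifier can no longer be resolved, which is the situation the example describes.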

schristley commented 2 years ago

All I want is a cell_id (as per the tool definition) and a cell_uid or cell_pid (that can be used broadly and changed at will)

These are just arbitrary names, what if we call them cell_ref (as per the tool definition) and cell_id (the AIRR identifier)?

I believe that would work - as long as they are both part of the AIRR standard and cell_ref isn't a "custom" field.

To be more precise, cell_id is the AIRR identifier field for linking AIRR objects in the AIRR Data Model. It should be unique in the local file context and will be overwritten with a globally unique and persistent identifier when the data is loaded into the ADC.

We need a better name than cell_ref, maybe tool_cell_id, it is assigned the original value of cell_id assigned by the tool when the data is loaded into the ADC. Anybody that downloads data from the ADC can either use tool_cell_id directly, or they can copy tool_cell_id to cell_id in the data to "reproduce" the original tool file.
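
A minimal Python sketch of that load and "reproduce" round trip, assuming the rows come from a single tool output file (so the tool's cell_id is unique within them). The function names are hypothetical and `uuid4` is only a placeholder for whatever persistent identifier the ADC would assign.

```python
import uuid
from typing import Dict, List


def load_into_adc(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Copy the tool's cell_id to tool_cell_id and assign a new unique cell_id."""
    new_ids: Dict[str, str] = {}
    loaded = []
    for row in rows:
        tool_id = row["cell_id"]
        if tool_id not in new_ids:
            new_ids[tool_id] = uuid.uuid4().hex  # placeholder for a persistent ADC ID
        loaded.append({**row, "tool_cell_id": tool_id, "cell_id": new_ids[tool_id]})
    return loaded


def restore_tool_ids(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Copy tool_cell_id back into cell_id to reproduce the tool's original values."""
    restored = []
    for row in rows:
        copy = {k: v for k, v in row.items() if k != "tool_cell_id"}
        copy["cell_id"] = row["tool_cell_id"]
        restored.append(copy)
    return restored


# Invented example: two chains from the same cell share one tool cell_id.
tool_rows = [
    {"sequence_id": "s1", "cell_id": "TACGGATGTACACCGC-1"},
    {"sequence_id": "s2", "cell_id": "TACGGATGTACACCGC-1"},
]
loaded = load_into_adc(tool_rows)
round_trip = restore_tool_ids(loaded)
```

Rows that shared a tool cell_id still share the new cell_id, so the links between objects survive the rewrite in both directions.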

This is the same solution I presented earlier, I just suggested *_original_id names, but really we can call them anything.

schristley commented 2 years ago

I believe that would work - as long as they are both part of the AIRR standard and cell_ref isn't a "custom" field.

Why can't it be a custom field? cell_ref would just be a comment then, with no meaning absent the raw files. Seems very custom to me.

It is pretty custom, which was why I was okay with doing something custom for VDJServer, but when it comes to implementation, I don't think these fields need to be put into the AIRR schema file. Instead, they can be specified as rearrangement extensions for the ADC API. This makes sense to me because these fields are unneeded until data is loaded into the ADC. Thus they are technically a provenance feature for the ADC versus for AIRR Standards as a whole.

javh commented 2 years ago

Instead, they can be specified as rearrangement extensions for the ADC API.

Seems fine to me as long as cell_id stores the gid/pid and doesn't get redefined to be the "original" identifier (whatever "original" means).

Though, I like your audit table idea better. It's really the same thing. Just in a fit-to-purpose object instead of Rearrangement.

schristley commented 2 years ago

Though, I like your audit table idea better. It's really the same thing. Just in a fit-to-purpose object instead of Rearrangement.

Me too in the sense that it's general and could be integrated throughout data processing. This other solution is really just a specific hack and doesn't handle the complexities that you point out in your example, but I feel Brian is fixated on just his one use case, 10x cells, and he hasn't thought through the larger context, i.e., what about clones, what about tools that generate their own repertoire_id and/or repertoire_group_id and so on. So we take this as an initial solution, but as the context becomes greater, it might evolve into something like our original ideas.

scharch commented 2 years ago

To be more precise, cell_id is the AIRR identifier field for linking AIRR objects in the AIRR Data Model. It should be unique in the local file context and will be overwritten with a globally unique and persistent identifier when the data is loaded into the ADC.

I want to emphasize that there is a strong coalition around this point: @bussec @javh @scharch and @schristley have all said/agreed with some version of it.

Anybody that downloads data from the ADC can either use tool_cell_id directly, or they can copy tool_cell_id to cell_id in the data to "reproduce" the original tool file.

The above-named also all seem to be onboard with this, possibly with varying degrees of reluctance/eye-rolling about a reserved field vs a custom one. The problem seems to be that @bcorrie (correct me if I'm wrong) is unhappy with the prospect of this copy/overwrite operation and wants cell_id to always and only reflect the pseudorandom string generated by the most recent (per @javh) processing tool to touch the data before upload to the ADC.

But unless I'm wrong about that math, I think we're all pretty clear on where we all stand and we don't need to continue rehearsing the same arguments...

bcorrie commented 2 years ago

To be more precise, cell_id is the AIRR identifier field for linking AIRR objects in the AIRR Data Model. It should be unique in the local file context and will be overwritten with a globally unique and persistent identifier when the data is loaded into the ADC.

I want to emphasize that there is a strong coalition around this point: @bussec @javh @scharch and @schristley have all said/agreed with some version of it.

@bcorrie agrees as well - and has never stated such a field was not needed. We need an identifier at the AIRR Standard level that is unique in the context that it is being considered (local file, etc) and is controlled (i.e. can be overwritten) by the tools using it within that context. The ADC will use it in exactly the same way (in the ADC context), with the uniqueness and persistence criteria of the identifier in the ADC yet to be finalized.

The above-named also all seem to be onboard with this, possibly with varying degrees of reluctance/eye-rolling about a reserved field vs a custom one. The problem seems to be that @bcorrie (correct me if I'm wrong) is unhappy with the prospect of this copy/overwrite operation and wants cell_id to always and only reflect the pseudorandom string generated by the most recent (per @javh) processing tool to touch the data before upload to the ADC.

What @bcorrie has been saying for some time is that we also need the equivalent of tool_cell_id, which has a uniqueness criterion within the DataProcessing context in which it was created. That is, it is the ID of the Cell object as defined by the tool described in the DataProcessing. IMNSHO this is critical to data curation, data provenance, and reproducibility. I would prefer that this be in the AIRR Standard, but I am OK with it being an ADC API extension if others insist that it is not "AIRR Standard worthy" (with the appropriate reluctance, eye roll, and heavy sigh). It looks like @schristley and I agree that this is probably OK so we can move forward with that...

scharch commented 2 years ago

Wait @bcorrie are you ok with this, too, then?

Seems fine to me as long as cell_id stores the gid/pid and doesn't get redefined to be the "original" identifier (whatever "original" means).

If so, does that mean:

We've solved #347?
You are not interested in an Audit object (etc) otherwise?

bcorrie commented 2 years ago

Wait @bcorrie are you ok with this, too, then?

Seems fine to me as long as cell_id stores the gid/pid and doesn't get redefined to be the "original" identifier (whatever "original" means).

I think so - we have a different field to store the adc_tool_cell_id ("the original id") or whatever we call it as an ADC extension, so cell_id is your AIRR cell_id, a field whose uniqueness is AIRR context (file, analysis, ADC) specific.
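As a rough sketch of what that load step could look like, assuming rows are plain dicts from a single DataProcessing and using `adc_tool_cell_id` as the extension field name floated above (not a finalized name). The UUID is only a stand-in, since the uniqueness and persistence rules for ADC identifiers are still undecided, and `load_cells_into_repository` is a hypothetical helper:

```python
import uuid


def load_cells_into_repository(rows, extension_field="adc_tool_cell_id"):
    """Keep the tool's cell_id in an extension field and assign a repository cell_id.

    Assumes all rows come from one DataProcessing, so equal tool cell_id
    values refer to the same cell.
    """
    new_ids = {}  # tool cell_id -> repository cell_id, so linked rows stay linked
    loaded = []
    for row in rows:
        record = dict(row)  # leave the caller's rows untouched
        tool_id = record.get("cell_id")
        if tool_id is not None:
            record[extension_field] = tool_id
            if tool_id not in new_ids:
                # Placeholder for whatever identifier scheme the repository adopts.
                new_ids[tool_id] = str(uuid.uuid4())
            record["cell_id"] = new_ids[tool_id]
        loaded.append(record)
    return loaded
```

Rows that shared a `cell_id` in the tool output still share one after loading, and the original value stays available to the curator in the extension field.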

If so, does that mean:

We've solved #347?

Not so sure, as this is just one identifier that we need to consider, but similar principles might apply

You are not interested in an Audit object (etc) otherwise?

I am interested, I just don't think it is required to resolve the cell_id issue...

javh commented 2 years ago

I am interested, I just don't think it is required to resolve the cell_id issue...

What is the plan to solve this issue with other _id fields? Eg, clone_id. Same thing? More extension fields?

schristley commented 2 years ago

We've solved #347?

Almost @scharch, we need to decide on some specifics. I suggested these two solutions. I think 1 is the way to go, but wanted to offer 2 in case somebody had compelling reasons why that was better. In particular, IEDB does 2 and the current GermlineSet suggests it as well, but neither feels "better" to me. We should pick one solution for all AIRR objects to follow.

Then we actually have to decide on the structure of the identifier itself. While we can say CURIE-like, we should be specific about what we actually mean, because we have to specify how the resolvers work.

bussec commented 2 years ago

DARN!! Looks like I missed the season finale! Did we already discover that @bcorrie and @schristley are actually the same person? And get to the point where that person -- after being implicated in the demolition of a good chunk of SRA's infrastructure -- needs to go into hiding and starts setting up Rogue AIRR Repos on Raspberry Pis along Interstate 84? I wonder what kind of global IDs you would need for that! What a cliffhanger!

schristley commented 2 years ago

DARN!! Looks like I missed the season finale!

A doozy for sure. Like every good season finale, it left us with a cliffhanger, with Scott pondering life as a blockchain developer instead.

And as always they show us snippets of the next season, including:

Scott losing his life savings in a rug pull when he mistakenly buys doggecoin instead of dogecoin.
We also see Scott discovering that the HTTP DELETE method had not been disabled, after which he manages to delete all of the AIRR Data Commons. For completeness, he deletes all of the AIRR community repositories. When asked what happened, his only response is "oops!" He offers a trillion doggecoin (worth $2) as compensation.

Meanwhile:

sciReptor comes alive and rampages across Europe eating all of the laboratory mice. This leads Corey to exclaim, "well I guess we don't have to sequence the loci of those mouse strains!" IMGT responds, "there are mouse strains?"

Coming to you all on the AIRR+ streaming network, for only $29.95 a month!

@javh edit: content.

javh commented 2 years ago

The good news is it looks like we've arrived at a solution! Which is to add some sort of ID history fields to the ADC Rearrangement extension for at least sequence_id and cell_id.

Because this is essentially a custom field solution, I don't have an opinion on implementation. Just some loose suggestions on future-proofing:

Do the same for clone_id.
Use a suffix other than _id to avoid confusion. Eg, _ref, _rid, _fd, _comment, etc. Dependent upon how we resolve persistence (#347) and whether you want this to be primarily an identifier in a file.

I assume we can table the Provenance and Audit ideas, yes?

@javh edit: content.

schristley commented 2 years ago

I've decided to completely disengage from the AIRR Community until further notice.

@javh edit: content.

javh commented 2 years ago

@schristley, understood. I'll follow up separately about whatever work needs to be done.

@javh edit: content.

scharch commented 1 year ago

closed for AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUGGGGGGGGGGGGGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

airr-community / airr-standards

How to approach identifier provenance #589