We need more formal, fully-qualified identifiers for repository objects

schristley commented 4 years ago

This came up in side discussions here and here. Creating a separate issue as those other issues are becoming overloaded with multiple topics.

The id fields we are defining in the AIRR Data Model aren't the complete digital object identifiers required by FAIR when taken in the context of the AIRR Data Commons, because they don't indicate where that object is stored, i.e. they are missing the (F)indable attribute.

Here's what I believe are the key issues and requirements:

There is a key set of identifier fields for linking AIRR objects in the AIRR Data Model
There are two primary scopes for AIRR objects: 1) local analysis scope, and 2) ADC
We would like to define uniqueness criteria for these identifiers so tools can use data from both scopes without requiring special coding to handle those scopes.
For the local analysis scope, tools often aren't concerned with (or aware of) the larger context and might assign identifiers that are only unique in the local scope.
We would like the uniqueness criteria for objects in the ADC to be such that 1) there is no conflict in identifiers across different repositories and 2) the identifier can be used to resolve back to the specific object in the data repository.
F in FAIR says that (meta)data are assigned a globally unique and persistent identifier.
We can specify rules that apply uniformly to both scopes, or we can specify rules specific to each scope.

javh commented 2 years ago

So this is indeed the question and what we are discussing. Does the AIRR spec define cell_id as the cell_id that is recorded as part of an observation from a sequencing/measurement/data processing step? Or does the AIRR spec define the cell_id as a unique identifier that has nothing to do with such an observation/measurement (and therefore can be overwritten and changed by various tools based on their requirements)?

The latter. cell_id is defined as a unique identifier. It will be common for single-cell tools to use the cell barcode sequence extracted from the fastq files as the identifier, but there's no reason they have to use a barcode. It could just as easily be 1, 2, .. and still be a valid (locally) unique identifier.

Current definitions are:

Rearrangement
        sequence_id:
            type: string
            description: >
                Unique query sequence identifier for the Rearrangement. 
                Most often this will be the input sequence header or a substring thereof, 
                but may also be a custom identifier defined by the tool in cases where 
                query sequences have been combined in some fashion prior to alignment. 
                When downloaded from an AIRR Data Commons repository, this will usually 
                be a universally unique record locator for linking with other objects in the 
                AIRR Data Model.
        cell_id:
            type: string
            description: >
                Identifier defining the cell of origin for the query sequence.
            title: Cell index
            example: W06_046_091

In both cases, "identifier" is the important part. Looks like our example cell_id is even a plate/well location instead of a barcode, which isn't an observation (more like an inventory id).

Which is not to say we can't clarify these definitions, but all the _id fields are meant to be unique identifiers that can be used as links across objects. Users can fill the _id fields with observations if those observations also function as unique identifiers, and they often will, but it's not the purpose of the fields.

Edit: Lots.

schristley commented 2 years ago

Which is not to say we can't clarify these definitions, but all the _id fields are meant to be unique identifiers and not observations.

And there is nothing preventing tools from recording additional information that goes beyond or is tangential to AIRR, they just need to use custom fields for that, e.g. 10x_cell_label or 10x_cell_plate_well_barcode

bcorrie commented 2 years ago

The latter. cell_id is defined as a unique identifier.

The AIRR spec currently says nothing about cell_id's uniqueness as far as I can tell... 8-) It just says "Identifier defining the cell of origin for the query sequence". The example we have in the AIRR Spec suggests that it should be documenting the value that comes out of the sequencing run from the experiment that produces it. In the 10X case, this would be the cell barcode (this is what 10X uses in their airr_rearrangements.tsv).

The scope of the currently implied uniqueness is extremely limited (a sequencing run). As far as I know there is nothing in the AIRR Spec that suggests otherwise. This is how we have been treating this field to date.

So what we are talking about, IMHO, is changing the current definition of cell_id as it is in the spec now.

bcorrie commented 2 years ago

Which is not to say we can't clarify these definitions, but all the _id fields are meant to be unique identifiers and not observations.

We have an x-airr attribute called identifier that identifies fields that have this property. It has been attached to some, but certainly not all of the _id fields. And for those fields that it applies to, the scope of that uniqueness is often limited. In some cases, such as cell_id currently, the identifier uniqueness scope is ill defined (mostly implied by the example). This is what I was trying to clarify here (without changing the definition): https://github.com/airr-community/airr-standards/pull/574#issuecomment-1026132979

As I have mentioned before, I think we need to think carefully about assuming these identifier fields can be changed by the tools (or repositories) that are processing data that have such fields.

schristley commented 2 years ago

So what we are talking about, IMHO, is changing the current definition of cell_id as it is in the spec now.

The current definitions of clone, cell and the other new objects are drafts, so I don't see this as a problem. We've known for a while that the uniqueness scope of the identifiers was not well specified and would need to be resolved.

In fact, you had a long engaged issue #246 about the uniqueness of the _id fields, and in one of your concluding comments pointed to the other issues that were handling this on a case-by-case basis (none of which seem to be closed btw). Your quote is probably a good summary:

Even though we haven't gone the full UUID and PID/DOI route yet on any of these, we haven't excluded them...

which gets back to my recent attempt to kickstart discussion. Though as usual, what I thought was the main issue has morphed into something else, so my initial idea about a solution isn't really so valid anymore.

javh commented 2 years ago

Uniqueness is implied by it being an identifier field. A non-unique identifier can't identify. But, we can/should clean up the wording. If we replace "Identifier" with "Unique identifier" in the cell_id definition that's a non-change in my book, because it's just clarifying the existing definition. Worth doing though.

IIRC, we've had a very similar conversation before, which is why the text of sequence_id is so verbose and includes specifics about the ADC. rearrangement_id was added to act, essentially, as a UUID in the ADC, but we decided that was redundant with sequence_id and that a proliferation of redundant identifiers was gross. So we dropped rearrangement_id and clarified the wording on sequence_id. The semantics were better on rearrangement_id, but sequence_id was already published and in use, so the rename wasn't worth the backwards compatibility break.

I have a hazy memory of the same conversation about preservation of the fasta/q sequence headers...

I don't see any way for data imports into the ADC to avoid being mildly destructive unless you want to maintain two complete records for the raw uploaded data and what is queryable. Should you keep Adaptive's non-standard gene names in v_call and add a v_call_gid with the corrected names? If we add sequence_gid and some tool puts in bad values for it, do we then need to add a sequence_omg_for_serious_gid field so we can preserve the tool's output? There's going to be some line that needs to be drawn about what needs to go into the ADC and what doesn't.

javh commented 2 years ago

Two little things:

In the 10X case, this would be the cell barcode (this is what 10X uses in their airr_rearrangements.tsv). The scope of the currently implied uniqueness is extremely limited (a sequencing run). As far as I know there is nothing in the AIRR Spec that suggests otherwise. This is how we have been treating this field to date.

That is how cellranger defines the scope because it processes one library at a time. Other tools may have a different scope (eg, DropletUtils:read10xCounts). So that scope limitation is specific to cellranger, not cell_id.

We have an x-airr attribute called identifier that identifies fields that have this property. It has been attached to some, but certainly not all of the _id fields.

Yeah, that's because x-airr.identifier = True implies a required field in the full AIRR Data Model. If there's an _id field not denoted as such, then it should be either because there's no relevant cross-reference in the schema or it's an oversight/typo in a draft schema.

schristley commented 2 years ago

I've updated the header comment with what I believe are the main issues and requirements for our discussion.

The last point is interesting:

We can specify rules that apply uniformly to both scopes, or we can specify rules specific to each scope.

My current opinion is we cannot have rules that apply uniformly to both scopes. Specifically, while tools in the local analysis scope can generate globally unique identifiers, i.e. UUIDs, they cannot generate persistent identifiers. Persistence, and thus the F in FAIR, only really applies when the data is loaded into the ADC.

If that's true, then it follows that the ADC must alter the identifiers in some way to provide persistence.

scharch commented 2 years ago

My argument is that cell_id as it is in the current spec (as almost all fields in the Cell object are) is designed to capture the ID of the cell from the sequencing/processing pipeline. If a tool produces an ID for a cell in a data set, it should go in cell_id. Just like v_call records the V gene call that the annotation tool produces, and junction_aa records the Junction AA sequence.

@bcorrie This thought is completely bizarre and bewildering to me. Why on Earth would you compare cell_id to fields like v_call and junction_aa???? Those latter carry information that describes something about the object they are part of (to the point where we've gone and created a whole germline schema to make sure that v_call is fully resolvable). *_id fields don't contain any information --that cell over there isn't named "John Jacob Jingleheimer Schmidt," any more than "mAb114" conveys any useful information about a possible therapeutic antibody against Ebola. As @javh said, it's just an identifier. There's nothing special about the fact that the sequencing barcode of the GEM that this cell was found associated with happened to be AAAACCCCGGGGTTTT instead of AAAACCCCTTTTGGGG, and there's no special relationship between the cell associated with GEM barcode ACGTACGTACGTACGT and TGCATGCATGCATGCA compared to the one associated with TTTTGGGGCCCCAAAA. Quoting @javh again, they might as well be 1, 2, 3...

The scope of the currently implied uniqueness is extremely limited (a sequencing run). As far as I know there is nothing in the AIRR Spec that suggests otherwise. This is how we have been treating this field to date.

No. The scope is (and always has been, as far as I'm concerned) as broad as necessary for it to function as an identifier in the current context. For my local analysis, it's the project directory I'm working in. For the ADC, it's a GID.

We can specify rules that apply uniformly to both scopes, or we can specify rules specific to each scope.

My current opinion is we cannot have rules that apply uniformly to both scopes.

It's a false dichotomy.

If that's true, then it follows that the ADC must alter the identifiers in some way to provide persistence.

Yes.

As I have mentioned before, I think we need to think carefully about assuming these identifier fields can be changed by the tools (or repositories) that are processing data that have such fields.

No. The fact that they are identifier fields means that tool authors should be on notice that they are subject to change/reassignment by downstream tools/repositories, and we can write that into the definition of the field if you want. As @schristley pointed out, if there is actual information in there that needs to be preserved, it should be put in a custom field --and that can be written into the definition, as well.

scharch commented 2 years ago

We would like to define uniqueness criteria for these identifiers so tools can use data from both scopes without requiring special coding to handle those scopes.

@schristley, I don't think this is relevant. Precisely because of the next point (tools are lazy about uniqueness in the local context), tools are going to have to have ways to handle combining multiple local scopes. That's exactly why cellranger and SONAR already have this functionality...

schristley commented 2 years ago

We would like to define uniqueness criteria for these identifiers so tools can use data from both scopes without requiring special coding to handle those scopes.

@schristley, I don't think this is relevant. Precisely because of the next point (tools are lazy about uniqueness in the local context), tools are going to have to have ways to handle combining multiple local scopes. That's exactly why cellranger and SONAR already have this functionality...

Sure, but my point is there should be some guidelines to prevent conflicts or confusion, so maybe my bulletpoint is too imprecise.

For example, repertoire_id and repertoire_group_id both belong to top-level objects and should be unique at the local "top-level", so a tool that assigns those identifiers should conform to that; likewise a tool should be able to rely upon that (under most circumstances, like if the data comes from the ADC).

Furthermore, there is a hierarchy for some identifiers, like we say that subject_id and sample_id are unique within the same study_id. If we have that for other identifiers like clone_id and cell_id then we should be explicit what that is in the AIRR Data Model, so tools can conform and rely upon it.
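
As a rough illustration of what "conform and rely upon it" could mean for a hierarchical rule such as "sample_id is unique within the same study_id", here is a minimal Python sketch. The function name `find_scope_conflicts`, the record layout, and the study/sample values are all hypothetical and not part of the AIRR schema or any AIRR library.

```python
from collections import Counter


def find_scope_conflicts(records, scope_field, id_field):
    """Return the (scope, id) pairs that occur more than once in `records`."""
    counts = Counter((r[scope_field], r[id_field]) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical records: the same sample_id may appear in different studies,
# but must not repeat within a single study.
samples = [
    {"study_id": "study_A", "sample_id": "S1"},
    {"study_id": "study_A", "sample_id": "S2"},
    {"study_id": "study_B", "sample_id": "S1"},  # fine: different study
    {"study_id": "study_B", "sample_id": "S1"},  # conflict within study_B
]

conflicts = find_scope_conflicts(samples, "study_id", "sample_id")
print(conflicts)
```

The same check would apply to clone_id or cell_id once the AIRR Data Model states which parent identifier scopes them.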

bussec commented 2 years ago

After having been entertained by the discussion that took place here over the last weeks (and having run out of popcorn), here are my two cents:

1. I agree with @javh and @scharch in that IDs are just identifiers. Therefore they MUST in general be exchangeable (for potential exceptions see 4).
2. Coming back to my earlier comment on PIDs, I think that the ADC MUST ensure that the criteria "globally unique" and "persistence" are met for ALL its _id fields:
   - "persistence": This is required if we consider the ADC to be a long-term repository infrastructure
   - "globally unique": While this can be created with a hierarchical schema in which individual components do not need to be globally unique, I was wondering whether anything argues against using 128 bit UUIDs in general.
   - criteria apply to all _id: IMO this approach is easier than defining different scopes for an ever growing number of _id.
3. The ADC SHOULD make _id also "resolvable" and thereby create full-fledged PIDs. Note that the "resolvable" criterion only requires that there is a generally known resolver; it does not mean that this resolver must be the DNS or the Handle system.
4. The only scenario that I could come up with in which IDs created during the primary analysis could not be simply replaced is when they are referenced by other data sets of a study, which are not loaded into the ADC. However, this case would IMO be appropriately handled with a Provenance object (as suggested by @javh) which contains the field name, the current and the previous _id.
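
To make the UUID and Provenance ideas above concrete, here is a minimal Python sketch. It assumes a repository replaces an incoming _id with a random 128-bit UUID on load and records the field name, the current value and the previous value. The function `reassign_id` and the keys `field`, `current_id` and `previous_id` are hypothetical; no Provenance object is defined in the AIRR schema at this point.

```python
import uuid


def reassign_id(record, field):
    """Replace record[field] with a new UUID and return a provenance entry."""
    previous = record[field]
    current = str(uuid.uuid4())  # random 128-bit UUID (version 4)
    record[field] = current
    # Hypothetical provenance layout: field name, current and previous value.
    return {"field": field, "current_id": current, "previous_id": previous}


# The example cell_id from the current schema definition.
cell = {"cell_id": "W06_046_091"}
provenance = reassign_id(cell, "cell_id")
print(cell["cell_id"], provenance["previous_id"])
```

A UUID on its own only gives global uniqueness; persistence and resolvability would still depend on the repository keeping the record and on a known resolver.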

javh commented 2 years ago

After reading through the Groundhog Day thread (#340), I think @bussec's summary reflects the consensus. My reading is that the main questions to be agreed on in this thread are whether IDs should be persistent in the ADC context, rather than just globally unique, and how to implement such identifiers. And all the ADC folks seem to agree on persistent.

So... I made a new issue for the identifier provenance question: #589

bcorrie commented 2 years ago

@bcorrie This thought is completely bizarre and bewildering to me. Why on Earth would you compare cell_id to fields like v_call and junction_aa????

Primarily because they are all produced by a tool, and any data that is produced by that tool across multiple files will use that field consistently across all of the files for that data (cell_id in 10X produced data are scattered across many files - and you can't really do anything without that linking _id field). So if I ever want to go back and try to understand something about the data in my repository by looking at the original data, I can. That is only possible if the tool produced cell_id is stored in the repository and can be referenced in the original data. In my opinion, this is critical in supporting researchers that are curating data in an ADC repository. You as the user/consumer may not care - but to the data steward/curator trying to manage the data provenance of the data in a repository they manage this is important.

I agree with @bussec in that there are a set of _id fields that the ADC should identify as having globally unique and/or PID characteristics. I do not agree that all fields with _id should have these characteristics...

My argument is that since the cell_id is a field that is produced by many pipelines that process Cell data, maybe it isn't a great idea for us to use that field name as the field that contains a PID for a cell (which is what the ADC requires). We absolutely need a PID field, but I think it is a mistake to throw the tool generated linking field across these files away!

schristley commented 2 years ago

I'm seeing two potential solutions:

1. _id fields satisfy both global uniqueness and persistence. This implies some CURIE-like value that provides both properties.
2. Separate global uniqueness and persistence. The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

We've mainly been considering 1 but GermlineSet uses 2 in its draft. Here are pros/cons that I can think of:

PRO: 1 has fewer fields.
PRO: the CURIE-like structure of 1 almost guarantees global uniqueness.
CON: 1 requires a resolver, to interpret the value and translate into a URL. CURIE prefixes need to be stored in Schema; we would need to update whenever a new data repository (new prefix) is added to ADC. We might get around this by having an ADC registry.
PRO: 2 could use a resolver but it could simply be the direct URL, e.g., https://vdjserver.org/airr/v1/repertoire/1159043104164212245-242ac114-0001-012
CON: If 2 uses a CURIE-like resolver, it seems redundant; might as well just use 1.
PRO: Using a resolver allows flexibility in the data repository, i.e. hostnames can change, resolvers can be updated with new features, etc.
CON: Having a fixed URL in 2 provides less flexibility, in order to be persistent that host/API must always be available.
CON: 1 requires re-assigning the identifier values in the ADC, for example, a repertoire_id might be vdjserver:123. This could be a heavy burden on data repositories as they might need to update all the records in the database (they could also do some translation on the fields during input/output)
CON: For 1, tools have no way to assign the persistent value, so the ADC would always need to overwrite the _id values.
PRO: For 2, tools that assign UUIDs, those UUIDs could potentially be kept when loading into the ADC.
CON: For 2, tools that don't assign UUIDs, the ADC would need to overwrite the _id values.

Any other pros/cons?

Regardless of 1 or 2, the ADC needs the ability to overwrite any local values assigned by tools when data is loaded into the ADC.

IMO, I'm leaning toward 1 at the moment. The main CON is it requires re-assigning identifier values in the ADC, but I think the flexibility of a CURIE-like resolver is a significant PRO.
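
For illustration, a CURIE-like resolver for option 1 might look like the minimal Python sketch below. It assumes a registry that maps a repository prefix to a base URL. The registry contents, the function name `resolve_curie`, and the `object_type` parameter are hypothetical; the URL layout simply mirrors the vdjserver example URL given for option 2.

```python
# Hypothetical prefix registry; in practice this might live in an ADC registry
# rather than in the schema, so new repositories don't require a schema update.
PREFIX_REGISTRY = {
    "vdjserver": "https://vdjserver.org/airr/v1",
}


def resolve_curie(curie, object_type):
    """Translate a 'prefix:local_id' identifier into a URL via the registry."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE-like identifier: {curie!r}")
    try:
        base = PREFIX_REGISTRY[prefix]
    except KeyError:
        raise ValueError(f"unknown repository prefix: {prefix!r}") from None
    return f"{base}/{object_type}/{local_id}"


url = resolve_curie("vdjserver:123", "repertoire")
print(url)
```

If a repository's hostname changes, only the registry entry needs updating; the stored identifier vdjserver:123 stays the same, which is the flexibility argued for above.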

scharch commented 2 years ago

Groundhog Day thread (#340)

😱

scharch commented 2 years ago

My argument is that since the cell_id is a field that is produced by many pipelines that process Cell data, maybe it isn't a great idea for us to use that field name as the field that contains a PID for a cell (which is what the ADC requires). We absolutely need a PID field, but I think it is a mistake to throw the tool generated linking field across these files away!

But we already do this for sequence_id, and I can't see how cell_id (or clone_id or data_processing_id or...) is any different. I do understand the desire to be able to trace data back to its source, but the nature of the schema already limits this: sequence_ids can't really be traced back to raw fastqs without a lot of work to re-execute the DataProcessing, and even that assumes that the DataProcessing is actually complete/fully specified and the link out to SRA/etc is stable and correct. And in some cases (especially Tree generation), even a complete DataProcessing may not be deterministic...

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields. We've talked in the past about ADC-specific extensions to the schema, and a Provenance object seems like a good fit for that category...

bcorrie commented 2 years ago

@bcorrie This thought is completely bizarre and bewildering to me.

It also seems bizarre and bewildering to me that we are so adamant that we throw this information away! Why is there such a reluctance to having an extra field that captures this info as part of the standard? There is a very strong data curation use case to keep it, so I am also bewildered... 8-) The standard isn't just about analysis, but data reusability and data curation.

bcorrie commented 2 years ago

But we already do this for sequence_id, and I can't see how cell_id (or clone_id or data_processing_id or...) is any different.

Yep, and I argued strongly against that one too - but caved in because it was only one field...

scharch commented 2 years ago

It also seems bizarre and bewildering to me that we are so adamant that we throw this information away! Why is there such a reluctance to having an extra field that captures this info as part of the standard? There is a very strong data curation use case to keep it, so I am also bewildered... 8-) The standard isn't just about analysis, but data reusability and data curation.

Because there is no "information" there that is being discarded! And trying to preserve the original value of the field by adding a new field pollutes the schema without adding any analysis benefit in the ways that @javh and I have been arguing through (apparently) two entire threads now :-)

bcorrie commented 2 years ago

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

scharch commented 2 years ago

So if I ever want to go back and try to understand something about the data in my repository by looking at the original data, I can. That is only possible if the tool produced cell_id is stored in the repository and can be referenced in the original data.

This implies that you are also storing the entire dataset in its original format somewhere accessible-but-outside-of-the-ADC?!? But isn't the point of the ADC to be the copy of record so that the original becomes irrelevant? Do you really have 2 copies of everything in iReceptor?

scharch commented 2 years ago

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

It's not "information." Metadata, perhaps. And if curation isn't part of the ADC, then who are we doing this for? It's not part of the end-user data reuse process...

javh commented 2 years ago

The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

I think having _ref fields is fine, but I don't see them as a solution here. If a _ref is a foreign key / citation when uploaded, there's no guarantee that it's going to remain a valid reference in the future and it can't be trusted as a linking identifier in the ADC. If I'm understanding the _ref field correctly, then it's really just a more formal comment string.

schristley commented 2 years ago

The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

I think having _ref fields is fine, but I don't see them as a solution here. If a _ref is a foreign key / citation when uploaded, there's no guarantee that it's going to remain a valid reference in the future and it can't be trusted as a linking identifier in the ADC. If I'm understanding the _ref field correctly, then it's really just a more formal comment string.

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object. The point being that the persistence attribute is separated from the global uniqueness attribute. Regardless, I don't think they need to be separated, but I offered it as an alternative solution in case somebody thought of some PROs.

javh edit: Sorry @schristley, I accidentally edited this instead of quoting (I don't know how). Should be restored now.

bcorrie commented 2 years ago

So if I ever want to go back and try to understand something about the data in my repository by looking at the original data, I can. That is only possible if the tool produced cell_id is stored in the repository and can be referenced in the original data.

This implies that you are also storing the entire dataset in its original format somewhere accessible-but-outside-of-the-ADC?!? But isn't the point of the ADC to be the copy of record so that the original becomes irrelevant? Do you really have 2 copies of everything in iReceptor?

Nope, but we want to support reproducibility wherever we can... So no data in the pipeline is ever really irrelevant.

The point of the ADC is data sharing, data reuse, and reproducibility. I would argue that is the point of the AIRR Standard as well. The AIRR Standard points to source records of information throughout. SRA files (RawSequenceData object), INSDC Bioproject information (study_id), the files in which the annotated data came from (data_processing_files), etc. These are critical for reproducibility. We don't store everything, but we try to make it possible to reproduce everything...

Curation is part of this entire process - it is not specific to the ADC. If you describe a study using the AIRR Standard, you are curating data according to the AIRR Standard.

If you want to be truly reproducible, at any point in the processing pipeline, I should be able to use the AIRR Standard to go from one processing step to another processing step, and be able to reproduce where a piece of data came from.

Here is my curator use case. I don't need to be using the ADC for this, this could be using studies curated for analysis and stored completely on disk using the AIRR format files for repertoire, rearrangement, cell, clone, etc.

As a data curator if I want to confirm that data in my AIRR files (or my ADC repository) is correct, I SHOULD be able to go back to my source files and confirm this is indeed the case. When I lost sequence_id, I lost the ability to do that for the original fastq files - damn, but hey it is only the sequence that we are talking about, and we have millions 8-)

But now we are talking about cells, which have complicated linkages across rearrangements, clones, cells, and gex data. In the case of annotation tools, these linkages are across many files. So when I process some 10X studies (N samples from one study and M samples from another study) generating AIRR compliant files in preparation for analysis, I replace the source 10X cell_id with a unique AIRR cell_id to make sure cell_id is unique across my analysis of interest.

Now I want to confirm that the data I just processed for a certain 10X cell_id (TACGGATGTACACCGC-1) from a single subject in my source data is correct across the data I am going to use for my analysis. I can't...

Similarly, if I want to look at an AIRR unique cell_id in my processed data and then find the source information in the original 10X produced data files. Again, I can't...

So we have broken the link between the data in the AIRR compliant files to the original source data - data/"information" can no longer be mapped between the two...

Now if you truly trust the tools that do all of that processing, then maybe you don't want to do any provenance or reproducibility checks... But that is not how I would do things 8-)

Here is an example of what you get from a repository with our current implementation. If I maintain the annotation tool cell_id in some form, I can cross check the validity of the data I loaded with the original 10X files. If I don't, I can't... If you are a data steward maintaining an ADC repository, this is an important step...

Basically I want to be able to ensure that cell_id_annotation_tool = TACGGATGTACACCGC-1 links the correct data in the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json) that I as the data curator have maintained...


 curl -d '{"fields":["study.study_id","sample.sample_id", "sample.sequencing_files.filename", "data_processing.data_processing_files"]}' http://single-cell.ireceptor.org/airr/v1/repertoire

[Some stuff deleted/edited]

{
    "study": {
        "study_id": "PRJCA002413"
    },
    "sample": [
        {
            "sample_id": "ERS1",
            "sequencing_files": {
                "filename": "CRR126571_f1.fastq.gz, CRR126572_f1.fastq.gz, CRR126573_f1.fastq.gz, CRR126574_f1.fastq.gz"
            }
        }
    ],
    "data_processing": [
        {
            "data_processing_files": [
                "ERS1-TRA.tsv"
            ]
        }
    ]
},
{
    "study": {
        "study_id": "PRJCA002413"
    },
    "sample": [
        {
            "sample_id": "ERS1",
            "sequencing_files": {
                "filename": "CRR126563_f1.fastq.gz, CRR126564_f1.fastq.gz, CRR126565_f1.fastq.gz, CRR126566_f1.fastq.gz"
            }
        }
    ],
    "data_processing": [
        {
            "data_processing_files": [
                "ERS1-vdj_b_gex.json",
                "ERS1-vdj_b-cells.json",
                "ERS1-vdj_t_gex.json",
                "ERS1-vdj_t-cells.json"
            ]
        }
    ]
}
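
The curation use case above could be sketched as follows, assuming that cell_id is made unique across samples by prefixing the sample identifier while the value emitted by the annotation tool is kept in a custom field. The function `make_cell_ids_unique`, the `sample:barcode` format, and the second sample name `ERS2` are hypothetical; `cell_id_annotation_tool` is the custom field name used in this comment, not an AIRR-defined field.

```python
def make_cell_ids_unique(rows, sample_id):
    """Prefix each cell_id with the sample while keeping the tool's value."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the source rows are left untouched
        # Custom (non-AIRR) field holding the value the annotation tool emitted.
        new_row["cell_id_annotation_tool"] = row["cell_id"]
        new_row["cell_id"] = f"{sample_id}:{row['cell_id']}"
        out.append(new_row)
    return out


# The same 10X barcode can occur in two samples; "ERS2" is a made-up sample.
ers1 = make_cell_ids_unique([{"cell_id": "TACGGATGTACACCGC-1"}], "ERS1")
ers2 = make_cell_ids_unique([{"cell_id": "TACGGATGTACACCGC-1"}], "ERS2")
combined = ers1 + ers2
print([row["cell_id"] for row in combined])
```

With the original barcode retained, a row in the combined data can still be matched against the source 10X files for its sample.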

scharch commented 2 years ago

@bcorrie I am happy to stipulate to the importance of being able to trace the provenance of a piece of data. But I am going to respond to the rest in the new Provenance object thread (#589) so that we don't crush all of @javh's hopes and dreams...

schristley commented 2 years ago

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

@bcorrie We don't have to keep going round and round on this in this issue. I brought up the issue initially, and I was happy with doing a custom solution, but you'd like something more formal, which is fine. That's been recognized with #589 and we can discuss solutions there. Let's get this issue back onto its main track of FAIR for ADC objects.

javh commented 2 years ago

@schristley

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object.

I don't think it helps. At least, not as I'm interpreting it. The _ref being foreign is the rub. Which, I think is fine as metadata, but won't work as an ID in the ADC because you can't update the foreign record (eg, to fix v_call, remove sequencing adapters, or whatever).

I guess the question is whether that's a problem.

schristley commented 2 years ago

@schristley

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object.

I don't think it helps. At least, not as I'm interpreting it. The _ref being foreign is the rub. Which, I think is fine as metadata, but won't work as an ID in the ADC because you can't update the foreign record (eg, to fix v_call, remove sequencing adapters, or whatever).

I'm not sure what you mean by "foreign". If you are thinking "foreign key", that's not what is meant. I also don't understand how "update the foreign record" matters. This is persistent access to a read-only object.

According to FAIR, (meta)data are assigned a globally unique and persistent identifier. There isn't the requirement that these two attributes are satisfied by a single field. For example, IEDB splits them into two fields, one which is the identifier (which doesn't look globally unique but is because IEDB is a central database), and another which is the IRI for persistence.

Reference ID | Reference IRI | Epitope ID | Epitope IRI
-- | -- | -- | --
1004580 | http://www.iedb.org/reference/1004580 | 16878 | http://www.iedb.org/epitope/16878

javh commented 2 years ago

@schristley, Ah, I see... maybe. I'm getting my signals crossed here. I was thinking of _ref as described in the Germline schema and discussed in the last call. Which is, for example, the GenBank accession providing evidence for a novel allele, so, yes, a foreign key.

The _ref you're describing is the _pid field we've been discussing in this thread, except that it is not being used as the ADC linking identifier. Correct?

schristley commented 2 years ago

@schristley, Ah, I see... maybe. I'm getting my signals crossed here. I was thinking of _ref as described in the Germline schema and discussed in the last call. Which is, for example, the GenBank accession providing evidence for a novel allele, so, yes, a foreign key.

The _ref you're describing is the _pid field we've been discussing in this thread, except that it is not being used as the ADC identifier. Correct?

Right. Sorry, I was mentioning _ref in terms of germline_set_ref which is essentially a persistent IRI that is separate from the identifier germline_set_id, and not the references to foreign records.

IMO, germline_set_ref satisfies both global uniqueness and persistence, so there really isn't a need for two fields...

CON: If 2 uses a CURIE-like resolver, it seems redundant; might as well just use 1.

schristley commented 2 years ago

I just thought of another major CON for doing 2 instead of 1.

CON: With 2, all references to an object must include both fields because the _id isn't sufficient to resolve the object.

For example, say I had a rearrangement record that references a clone_id, but the Clone data is not provided as part of the data set. The clone_id is insufficient to get the clone data, I would also need clone_pid (or clone_ref) so that I could resolve and download the object. This implies that in all of the AIRR objects, we would need both fields, creating a lot of additional fields to be maintained.

schristley commented 2 years ago

Thinking about the actual content of the identifier, if we go with a CURIE-like structure, where we need a resolver, we can support decentralized identifiers later on, if we want. It would just involve extending the resolver code. We can support both and repositories can pick the one they want to implement.

The other thing is whether a type is needed as part of the identifier:

repository:type:code

vdjserver:repertoire:124
vdjserver:germline_set:145
vdjserver:clone:567

But maybe this isn't needed? The reason is that the field (repertoire_id, germline_set_id, clone_id, etc.) essentially defines the type. If the identifier is in repertoire_id then we know it's a repertoire, if it is in clone_id then we know it's a clone, and so on. In the AIRR schema, we don't mix and match identifier types in the same field, nor do we have generic fields. Though this means resolving requires knowing the context (field) of the identifier: if somebody just gave you the value vdjserver:124, it couldn't be resolved properly by itself. Maybe this goes against the identifier being "self-contained"?
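As a rough illustration of the two-part form, here is a minimal Python sketch. It assumes a hypothetical registry mapping a repository prefix to an ADC base URL and a hypothetical table mapping each *_id field to an API endpoint. Only the vdjserver base URL and the repertoire endpoint appear in this thread; the other endpoint names and the resolve function itself are placeholders, not part of any AIRR specification.

```python
# Hypothetical prefix registry; only the vdjserver base URL appears in this thread.
REPOSITORIES = {"vdjserver": "https://vdjserver.org/airr/v1"}

# The field holding the identifier supplies the type. Endpoint names other
# than "repertoire" are placeholders for illustration.
FIELD_ENDPOINTS = {
    "repertoire_id": "repertoire",
    "clone_id": "clone",
    "germline_set_id": "germline_set",
}


def resolve(field, identifier):
    """Resolve a repository:code identifier, using the field as the type."""
    prefix, sep, code = identifier.partition(":")
    if not sep or not prefix or not code:
        raise ValueError(f"not a repository:code identifier: {identifier!r}")
    if prefix not in REPOSITORIES:
        raise KeyError(f"unknown repository prefix: {prefix!r}")
    if field not in FIELD_ENDPOINTS:
        raise KeyError(f"field does not define an object type: {field!r}")
    # The full identifier, prefix included, is passed through unchanged.
    return f"{REPOSITORIES[prefix]}/{FIELD_ENDPOINTS[field]}/{identifier}"


print(resolve("repertoire_id", "vdjserver:124"))
print(resolve("clone_id", "vdjserver:124"))
```

The same value resolves to two different objects depending on the field it came from, which is the "not self-contained" concern in concrete form.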

Another point is that the complete value is the identifier value, so an ADC API call for that specific repertoire_id would be

https://vdjserver.org/airr/v1/repertoire/vdjserver:124

Likewise, when sending a POST query

{
    "filters":{
                "op":"in",
                "content": {
                    "field":"repertoire_id",
                    "value":[
                        "vdjserver:2366080924918616551-242ac11c-0001-012",
                        "vdjserver:2541616238306136551-242ac11c-0001-012",
                        "vdjserver:1993707260355416551-242ac11c-0001-012",
                        "vdjserver:1841923116114776551-242ac11c-0001-012"
                    ]
                }
    }
}

If this wasn't the case, that is, if just the trailing code (or number) was the identifier, users would have to constantly parse the value to pull out the appropriate bits.
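A small Python sketch of what that looks like on the client side, assuming hypothetical helper names (repertoire_url, repertoire_filter). The base URL and the filter shape are copied from the examples above; nothing is sent over the network here.

```python
import json

BASE_URL = "https://vdjserver.org/airr/v1"  # from the example URL above


def repertoire_url(repertoire_id):
    # The identifier is appended verbatim; nothing is stripped or parsed.
    return f"{BASE_URL}/repertoire/{repertoire_id}"


def repertoire_filter(repertoire_ids):
    # Same shape as the POST query above, with full identifiers as values.
    return {
        "filters": {
            "op": "in",
            "content": {"field": "repertoire_id", "value": list(repertoire_ids)},
        }
    }


ids = [
    "vdjserver:2366080924918616551-242ac11c-0001-012",
    "vdjserver:2541616238306136551-242ac11c-0001-012",
]
body = json.dumps(repertoire_filter(ids))
print(repertoire_url("vdjserver:124"))
print(body)
```

Because the value stored in repertoire_id is used as-is in both calls, client code never needs to know how the identifier is structured.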

This also means that our CURIE-like resolver cannot manipulate the identifier in any way, which is done for some ontology fields. If the identifier values change, for queries, for data returned from the ADC, etc., then it fails at being an identifier and objects cannot be linked.

bcorrie commented 7 months ago

I am thinking that this issue is probably not going to be resolved in v2.0 (and doesn't need to be resolved in 2.0). Moving this to v2.1.

schristley commented 6 months ago

@bcorrie In some sense, I think we are making this issue more complicated than it needs to be, at least in the context of the ADC. All we need to do is make these identifiers (in the ADC) be CURIEs. The prefix part points to the global service, i.e. the ADC repository, and the local identifier part can be whatever the ADC repository chooses to interpret. I think that James' presentation of LinkML and his discussion of CURIEs shows that it works quite well for creating globally unique identifiers that can be resolved and be FAIR.
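To make the idea concrete, a minimal Python sketch, assuming a hypothetical prefix map. The example_repo prefix and its URL are made up for illustration, and split_curie and repository_for are not part of any AIRR library.

```python
# Hypothetical prefix map: CURIE prefix -> ADC repository base URL.
PREFIX_MAP = {
    "vdjserver": "https://vdjserver.org/airr/v1",
    "example_repo": "https://repo.example.org/airr/v1",  # made-up repository
}


def split_curie(curie):
    """Split on the first colon only; the local part stays opaque."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, local_id


def repository_for(curie):
    """Look up which repository is responsible for interpreting the CURIE."""
    prefix, _ = split_curie(curie)
    return PREFIX_MAP[prefix]


# Two repositories may reuse the same local identifier without conflict,
# because the prefix keeps the full identifiers distinct.
records = {
    "vdjserver:124": {"source": repository_for("vdjserver:124")},
    "example_repo:124": {"source": repository_for("example_repo:124")},
}
print(len(records))
```

The client only ever inspects the prefix; whatever follows the first colon is left for the repository to interpret.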

AKC is going to need them. The question is do we implement them first in the data integration scripts (ADC --> AKC) as a test then port them back into the ADC, or just put them in the ADC first?

airr-community / airr-standards

We need a more formal, fully-qualified identifiers for repository objects #347