We need more formal, fully-qualified identifiers for repository objects

schristley commented 4 years ago

This came up in side discussions here and here. I'm creating a separate issue as those other issues are becoming overloaded with multiple topics.

The id fields we are defining in the AIRR Data Model aren't complete digital object identifiers required by FAIR when taken in context of the AIRR Data Commons because they don't indicate where that object is stored, i.e. they are missing the (F)indable attribute.

Here's what I believe are the key issues and requirements:

There is a key set of identifier fields for linking AIRR objects in the AIRR Data Model
There are two primary scopes for AIRR objects: 1) local analysis scope, and 2) ADC
We would like to define uniqueness criteria for these identifiers so tools can use data from both scopes without requiring special coding to handle those scopes.
For the local analysis scope, tools often aren't concerned with (or aware of) the larger context and might assign identifiers that are only unique in the local scope.
We would like the uniqueness criteria for objects in the ADC to be such that 1) there is no conflict in identifiers across different repositories and 2) the identifier can be used to resolve back to the specific object in the data repository.
F in FAIR says that (meta)data are assigned a globally unique and persistent identifier.
We can specify rules that apply uniformly to both scopes, or we can specify rules specific to each scope.

bcorrie commented 4 years ago

Just to clarify, are we talking formal DOI as in: https://www.doi.org/

I think at a minimum an AIRR compliant repository should have a formal DOI.

Beyond that, I am not sure how far down the DOI path we should go... It makes some sense to me that the study data for a specific study in a specific repository could use a DOI, but given that most studies, through their publication, would already have a DOI, this might be overkill. If we added a study_doi field to the metadata (for the publication DOI), that might cover it. If referring to the data in a specific study in the AIRR Data Commons, the combination of the Repository DOI and the Study DOI (findable in the study_doi field) would suffice.

My gut feeling is that going down the DOI path much further than that might be overkill (formal DOI generation requires a DOI provider), but certainly we could and possibly should use UUIDs as per https://tools.ietf.org/html/rfc4122.html for internal objects to provide a uniqueness criterion. They are easy to generate in that many languages have libraries that generate them...
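
To illustrate how little is needed, here is a minimal sketch using Python's standard-library uuid module, which generates RFC 4122 identifiers. The helper name new_object_id is made up for this example and is not part of any AIRR library.

```python
import uuid


def new_object_id() -> str:
    """Return a random (version 4) RFC 4122 UUID as a string."""
    return str(uuid.uuid4())


# Two calls give two different identifiers with overwhelming probability.
first = new_object_id()
second = new_object_id()
print(first, second)
```

Note that such a UUID is unique but not resolvable: nothing in the value says which repository holds the object.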

I would also note that the study_id field can be considered a unique identifier if the definition in the spec is followed, as per "Unique ID assigned by study registry", assuming the study registries assign non-overlapping IDs.

schristley commented 4 years ago

Just to clarify, are we talking formal DOI as in: https://www.doi.org/

No, just DOI in the context of the FAIR standard, which doesn't require the doi.org service to be used. The FAIR paper defines DOI this way:

DOI—Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

So I think using a URL (https://vdjserver.org/airr/v1/repertoire/abc) to access the data object itself would be sufficient.

I think at a minimum an AIRR compliant repository should have a formal DOI.

I think this might be worthwhile, but we should probably lump this into the discussion about the "registry" which the CRWG hasn't really defined yet...

My gut feeling is that going down the DOI path much further than that might be overkill.

Me too, so this isn't about digging a deeper hole. It's about how we are going to ensure that when you get an AIRR file (off the web, sent in email, supplemental file with an article, etc.), you can get back to the original object in the data repository.

I think of this as a provenance issue, but it is also a practical issue. I may give you a Cell file but say, use the DOIs in that file to download the rearrangements. Right now, the standard _ids aren't sufficient to do an HTTP request and download the data.

bussec commented 4 years ago

No, just DOI in the context of the FAIR standard, which doesn't require the doi.org service to be used.

To my understanding there is only one type of DOI and that's the one governed by doi.org. I agree that Wilkinson et al. describe it as if it were a generic term, but IMO it's not. Out of curiosity I just checked on the costs, and at 0.06 USD per DOI it would be feasible to create DOIs at least for study objects (fees can be found at https://www.crossref.org/fees/ ).

The advantage of a DOI vs a UUID is that it is clear how to resolve it. However, I don't know whether the record it resolves to is clearly defined. IMO it would not hurt if the data of a study has a separate DOI from the publication located at a publisher's site.

But I agree that we don't want to create DOIs for each single Rearrangement :-)

bcorrie commented 4 years ago

Is this something we need to resolve for ADC API v1?

bussec commented 4 years ago

Summarizing a discussion that @schristley, @bcorrie and I had via mail. Will probably not require any direct action, just putting it here for future reference:

The generic term for the feature we are looking for is "persistent identifier" (PID), of which the DOI would be a specific implementation. EOSC has its own sub-working group to address PID usage, which recently published a document on it [DOI:10.5281/zenodo.3574203]. In the document a PID is defined as:

globally unique
persistent
resolvable

The question is whether we really need all these features for all AIRR objects, i.e., how far would we go with PIDs, when would (non-resolvable) UUIDs come in handy, and where do we only need local uniqueness?

The four main levels that a PID could be applied to are:

Repository: It was already suggested by @bcorrie that AIRR-compliant repositories should have PIDs. iReceptor Public Archive and VDJServer already have DOIs through fairsharing.org. Whether this is a recommendation (to enhance citability) or a requirement (mandatory ID within a standard) is up for discussion. This will also depend on the way how (and whether) the subsequent PIDs will be resolved.
Study: In most cases there will be a DOI assigned to the related publication, however we need to keep in mind that this refers to a different object class, i.e., scholarly communication instead of the actual data set. This is usually sufficient for a human curator to quickly find the associated data sets, but this might not be true for a computer. BioProject IDs referring to a study can be considered to be PIDs (as the resolver is broadly known), but will refer to a record at INSDC, not in an AIRR repository.
Repertoire: While this is also a good candidate for a PID there are a couple of points to consider:
1. We clearly need PIDs for data sets of a given study. However, whether we need them at the level of Sample or on the level of Repertoire needs further discussion.
2. Repertoire currently does not have a strict definition (also see #361) and Repertoire objects can be generated dynamically during queries. This could lead to inflation of PIDs and it is questionable whether there is any added value to this.
3. As this is an AIRR-specific object, we need to find ways to mint and administer these PIDs.
Rearrangement: Most certainly not, as PIDs usually refer to a data set (e.g., a table), not the individual datum (e.g., a line within the table). Furthermore it is currently hard to see a use case for this as long as we have the possibility to create arbitrary and non-overlapping sets of rearrangements, i.e., repertoires.

schristley commented 4 years ago

Is this something we need to resolve for ADC API v1?

Probably not. We mainly need it for the new (experimental) objects like Clone, Cell, etc., so we can resolve it in concert with their release.

schristley commented 3 years ago

I've reviewed the W3C standard for decentralized identifiers, and it looks like it will work quite well for our purposes. I'm considering this standard just for the identifiers in the AIRR Data Model used to reference AIRR objects; external identifiers outside our control are handled with #464.

A decentralized identifier (DID) has a simple syntax consisting of three parts separated by colons:

did:example:identifier

where did is static and defines this as a decentralized identifier, example is called the DID method, and identifier is called the DID method-specific identifier. The DID spec places few limitations on the identifier part; we can even have additional colons in it if we want.
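
As a minimal sketch of that three-part syntax, the following Python splits a DID into its method and method-specific identifier. The names ParsedDID and parse_did are invented for illustration, and the check is deliberately loose: it does not validate the full grammar from the DID spec.

```python
from typing import NamedTuple


class ParsedDID(NamedTuple):
    method: str
    method_specific_id: str


def parse_did(did: str) -> ParsedDID:
    """Split a DID into its method and method-specific identifier."""
    # maxsplit=2 keeps any further colons inside the identifier part.
    parts = did.split(":", 2)
    if len(parts) != 3 or parts[0] != "did" or not parts[1] or not parts[2]:
        raise ValueError(f"not a DID: {did!r}")
    return ParsedDID(method=parts[1], method_specific_id=parts[2])


print(parse_did("did:example:identifier"))
print(parse_did("did:example:a:b:c"))
```

Splitting only on the first two colons is what leaves room for additional colons in the identifier part.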

The DID method is the key part. It is somewhat analogous to the first part of a CURIE. It's creating the unique namespace for the identifiers. Also, according to the spec, "a DID method defines how implementers can realize the features described by this specification". We need to define a DID method and SHOULD register it with the DID Registry. So in an odd twist, creating a decentralized identifier suggests registering in a central repository namespace... though it's not mandatory.

Anyways, my suggestion is we define and register the airr DID method. That is, all AIRR DIDs look like this:

did:airr:identifier

The DID spec talks a lot about verification, security, etc., but all of those capabilities are optional. The DID method must implement a number of functions for DID resolution and URL dereferencing, though the spec leaves it almost completely open for how the DID method does that. Conceptually I find this very similar to how we are resolving CURIE identifiers, and I believe we can implement much the same for DIDs.

What's left for us to consider is how to define the DID method-specific identifier. There is no requirement in the DID spec that the TYPE of resource, which the DID references, must be the same. So we could do something simple with just numbers, like this, but this doesn't provide us enough flexibility as we want identifiers to resolve to different repositories in the ADC.

did:airr:123
did:airr:124
did:airr:567

With the DID method as airr, that provides a global AIRR namespace, and it's up to us if we want to impose additional structure and sub-namespaces to it. My suggestion is that we define an additional repository sub-namespace:

did:airr:repository:identifier

Then DIDs look like this:

did:airr:ipa:123
did:airr:vdjserver:124
did:airr:ogrdb:567

but even this isn't quite complete, because does did:airr:vdjserver:124 refer to a Repertoire, a Rearrangement, or another AIRR object? If we want to do full URL dereferencing, like we are doing with CURIE, we need to know the type to know the proper ADC API end point to hit. Here's where I think we have options, one simple idea is to add another sub-namespace level that defines the type.

did:airr:repository:type:identifier

did:airr:ipa:repertoire:123
did:airr:vdjserver:repertoire:124
did:airr:vdjserver:germline_set:124
did:airr:ogrdb:germline_set:567

But it's equally valid to combine those two sub-namespaces together into one like so; we have complete control over the format.

did:airr:repository_and_type:identifier

did:airr:ipa_repertoire:123
did:airr:vdjserver_repertoire:124
did:airr:vdjserver_germline_set:124
did:airr:ogrdb_germline_set:567

Currently, I prefer the first option with two namespaces.

Hopefully now we can see how DIDs can be implemented. Like CURIEs, we have a resolution table in the AIRR schema that defines how ipa, vdjserver and ogrdb can be de-referenced into URLs.

did:airr:vdjserver:repertoire:124
==>
https://vdjserver.org/airr/v1/repertoire/124
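
A minimal sketch of that CURIE-style lookup in Python, assuming the two-namespace form did:airr:repository:type:identifier. The table and function names are hypothetical, and only the vdjserver entry is taken from the example above; base URLs for other repositories are deliberately left out rather than guessed.

```python
# Hypothetical resolution table: repository name -> base URL of its ADC API.
# Only the vdjserver entry comes from the example in this comment.
RESOLUTION_TABLE = {
    "vdjserver": "https://vdjserver.org/airr/v1",
}


def resolve_airr_did(did: str) -> str:
    """Turn did:airr:<repository>:<type>:<identifier> into an ADC API URL."""
    # maxsplit=4 lets the identifier itself contain further colons.
    parts = did.split(":", 4)
    if len(parts) != 5 or parts[0] != "did" or parts[1] != "airr":
        raise ValueError(f"not an AIRR DID of the expected form: {did!r}")
    _, _, repository, object_type, identifier = parts
    base_url = RESOLUTION_TABLE[repository]  # KeyError for unknown repositories
    return f"{base_url}/{object_type}/{identifier}"


print(resolve_airr_did("did:airr:vdjserver:repertoire:124"))
# -> https://vdjserver.org/airr/v1/repertoire/124
```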

bussec commented 3 years ago

In case we decide against decentralized identifiers, the URN service run by GEANT might be a potential way to coin PIDs without having to run the registry:

https://tools.ietf.org/html/rfc4926 https://wiki.geant.org/display/URN/Registry+Home

schristley commented 2 years ago

The dual usage/requirements for identifiers that link/reference AIRR objects within the AIRR Data Model continue to bite us. The dual usage being:

user-defined (or tool-assigned) identifier values that are locally consistent within files for running analysis tools and such.
globally unique identifier values for ADC objects that are FAIR.

For sequence_id, we agreed that the ADC can overwrite the identifier value with its own to provide a PID for the rearrangement record in the repository. This really isn't optimal because there are efforts around analysis reproducibility where we'd like to backtrack to the original sequence in a raw data file, and thus want that original sequence_id. This becomes more problematic with fields like cell_id and clone_id where annotation tools generate those ids and use them throughout multiple files/records. Overwriting the identifier in the ADC loses any linkage with data stored outside the ADC. Thus, I believe we really need to consider maintaining original identifier values (like we do with subject_id, sample_id) as separate from ADC PIDs.

An easy idea is to separate the fields, i.e. have *_pid fields which hold the ADC PID while the *_id fields hold the original user-defined (tool assigned) value. Yes, it creates more fields but at least the semantics for each are clear and precise. This seems like it can work, except for one key problem. What do tools do?

Today, tools assume that *_id are unique within the local context of data files, but there is a scenario where that breaks down: downloading multiple studies from the ADC, and then combining the multiple study data together. There's no guarantee that a clone_id or *_id from one study doesn't conflict with the *_id values from another study. However, if the tools used the *_pid fields instead, then they would get uniqueness. But that complicates tool logic: every time they want to use an identifier, they have to decide whether to use the *_pid or the *_id fields.

One could make the argument, like with subject_id and sample_id, that the uniqueness is only guaranteed within the study and/or within the Repertoire, and thus tools need to use compound keys, e.g. study_id, repertoire_id, data_processing_id, clone_id. Unfortunately, this doesn't completely solve the problem as the top-level objects (Repertoire, RepertoireGroup, DataProcessing) can still have conflicts. That is, if a tool assigns repertoire_id for local files, and the ADC assigns repertoire_pid, there's no guarantee repertoire_id is unique across studies, so we are back to the same problem, though maybe now there are fewer fields to consider, i.e. only the top-level AIRR objects.
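
A rough Python sketch of what the compound-key lookup would mean for a tool. The record values are invented, and the field order in KEY_FIELDS simply follows the example list above; nothing here is a defined AIRR rule.

```python
# Fields that together form the compound key, in the order listed above.
KEY_FIELDS = ("study_id", "repertoire_id", "data_processing_id", "clone_id")


def compound_key(record):
    """Build the tuple that identifies a clone across combined studies."""
    return tuple(record[field] for field in KEY_FIELDS)


def index_by_compound_key(records):
    """Index records by compound key, refusing silent overwrites."""
    index = {}
    for record in records:
        key = compound_key(record)
        if key in index:
            raise ValueError(f"duplicate compound key: {key}")
        index[key] = record
    return index


# Made-up records: the same local clone_id appears in two studies.
clones = [
    {"study_id": "S1", "repertoire_id": "r1", "data_processing_id": "dp1", "clone_id": "1"},
    {"study_id": "S2", "repertoire_id": "r1", "data_processing_id": "dp1", "clone_id": "1"},
]
index = index_by_compound_key(clones)
print(len(index))
```

Here only study_id keeps the two records apart. If two sources also reused the same top-level values, the index would raise, which is the remaining conflict described above.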

Another idea is to not have *_pid fields, but instead when the data is loaded into the ADC, the *_id fields are assigned the PID, and the original user-defined value is put in a *_original_id field. So it's still the idea of having separate fields. In this case though, tools can continue to assume *_id are unique, and the problem of combining data from multiple studies from the ADC is solved. The exception is if tools want to link ADC data with external data, it will have to know to use the *_original_id fields.

I don't see a solution that doesn't require having separate fields if we want to store both the original identifier value and the ADC PID. Any other ideas?

To summarize:

Have separate *_id and *_pid fields. Tools will need logic to use one or the other to do lookups.
Have separate *_id and *_pid fields, but rely upon the compound nature of the *_id to define scope, e.g. clone_id is only unique with a repertoire, study and data processing. Tools would need to use that compound nature when doing lookups. There would still be potential conflict for top-level AIRR objects, so tools will need logic to use *_id or *_pid for them.
Have separate *_id and *_original_id fields. ADC can overwrite *_id with PID and puts the original value in *_original_id. Tools can assume *_id is unique and links ADC objects, when linking to external non-ADC data, the *_original_id fields need to be used.
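
To make the third option concrete, here is a minimal Python sketch of what a repository might do at load time. The function names and the PID value are hypothetical; the only thing taken from the proposal is the *_id / *_original_id naming.

```python
def original_field_name(id_field: str) -> str:
    """Map e.g. 'clone_id' to 'clone_original_id'."""
    if not id_field.endswith("_id"):
        raise ValueError(f"not an *_id field: {id_field!r}")
    return id_field[: -len("_id")] + "_original_id"


def overwrite_with_pid(record: dict, id_field: str, pid: str) -> dict:
    """Return a copy where id_field holds the PID and the old value is kept."""
    loaded = dict(record)
    loaded[original_field_name(id_field)] = record[id_field]
    loaded[id_field] = pid
    return loaded


# Made-up local record and made-up PID, for illustration only.
local = {"clone_id": "17", "repertoire_id": "local-rep"}
loaded = overwrite_with_pid(local, "clone_id", "did:airr:vdjserver:clone:abc")
print(loaded)
```

Tools keep reading clone_id as before; only code that links back to non-ADC data needs to know about clone_original_id.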

bussec commented 2 years ago

My 2 cents on this:

I agree with the general idea of having separate fields
To maintain the original value is an important but less frequent use case. In addition, using the original identifiers for linkage to non-ADC data might be subject to further ambiguity (see below). Therefore I would give preference to an *_id/*_original_id solution.
Compound IDs that rely on understanding the structure of the AIRR Schema seem complex and potentially error-prone to me.
Are we sure that there will always be only a single original ID to store?

javh commented 2 years ago

*_id and *_original_id makes the most sense to me as well. But, this seems like a pretty specific use case that might be a job for a custom field.

Can I throw out a 4th option? What about some sort of provenance object to store these relationships? It'd be essentially the same thing as *_original_id, but stored in a separate table. It would also lend itself to lists of original identifiers, links to DataProcessing for how the change was made, etc.

bcorrie commented 2 years ago

The problem with the _original_id concept is that the field names in the source files will be of the _id form. 10X produces data with clone_id and cell_id in their files (same with Immcantation etc., no?), and these naturally map to the field names in the spec. It seems more natural to me to maintain those fields as they are in the original data from the annotation tool and to have a new, specific field that more explicitly states what it is. For example, if it is truly a persistent ID as per the FAIR PID definition, then maybe it should be _pid, but if it is "just" a CURIE that turns the ID in the repository into something that is globally unique, maybe it should be a _gid for global identifier?

bcorrie commented 2 years ago

I think the question of having globally unique identifiers for objects in ADC repositories and managing provenance and how such globally unique objects are related to each other are two different topics, no?

bcorrie commented 2 years ago

did:airr:repository:type:identifier
did:airr:vdjserver:repertoire:124
==>
https://vdjserver.org/airr/v1/repertoire/124

BTW, I like this structure because the vdjserver part maps directly to the servers objects in the OpenAPI 3.0 spec. In that way, the DID structure and the OpenAPI server and path objects map nicely.

schristley commented 2 years ago

Are we sure that there will always be only a single original ID to store?

Maybe yes. If a program creates a compound ID with multiple fields that are unknown or with a different scope from AIRR, then those files won't work properly with AIRR tools. That may be okay so long as the program doesn't say it is AIRR TSV compliant, that is, the file is custom to that program. I think it's still feasible to use the same technique, but you have to know that compound ID existed and thus save the other fields in a *_original_id as well.

schristley commented 2 years ago

*_id and *_original_id makes the most sense to me as well. But, this seems like a pretty specific use case that might be a job for a custom field.

I'm not sure I understand what you mean by a custom field? I guess I was thinking of *_original_id as custom fields, just with defined names, so you could systematically recover the original values.

schristley commented 2 years ago

To maintain the original value is an important but less frequent use case.

I tend to agree. As I would expect that when data is loaded into the ADC, any additional data files get essentially lost, they are local files on somebody's hard drive and not accessible to the public. I would be tempted to just say, don't keep the original values, except that I know one exception, which is VDJServer does keep extra files from its jobs, and those files can be made publicly accessible (and with a DataProcessing identifier so they can be linked with the ADC data).

Now, I'm also okay with not defining *_original_id as part of the AIRR schema, and just do something specific for VDJServer. Maybe that's what Jason was referring to about custom fields? If we think this is a rare use case, and even just specific to VDJServer, maybe we don't define *_original_id?

bcorrie commented 2 years ago

One high level comment, before CRWG tomorrow...

I feel that _id fields in the AIRR spec should not be changed. They are typically either provided by a data curator (e.g. study_id, subject_id) or produced by an analysis/annotation tool (e.g. cell_id and clone_id). In these latter cases, the _id fields are used to reference entities in other AIRR spec objects (cell_id in Rearrangement links to cell_id in Cell). These already have uniqueness criteria defined for them (or they should) relative to each other. It seems wrong, confusing, and dangerous to go around changing such IDs because we want to combine two data sets for some purpose.

If we need IDs that have a broader uniqueness criteria such as an ID field that is either persistent or globally unique (such as that required by the ADC), then it seems to me that we should define a specific ADC field that encompasses that uniqueness criteria.

I think we should bite the bullet now and make these globally unique identifiers so we don't face this problem again down the road. Maybe even having the ability for those globally unique identifiers to be persistent identifiers. That way they are unique across the ADC, but are also unique across any analyses that one might want to do on any type of AIRR-seq data. If tools need to combine data sets from disparate sources they don't have to worry about munging _id fields to make their analyses work, they know to use the globally unique identifiers. AIRR compliant software tools would use such global identifiers when processing AIRR-seq data - NOT the potentially conflicting local IDs that are created by either a data curator or the local run of an annotation or analysis tool.

I suspect it would be quite easy to write some utility tools (the AIRR libraries) to add such globally unique identifiers to an AIRR data file, once we decide on which such objects require global identifiers...

javh commented 2 years ago

@schristley:

Now, I'm also okay with not defining *_original_id as part of the AIRR schema, and just do something specific for VDJServer. Maybe that's what Jason was referring to about custom fields? If we think this is a rare use case, and even just specific to VDJServer, maybe we don't define *_original_id?

Yeah, that's what I meant by "custom field".

@bcorrie:

I think we should bite the bullet now and make these globally unique identifiers so we don't face this problem again down the road. Maybe even having the ability for those globally unique identifiers to be persistent identifiers. That way they are unique across the ADC, but are also unique across any analyses that one might want to do on any type of AIRR-seq data. If tools need to combine data sets from disparate sources they don't have to worry about munging _id fields to make their analyses work, they know to use the globally unique identifiers. AIRR compliant software tools would use such global identifiers when processing AIRR-seq data - NOT the potentially conflicting local IDs that are created by either a data curator or the local run of an annotation or analysis tool.

That would be a huge change, as the purpose of the current *_id fields is to act as row keys. So we'd be asking tool developers to change their implementations by redefining the current *_id fields as not-keys and then adding a new key field. I can't envision researchers complying fully with globally unique identifiers, so you'll still have to verify/munge incoming *_pid values if we add these fields. Yes?

I can see the value in maintaining a mapping to the original ids for provenance and revisiting some QC/biology questions (eg, retaining barcode sequences). But, what's the danger in changing the values in the *_id fields? Making sure all these cross-references stay consistent? Would that need to be done regardless?

schristley commented 2 years ago

One high level comment, before CRWG tomorrow...

I feel that _id fields in the AIRR spec should not be changed.

Don't muddle the intent of the _id fields. Those fields were ALWAYS designed to be the linkage between AIRR objects in the AIRR Data Model. It was never necessary that some identifiers, created by tools, have to be "wedged" into those fields. Remember, the flow goes one way: AIRR defines a standard, and the tools conform. A tool can produce its own file format that conforms to its own specifications, but if the tool produces an AIRR format file, it has to conform to the AIRR Standards.

The exception being a few fields like subject_id and so on, which have been encapsulated by repertoire_id

These already have uniqueness criteria defined for them (or they should) relative to each other.

I would argue that except for repertoire_id (and maybe partially sequence_id), the other AIRR _id fields have not had their uniqueness criteria defined, except in draft form (which we are presumably allowed to change because it's not a published standard).

I think we should bite the bullet now and make these globally unique identifiers so we don't face this problem again down the road.

I agree.

schristley commented 2 years ago

@schristley:

Now, I'm also okay with not defining *_original_id as part of the AIRR schema, and just do something specific for VDJServer. Maybe that's what Jason was referring to about custom fields? If we think this is a rare use case, and even just specific to VDJServer, maybe we don't define *_original_id?

Yeah, that's what I meant by "custom field".

Okay, cool, I'm totally good with that. AIRR doesn't define *_original_id fields; I can do something custom for VDJServer if I like. That simplifies things a lot because we can toss out the general idea of having separate fields. We can just have *_id and allow the ADC to overwrite local identifier values with GIDs or PIDs as with sequence_id.

schristley commented 2 years ago

I can see the value in maintaining a mapping to the original ids for provenance and revisiting some QC/biology questions (eg, retaining barcode sequences). But, what's the danger in changing the values in the *_id fields? Making sure all these cross-references stay consistent? Would that need to be done regardless?

Just to be more explicit about what I'm suggesting. Tools that create AIRR format files can assign values to the AIRR _id fields such that those identifiers are unique within a local context. Tools don't have to automatically create GIDs or PIDs, but they can if they want.

When that data is loaded into the ADC, the data repository is allowed to (and should) overwrite the AIRR _id fields with values that are GIDs or PIDs. Ensuring that the cross-references stay consistent and that those identifiers won't conflict with data in other repositories of the ADC. I personally would prefer they also be PIDs, with the ability to be resolved back to that object in the AIRR data repository.
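
A minimal Python sketch of the "cross-references stay consistent" part, assuming the repository builds one mapping per load and applies it to every object that carries the field. The function name is invented, random UUIDs stand in for whatever GID or PID scheme a repository would really use, and the sequence_id values are made up.

```python
import uuid


def remap_cell_ids(cells, rearrangements):
    """Replace local cell_id values with new global ones in both record sets."""
    # One mapping, built once, is what keeps the cross-references consistent.
    mapping = {cell["cell_id"]: str(uuid.uuid4()) for cell in cells}
    new_cells = [{**cell, "cell_id": mapping[cell["cell_id"]]} for cell in cells]
    # A rearrangement pointing at an unknown cell_id raises KeyError here.
    new_rearrangements = [
        {**rearrangement, "cell_id": mapping[rearrangement["cell_id"]]}
        for rearrangement in rearrangements
    ]
    return new_cells, new_rearrangements


cells = [{"cell_id": "AAACCTGAGAAACCAT-1"}]
rearrangements = [
    {"sequence_id": "seq1", "cell_id": "AAACCTGAGAAACCAT-1"},
    {"sequence_id": "seq2", "cell_id": "AAACCTGAGAAACCAT-1"},
]
new_cells, new_rearrangements = remap_cell_ids(cells, rearrangements)
print(new_cells, new_rearrangements)
```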

schristley commented 2 years ago

did:airr:repository:type:identifier
did:airr:vdjserver:repertoire:124
==>
https://vdjserver.org/airr/v1/repertoire/124

BTW, I like this structure because the vdjserver part maps directly to the servers objects in the OpenAPI 3.0 spec. In that way, the DID structure and the OpenAPI server and path objects map nicely.

I should make the point that the resolution process I described isn't really how DID specifies it should work. What I described is more like a CURIE hack. I wrote up more detail about what DID really does in #563.

The important thing to get out of DID is that you (code, tool, etc.) do not construct a URL like with CURIE to resolve to the object. Instead, it is a 2-step process. You (code, tool, etc.) construct a URL to a "DID Method", then in the response, there is a URL that resolves to the object.
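
A rough Python sketch of the 2-step shape, with the network call replaced by a stub so it runs on its own. Everything specific here is an assumption for illustration: the function names, the "AdcApi" service type, and the choice to publish the object URL as a serviceEndpoint in the response. How an airr DID method would actually respond is not defined anywhere yet.

```python
from typing import Callable, Dict


def resolve_in_two_steps(did: str, fetch_did_document: Callable[[str], Dict]) -> str:
    """Resolve a DID by first asking its DID method, then reading the URL."""
    # Step 1: ask the DID method for its response about this DID.
    document = fetch_did_document(did)
    # Step 2: take the object's URL out of the response (assumed layout).
    for service in document.get("service", []):
        endpoint = service.get("serviceEndpoint")
        if isinstance(endpoint, str):
            return endpoint
    raise LookupError(f"no service endpoint found for {did}")


def fake_airr_method(did: str) -> Dict:
    """Stand-in for a real resolver; returns a canned response."""
    return {
        "id": did,
        "service": [
            {
                "id": f"{did}#adc",
                "type": "AdcApi",  # hypothetical service type
                "serviceEndpoint": "https://vdjserver.org/airr/v1/repertoire/124",
            }
        ],
    }


url = resolve_in_two_steps("did:airr:vdjserver:repertoire:124", fake_airr_method)
print(url)
```

The point of the indirection is that the client never hard-codes how an identifier maps to a URL; the DID method can change that mapping without breaking existing identifiers.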

javh commented 2 years ago

When that data is loaded into the ADC, the data repository is allowed to (and should) overwrite the AIRR _id fields with values that are GIDs or PIDs. Ensuring that the cross-references stay consistent and that those identifiers won't conflict with data in other repositories of the ADC. I personally would prefer they also be PIDs, with the ability to be resolved back to that object in the AIRR data repository.

This makes total sense to me. Sanitizing and standardizing existing fields upon import seems like the cleanest approach (pun intended).

scharch commented 2 years ago

@javh @schristley just to be clear, this includes overwriting/standardizing cell_id (#574) as well?

bcorrie commented 2 years ago

@javh @schristley just to be clear, this includes overwriting/standardizing cell_id (#574) as well?

I don't like overwriting the existing fields... Or at least before you state that is the best way forward, let's define which _id fields you think it is OK to overwrite...

bcorrie commented 2 years ago

Think of what the user would have to do for data processing - totally outside of the ADC. The ADC will solve this problem for the user if the user has ADC data (which would have global identifiers), but it doesn't help if they have non-ADC (and therefore non globally identified) AIRR compliant output files.

This is a more general problem than just the ADC, so how do the AIRR standard and format help them in general with performing multi-study analyses?

So right now if a user wants to compute on two data sets for which they have no idea whether the _id fields are unique, in order for the tools to work the user is required to pre-process all of the data, change all of the _id fields to something that ensures they are unique. In the 10X case, if doing this for cell_id, they have to ensure that they change cell_id in exactly the same way in (for one subject from one of the two data sets only):

HC1/vdj_b/airr_rearrangement.tsv
HC1/vdj_t/airr_rearrangement.tsv
HC1/vdj_b/cell_barcodes.json
HC1/vdj_t/cell_barcodes.json
HC1/count/sample_feature_bc_matrix/barcodes.tsv

As the user I probably missed some files...

In each case a cell_id like this: AAACCTGAGAAACCAT-1 in any of the files becomes: AAACCTGAGAAACCAT-1-DATASET1

So they write some "rock star" sed/awk script to do this...

Repeat for every subject.

Then do it all over again for the second data set, using the suffix "DATASET2"

Repeat for clone_id across all subjects in both studies.
Repeat for any other _id across all subjects in both studies.

If the user somehow can do all of that without screwing up, the analysis tool of choice can then be run on the two data sets with guaranteed unique IDs (at least across these two studies) to actually do some work...

Repeat for every user that ever wants to compare two such data sets - including re-writing the sed/awk script. Now that sounds like a recipe for disaster. 8-)
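
For a sense of what each user ends up writing, here is a minimal Python stand-in for that sed/awk script, handling a single TSV. The function name and the sequence_id value are made up, and this covers only the TSV files; the JSON barcode files would need their own handling, which is part of the problem.

```python
import csv
import io


def suffix_column(tsv_text: str, column: str, suffix: str) -> str:
    """Append '-<suffix>' to every non-empty value in one TSV column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if row[column]:  # leave empty values empty
            row[column] = f"{row[column]}-{suffix}"
        writer.writerow(row)
    return out.getvalue()


example = "sequence_id\tcell_id\nseq1\tAAACCTGAGAAACCAT-1\n"
result = suffix_column(example, "cell_id", "DATASET1")
print(result)
```

The same suffix has to be applied, identically, to every file that mentions the cell, and nothing checks that a file was not missed.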

The ADC will solve this for the user IF and ONLY IF they get the data out of the ADC. In the above context, the ADC would have replaced those IDs with globally unique IDs and the user wouldn't have to do anything. How can the AIRR Community help with this problem in general?

bcorrie commented 2 years ago

I know, TL;DR. Bottom line is I think this is much easier to solve if you simply add a field called cell_gid (or cell_pid), clone_gid (or clone_pid), etc. for whichever fields you feel need global identifiers.

bcorrie commented 2 years ago

That would be a huge change, as the purpose of the current *_id fields is to act as row keys. So we'd be asking tool developers to change their implementations by redefining the current *_id fields as not-keys and then adding a new key field.

Really? Is it that hard for a tool to take an argument that specifies the key used to identify a unique cell? In particular, if cell_id has a global twin cell_gid, then the tool can choose to use one or the other (use cell_gid if it exists, cell_id if not). For an arbitrary and imaginary UMAP generation tool for some single-cell format file...

do_umap --cell-key=cell_id input.tsv
do_umap --cell-key=cell_gid input_gid.tsv
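A minimal sketch of how such a tool could choose its key, in Python. The tool name `do_umap`, the `--cell-key` option, and the `cell_gid` field are all hypothetical and come from the imaginary example above; the helper names are likewise made up for illustration.

```python
import argparse


def parse_args(argv):
    """Parse the imaginary do_umap command line."""
    parser = argparse.ArgumentParser(prog="do_umap")
    parser.add_argument("--cell-key", default=None,
                        help="column used to identify a unique cell")
    parser.add_argument("input")
    return parser.parse_args(argv)


def pick_cell_key(fieldnames, requested=None):
    """Use the requested key if given, otherwise cell_gid if present, else cell_id."""
    if requested is not None:
        if requested not in fieldnames:
            raise KeyError(f"requested key {requested!r} not found in input")
        return requested
    for candidate in ("cell_gid", "cell_id"):
        if candidate in fieldnames:
            return candidate
    raise KeyError("input has neither cell_gid nor cell_id")


args = parse_args(["--cell-key=cell_gid", "input_gid.tsv"])
print(pick_cell_key(["sequence_id", "cell_id", "cell_gid"], args.cell_key))
```

The rest of the tool then treats whichever column was picked as the cell key and does not need to know whether the values are local or global.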

I can't envision researchers complying fully with globally unique identifiers, so you'll still have to verify/munge incoming *_pid values if we add these fields. Yes?

No - because the spec says they need to be globally unique. So no munging required. If they follow the spec it is unique. If they don't it isn't AIRR compliant!

If you are worried about making it easy for researchers to create globally unique identifiers, you could have a tool in the AIRR library that adds a globally unique identifier (gid) for a field:

airr_generate_gid --original-key=cell_id --global-key=cell_gid HC1/vdj_b/airr_rearrangement.tsv HC1/vdj_t/airr_rearrangement.tsv HC1/vdj_b/cell_barcodes.json HC1/vdj_t/cell_barcodes.json HC1/count/sample_feature_bc_matrix/barcodes.tsv
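One way such a generator might work, as a sketch only: no `airr_generate_gid` tool exists in the AIRR library, and the UUID-based scheme, the function names, and the `dataset_uri` parameter below are assumptions for illustration. The idea is to derive the gid deterministically from a dataset identifier plus the local ID, so the same cell gets the same gid in every file of the data set and the original `cell_id` is preserved.

```python
import uuid


def make_gid(dataset_uri, local_id):
    """Derive a stable identifier from a dataset URI and a locally unique ID."""
    # The dataset URI is hashed into a namespace, so two data sets that share
    # a local ID produce different gids as long as their URIs differ.
    namespace = uuid.uuid5(uuid.NAMESPACE_URL, dataset_uri)
    return str(uuid.uuid5(namespace, local_id))


def add_gid(rows, original_key, global_key, dataset_uri):
    """Yield copies of rows with global_key added; original_key is left untouched."""
    for row in rows:
        new_row = dict(row)
        new_row[global_key] = make_gid(dataset_uri, row[original_key])
        yield new_row


rows = [{"sequence_id": "seq1", "cell_id": "AAACCTGAGAAACCAT-1"}]
# "https://example.org/studies/DATASET1" is a placeholder URI, not a real resource.
for row in add_gid(rows, "cell_id", "cell_gid", "https://example.org/studies/DATASET1"):
    print(row["cell_id"], row["cell_gid"])
```

The result is only as unique as the dataset URI that is passed in, so whoever runs the tool still has to supply something that identifies the data set unambiguously.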

The AIRR Standard does all the heavy lifting, the tool developer needs to do some light lifting, the ADC and the repositories would generate cell_id and cell_gid automatically. The user doesn't really have to do anything other than maybe run airr_generate_gid on a bunch of non-giddy data 8-)

schristley commented 2 years ago

According to the current schema with the draft objects, I consider these to be the AIRR _id fields:

repertoire_id (only one with a documented global uniqueness requirement)
repertoire_group_id
data_processing_id
sample_processing_id
sequence_id
clone_id
tree_id
cell_id
allele_description_id, gene_set_id (note: the germline ids are a bit funky because they weren't initially defined in the context of the AIRR Data Model, but we can fix that)
receptor_genotype_set_id, receptor_genotype_id, mhc_genotype_set_id, mhc_genotype_id (note: same note as for the germline ids)

scharch commented 2 years ago

@bcorrie I disagree on a few levels:

Most importantly is @javh's point about what is considered a key

The problem with your scenario is that you are correct: the poor hapless user should not be doing this himself, the tool should be doing it. SONAR has long had a rudimentary version of this, and it's on my (long, long) list of features to build out more fully.

Ultimately, though, the correct way to go about this, whether with ADC data or local data, is to create a RepertoireGroup, which will explicitly handle not only the id issues, but the provenance one, as well.

bcorrie commented 2 years ago

According to the current schema with the draft objects, I consider these to be the AIRR _id fields:

The only one of those that we currently overwrite is sequence_id, and that ID is the only one that is NOT used as a linking ID into other objects. I think the repercussions and side effects of choosing to overwrite all those others need to be carefully considered...

bcorrie commented 2 years ago

The problem with your scenario is that you are correct:

If only that was the usual case... 8-)

javh commented 2 years ago

I don't see putting a pid in cell_id as more work than putting a pid in cell_pid; the latter just adds an extra field for the same amount of effort.

You'll have no idea whether cell_pid is globally unique before looking at the values in it either. You'll have to do a compliance check and fix the field if it's not a pid, which you can also do on the _id field. IIRC, Olmsted is a good example: they have both original sequence id and uuid fields, but I don't know if the uuid would be globally unique in the ADC context, so you'd have to check.

The _id fields have a pseudo-enforcement mechanism to compel compliance right now, in that your analyses are likely to break if the ids aren't locally unique (bad practice ~ bad result). There's nothing really to suggest compliance for pids except adherence to the letter of the standard. Maybe that's too esoteric, but I think we want to keep the "spreadsheet analyst" in mind. You will get bad values in the _pid fields and will need a solution for that. And, to me, that seems like the solution would be the same regardless of whether it's modifying an _id or _pid field.

bcorrie commented 2 years ago

Most importantly is @javh's point about what is considered a key

Yes, but it can only be used as a key across multiple studies if it is made unique across those studies.

This ONLY helps if cell_id is ALWAYS guaranteed to be unique globally. That is what I am suggesting cell_gid is. cell_gid is a globally unique identifier by definition. If it isn't globally unique it shouldn't be set. Period. End of story.

If we don't have cell_gid and we only have cell_id but it is only sometimes globally unique (when it comes out of an ADC repository), you can NEVER be sure that you can use it as a global identifier... If it isn't guaranteed to be globally unique, then you can NEVER be sure that you can compare two studies from two different sources without generating unique IDs between them. So we are back to square one except within the ADC.

If cell_gid is there, and is guaranteed to be globally unique, ANY software tool can use it as they would a unique identifier for a cell no matter which data sets they are processing. The user and the tool developer are guaranteed that there are no Cell ID conflicts between the two data sets.
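To make the conflict concrete, here is a small Python sketch of the check a user or tool has to run before combining two data sets whose IDs are only locally unique. The function name `shared_ids` and the sample records are made up for illustration.

```python
def shared_ids(rows_a, rows_b, key):
    """Return the set of non-empty key values that occur in both data sets."""
    ids_a = {row[key] for row in rows_a if row.get(key)}
    ids_b = {row[key] for row in rows_b if row.get(key)}
    return ids_a & ids_b


dataset1 = [{"cell_id": "AAACCTGAGAAACCAT-1"}, {"cell_id": "AAACCTGAGAAGGTTT-1"}]
dataset2 = [{"cell_id": "AAACCTGAGAAACCAT-1"}, {"cell_id": "TTTGTCATCTTGAGGT-1"}]

conflicts = shared_ids(dataset1, dataset2, "cell_id")
if conflicts:
    # Without a globally unique field, these would be merged as the same cell.
    print(f"{len(conflicts)} cell_id value(s) collide between the two data sets")
```

An empty result only says these two data sets do not collide; it says nothing about the next data set that gets added to the comparison.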

bcorrie commented 2 years ago

I don't see putting a pid in cell_id as more work than putting a pid in cell_pid; the latter just adds an extra field for the same amount of effort.

I can think of 5 Billion reasons why it is harder to override cell_id!

schristley commented 2 years ago

According to the current schema with the draft objects, I consider these to be the AIRR _id fields:

The only one of those that we currently overwrite is sequence_id, and that ID is the only one that is NOT used as a linking ID into other objects.

Except it does. The current clone object has a list of sequence_id's for the clone members. :-D

schristley commented 2 years ago

If cell_gid is there, and is guaranteed to be globally unique, ANY software tool can use it as they would a unique identifier for a cell no matter which data sets they are processing.

But who is guaranteeing this? In your example of a tool doing multi-study analysis for non-ADC AIRR files, who's setting cell_gid with a unique value?

scharch commented 2 years ago

SONAR has long had a rudimentary version of this

cellranger aggr does it, too...

scharch commented 2 years ago

This ONLY helps if cell_id is ALWAYS guaranteed to be unique globally.

Nope! I could give a fig if cell_id (or any other *_id) is unique globally; I only care if it's unique in the context of my analysis. That's why, to my knowledge, none of these tools worry about GIDs or PIDs in the first place. Even when combining analyses, _ids still only need to be unique in the context of the (broader) analysis. The ADC aspires, more or less, to contain everything, and hence needs GID/PIDs. But as long as my chosen tool generates a RepertoireGroup with locally unique (meta) ids, I'm all set.

scharch commented 2 years ago

I don't see putting a pid in cell_id as more work than putting a pid in cell_pid; the latter just adds an extra field for the same amount of effort.

I can think of 5 Billion reasons why it is harder to override cell_id!

I assume the thought here is that adding a field won't entail updating all of the cross-references, but it probably should... Adding a PID to the cell object alone wouldn't seem to be terribly helpful, I don't think. Though now I suddenly understand your comment here better....

bcorrie commented 2 years ago

The only one of those that we currently overwrite is sequence_id, and that ID is the only one that is NOT used as a linking ID into other objects.

Except it does. The current clone object has a list of sequence_id's for the clone members. :-D

OK, so it is worse than I thought... 8-)

javh commented 2 years ago

If sequence_id doesn't contain a globally unique identifier in the ADC, then what purpose does it serve? If we add sequence_pid then what's the use case within the ADC for sequence_id?

And if sequence_id doesn't contain a globally unique identifier within the ADC, then is the data in the ADC actually AIRR compliant? I think technically yes, as long as sequence_id is unique within the given Repertoire. But, in the spirit of the standard I think no, because sequence_id is not actually unique within its use context, because the ADC context is everything.

bcorrie commented 2 years ago

This ONLY helps if cell_id is ALWAYS guaranteed to be unique globally.

Nope! I could give a fig if cell_id (or any other *_id) is unique globally; I only care if it's unique in the context of my analysis. That's why, to my knowledge, none of these tools worry about GIDs or PIDs in the first place. Even when combining analyses, _ids still only need to be unique in the context of the (broader) analysis. The ADC aspires, more or less, to contain everything, and hence needs GID/PIDs. But as long as my chosen tool generates a RepertoireGroup with locally unique (meta) ids, I'm all set.

Yes, but this means that somehow YOU have to have a mechanism to generate those locally unique (meta) ids for YOUR analysis. So you are taking a bunch of repertoires grouped in some way (by RepertoireGroup), reprocessing them to ensure that you don't have any conflicts on any of the non-unique IDs that you care about so that they are unique for your analysis. Recall Scott had a list of 14 that might be of interest - you care about cell_id and repertoire_id, I care about clone_id, cell_id, repertoire_id, and sequence_id.

Finally, because you have to assume that the _id you care about is NOT unique across data sets, you will ALWAYS have to generate unique IDs every time you do an analysis.

So who does this unique ID processing??? You as the user either do that through some processing or the tool becomes intelligent enough to do that for you (as you are suggesting). In the first case, every user has to have this ability and do this for every analysis (a nightmare if you ask me - but the way it is today) or in the second case, every tool has to have this ability (a challenge for all of the tool builders but easy on the user).

What I am suggesting is that we have a field that is designed to take this load off of both the user and the tool developer. You don't care that you have globally unique IDs, true, but you do care that they are unique within your local analysis. What you want is a subset of what globally unique IDs provide. So I am suggesting we have both _id and _gid fields that provide this. If you want to call this a Unique ID (_uid) rather than a Global ID (_gid) so that it is more appropriate to your context, that is fine.

If _gid is set, the user AND the tool have to do NO extra work - you can just do your analysis with the tool using the _gid field. This would be the case if you were comparing any data in the ADC.
If you don't have a _gid then you are back in the dark ages. 8-) Without a _gid, you (or the tool you use) have to process all the _id fields and overwrite those fields to generate unique IDs and then do the processing. I am suggesting that adding a _gid field helps with this - as the data processing you are doing is data preserving (you aren't getting rid of data)...
If you do have a _gid but that _gid isn't set, you don't overwrite the old field, you generate a new value in the _gid field (you have both _id and _gid) so you are not losing any information. But note you now have uniqueness for your analysis.
Finally, in order to take a step out of the dark ages for the user, I am suggesting that it would not be too hard for the AIRR Community to develop an AIRR tool (as part of the AIRR library) that, given a set of data (possibly defined by a RepertoireGroup), generates _gid from _id. Because _id and _gid would be part of the AIRR standard, this transformation is clear. So if you do end up with a data set for comparison that doesn't have _gids set, then it is easy for the user to generate them for their data set. The tools don't have to do anything other than be told which ID key to use for the analysis (_id or _gid).

bcorrie commented 2 years ago

And if sequence_id doesn't contain a globally unique identifier within the ADC, then is the data in the ADC actually AIRR compliant? I think technically yes, as long as sequence_id is unique within the given Repertoire. But, in the spirit of the standard I think no, because sequence_id is not actually unique within its use context, because the ADC context is everything.

I think this is the whole point of this discussion. sequence_id is defined in some context (just like cell_id) and in my opinion it is not the role of the ADC to replace information provided by an annotation tool or an analysis pipeline. In fact, I believe this to be a bad thing from a data provenance perspective. If we need a globally unique identifier for an object (like a Cell or a Rearrangement) then we should have a specific field for that global identifier. We should not be hijacking fields that come from annotation tools for this purpose...

javh commented 2 years ago

@bcorrie, If the only use case for the _id field is provenance, then I think it makes more sense to devise a schema for provenance rather than redefine existing fields to a new purpose. The _id fields are meant to be unique keys, not immutable labels or provenance trackers.

This reads to me like deprecating sequence_id and replacing it with a new field called sequence_gid that does the exact same thing as the old field for no other reason than preserving a sequence label a user uploaded. I don't see how we make an argument to tool developers (igblast, IMGT, cellranger, etc) that they spend their time adding support for a redundant identifier field.

SRA replaces all sequence ids with a globally unique identifier upon upload (SRR<X>.<Y>) and I have never once cared what the original identifier was after downloading data from SRA.

I think we might be coming at it from the wrong angle. What's the problem that needs to be solved by preserving the original identifiers? Maybe there's a more direct solution to that?

scharch commented 2 years ago

@bcorrie I don't think it's that complicated for tools that are creating RepertoireGroups to handle unique-ing of the _ids. But arguendo let's take it as a Thing To Avoid and look at your scenarios:

1. _gid is present and set: @javh and @schristley expressed some skepticism above that this could really be expected/enforced, but if generation of GIDs is included in the AIRR reference library, I could maybe see it.
2. _gid is present but not set: I don't see this as being fundamentally different than the "Dark Ages" scenario. As I mentioned above, I don't think it's as simple as just generating a new value - what good is a cell_gid in the Cell object if my Rearrangements are still referencing a non-unique cell_id? Or a rearrangement_gid when Clone still uses rearrangement_id? Even if you're not overwriting anything (which may be a marginal benefit) you still have to do the same amount of work.

So having _gid be optional doesn't end up preventing the Thing To Avoid - even the ADC would have to (essentially) do the Thing To Avoid at data intake as described under scenario 2.
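A minimal Python sketch of the work involved in that second scenario, assuming hypothetical in-memory Cell and Rearrangement records linked by cell_id. The function name `add_cell_gids` and the `make_gid` callback are made up for illustration, and `cell_gid` is the proposed field, not one in the current schema.

```python
def add_cell_gids(cells, rearrangements, make_gid):
    """Add cell_gid to Cell records and to every Rearrangement that references them."""
    # One gid per local cell_id, generated once so all references agree.
    mapping = {cell["cell_id"]: make_gid(cell["cell_id"]) for cell in cells}

    new_cells = [dict(cell, cell_gid=mapping[cell["cell_id"]]) for cell in cells]

    # Each referencing record needs the same lookup; a rearrangement with no
    # cell, or with an unknown cell_id, gets None.
    new_rearrangements = [
        dict(r, cell_gid=mapping.get(r.get("cell_id"))) for r in rearrangements
    ]
    return new_cells, new_rearrangements


cells = [{"cell_id": "AAACCTGAGAAACCAT-1"}]
rearrangements = [
    {"sequence_id": "seq1", "cell_id": "AAACCTGAGAAACCAT-1"},
    {"sequence_id": "seq2", "cell_id": ""},
]
new_cells, new_rearrangements = add_cell_gids(
    cells, rearrangements, lambda cell_id: f"DATASET1:{cell_id}"
)
print(new_rearrangements[0]["cell_gid"])
```

Nothing is overwritten here, but every object type that links to a cell still has to be visited, which is the same traversal needed when rewriting cell_id in place.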

All this implies that _gid would have to be a required field. But if it's a required field, there's no reason to maintain both _id and _gid. Better to just redefine _id to require it to contain a GID (or maybe a PID or (U)UID? this stuff makes my head spin). That would, of course, be a major change to the schema, and one that I am pretty skeptical would find much compliance.

Which, I think, means that maybe expecting tools to handle unique-ification of _ids on a somewhat ad hoc basis is the least bad solution after all...

scharch commented 2 years ago

This reads to me like deprecating sequence_id and replacing it with a new field called sequence_gid that does the exact same thing as the old field for no other reason than preserving a sequence label a user uploaded. I don't see how we make an argument to tool developers (igblast, IMGT, cellranger, etc) that they spend their time adding support for a redundant identifier field.

@javh beat me to it

SRA replaces all sequence ids with a globally unique identifier upon upload (SRR<X>.<Y>) and I have never once cared what the original identifier was after downloading data from SRA.

exactly

schristley commented 2 years ago

We should not be hijacking fields that come from annotation tools for this purpose...

This is what I disagree with. @bcorrie seems to be making the statement that tools and data curators OWN these fields, but that's incorrect. AIRR owns these fields. These fields wouldn't exist if AIRR didn't define them. Instead each tool would have their own custom non-standardized fields, which was the wild west days before the AIRR standards.

If AIRR defines those fields, then AIRR gets to define what is and isn't allowed for those fields. There is no reason that AIRR has to support annotation tools hijacking those fields for their own purpose. The tools either conform and thus are AIRR-compliant, or they don't, in which case they cannot call them AIRR-compliant files.

bcorrie commented 2 years ago

This is what I disagree with. @bcorrie seems to be making the statement that tools and data curators OWN these fields, but that's incorrect. AIRR owns these fields. These fields wouldn't exist if AIRR didn't define them. Instead each tool would have their own custom non-standardized fields, which was the wild west days before the AIRR standards.

Yes, but the main purpose of the AIRR fields is to standardize the data produced by those tools so that the data is FAIR (or IR - interoperable and reusable). Most of the fields in the AIRR Standard are fields that are either designed to standardize study metadata or standardize fields that are produced by sequencing and annotation so that the data is indeed interoperable and reusable - as you say. A very small number of them are identifiers that link the AIRR Standard entities.

So that was careless phrasing on my part. I should have said "we should not be hijacking an AIRR field that is supposed to record information about a measured entity in a study (e.g. cell_id) for this purpose".

My argument is that cell_id as it is in the current spec (as almost all fields in the Cell object are) is designed to capture the ID of the cell from the sequencing/processing pipeline. If a tool produces an ID for a cell in a data set, it should go in cell_id. Just like v_call records the V gene call that the annotation tool produces, and junction_aa records the Junction AA sequence. These are annotations of a measured entity that come from a sequencing/processing pipeline.

If AIRR defines those fields, then AIRR gets to define what is and isn't allowed for those fields. There is no reason that AIRR has to support annotation tools hijacking those fields for their own purpose. The tools either conform and thus are AIRR-compliant, or they don't, in which case they cannot call them AIRR-compliant files.

So this is indeed the question and what we are discussing. Does the AIRR spec define cell_id as the cell_id that is recorded as part of an observation from a sequencing/measurement/data processing step? Or does the AIRR spec define the cell_id as a unique identifier that has nothing to do with such an observation/measurement (and therefore can be overwritten and changed by various tools based on their requirements)?

My argument is that it is valuable to maintain the information about the measured observation as part of the Cell object as well as provide an identifier that uniquely identifies this cell in the broader context of an analysis...

airr-community / airr-standards

We need a more formal, fully-qualified identifiers for repository objects #347