airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Change germline_set_ref to be a CURIE #770

Open bcorrie opened 4 months ago

bcorrie commented 4 months ago

Closes #553

bcorrie commented 4 months ago

@williamdlees I made an attempt to make germline_set_ref a CURIE. It passes checks; have a look at what I did and see if it seems OK.

bcorrie commented 4 months ago

The main issue I see is that the python test data has IDs from IMGT for germline_set_ref of the form:

"germline_set_ref": "IMGT:Homo sapiens:2022.1.31"

Would this be referring to one of the archives listed here? https://www.imgt.org/download/V-QUEST/archives/

bcorrie commented 4 months ago

Also, the CURIE PREFIX I used is OGRDB_GERMLINESET. CURIEs are supposed to make things short, and this prefix is long but if you are going to have other things that you want to look up (Alleles?) using a CURIE, we need something descriptive and different.

                "allele_description_id": "OGRDB:A00301",
                "allele_description_ref": "OGRDB:Mouse_IGH:IGHV-2DBF",

williamdlees commented 4 months ago

Thanks Brian. This looks good to me and I think it is the way to go. If other providers of germline sets (I can only think of Mixcr and IMGT for the time being) wish to offer sets in MiAIRR format, we can encourage them to provide a standardised URL and create a CURIE. But until they do, it isn't really an issue. And we do have other fields for name, date and version that can fill in if necessary.

schristley commented 4 months ago

Hi @williamdlees, my plan for our AKC 1-on-1 next week was to review this with you in the context of the OGRDB API, because right now what's in the pull request does not resolve to a URL that returns the germline set. We will need to change the PR and/or the API so that it does. It won't be hard to do.

schristley commented 4 months ago

> Also, the CURIE PREFIX I used is OGRDB_GERMLINESET. CURIEs are supposed to make things short, and this prefix is long but if you are going to have other things that you want to look up (Alleles?) using a CURIE, we need something descriptive and different.

@bcorrie My suggestion is to not make separate CURIEs for each data type. If you remember how James described it, there is a global part and a local part. OGRDB: is a sufficient prefix for everything the global OGRDB service provides, while the local part is something @williamdlees will have control over. That design allows OGRDB to provide additional services down the road without requiring new CURIE prefixes; instead, the local part can be enhanced.
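The global/local split described above can be sketched in a few lines of Python. This is only an illustration: the prefix map, the base URL it points at, and the local-part layouts (`set/...`, `allele/...`) are assumptions, not the registered AIRR CURIE map or the actual OGRDB API.

```python
# Hypothetical prefix map. The base URL and the local-part layouts used below
# are assumptions for illustration, not the actual AIRR CURIE map or OGRDB API.
CURIE_MAP = {"OGRDB": "https://ogrdb.airr-community.org/"}


def expand_curie(curie, curie_map=CURIE_MAP):
    """Expand 'PREFIX:local' to a full IRI using the prefix map."""
    # Split on the first colon only, so the local part may itself contain colons.
    prefix, sep, local = curie.partition(":")
    if not sep or not prefix or not local:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in curie_map:
        raise KeyError(f"unknown CURIE prefix: {prefix!r}")
    return curie_map[prefix] + local


# One global prefix serves several kinds of resource; only the local part differs.
print(expand_curie("OGRDB:set/Human/IGH_VDJ/8"))
print(expand_curie("OGRDB:allele/A00301"))
```

Under this design, adding a new kind of resource means defining a new local-part layout on the service side; the prefix map itself stays a single entry.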

williamdlees commented 4 months ago

I’ve been thinking about Scott’s helpful comments.

Currently the field is described as ‘Unique identifier of the germline set and version, in standardized form (Repo:Label:Version)’. I think the overall motivation of this change is to provide a URI that will download the set, rather than this standardized form. That seems a reasonable thing to do, and doesn’t impact any user code out there, because the standardized form isn’t really usable by code today. An example of a URL from OGRDB, which works today, would be https://ogrdb.airr-community.org/api/germline/set/Human/IGH_VDJ/8/airr_ex. As I understand a CURIE, it would provide a shorthand that avoids the need to write out the URL in full in the rest of the schema definition. The thing that worries me is that it bakes OGRDB into the definition.

We don’t intend OGRDB to be the only source of germline sets. It would be great if others, for example IMGT and MiLabs, started to support the MiAIRR standard for germline sets. However, if they do this, would we need to add CURIEs to the schema definition for their systems? To do so would be much more effort on our part than we save by adding a URL shortcut. And I fear it might also discourage them, and send a general message that, for the MiAIRR standard, OGRDB is the single repository for reference sets.

Is there a way to use a CURIE optionally in a URI? If so, this might be a way forward, although I don’t see much benefit in the CURIE here myself. The alternative would be to change the field description to ‘URL of the germline set in the repository from which it can be downloaded’ and use the URL above as an example.
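To make the contrast between the two field formats concrete, here is a minimal Python sketch. `parse_standardized_ref` and `is_downloadable` are hypothetical helpers, not part of the AIRR library, and the parser assumes that none of the three parts contains a colon. The standardized form names a set, but unlike the URL it gives code nothing to fetch.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class GermlineSetRef(NamedTuple):
    repo: str
    label: str
    version: str


def parse_standardized_ref(ref: str) -> GermlineSetRef:
    """Split the current 'Repo:Label:Version' form (assumes no colons in parts)."""
    parts = ref.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Repo:Label:Version, got {ref!r}")
    return GermlineSetRef(*parts)


def is_downloadable(ref: str) -> bool:
    """True if the reference is an http(s) URL that code could fetch directly."""
    parsed = urlparse(ref)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


legacy = "IMGT:Homo sapiens:2022.1.31"
url = "https://ogrdb.airr-community.org/api/germline/set/Human/IGH_VDJ/8/airr_ex"

print(parse_standardized_ref(legacy))
print(is_downloadable(legacy), is_downloadable(url))
```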

All the best

William

schristley commented 4 months ago

> I think the overall motivation of this change is to provide a URI that will download the set, rather than this standardized form.

That's correct, though there is a bit more to this. We want germline sets to conform to the FAIR principles, and more importantly for the OGRDB service (and the data it provides like germline sets) to conform to the FAIR principles.

> As I understand a CURIE, it would provide a shorthand that avoids the need to write out the URL in full in the rest of the schema definition.

Exactly, it is primarily shorthand for the full URL, but it is important not to think of it purely as "URL to download the germline set". It is more than that. It is a permanent ID so that if you look at two studies, you can simply compare the IDs to know that they are using the exact same germline set, or not. It is also "F"indable and "R"eusable because I can download the exact same germline set based upon its ID. Some people tend to interpret the "R" as reproducibility.
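The comparison described above amounts to an equality test on the identifiers, as in this rough Python sketch. The record layout, the prefix map, and the version 7 reference are assumptions made up for the example; the only idea taken from the discussion is that a CURIE and its expanded IRI name the same germline set, so both are normalized to the IRI before comparing.

```python
# Hypothetical prefix map; the base URL is an assumption for illustration.
CURIE_MAP = {"OGRDB": "https://ogrdb.airr-community.org/api/germline/set/"}


def to_iri(ref, curie_map=CURIE_MAP):
    """Return ref unchanged if it is already an http(s) IRI, else expand it as a CURIE."""
    if ref.startswith(("http://", "https://")):
        return ref
    prefix, _, local = ref.partition(":")
    return curie_map[prefix] + local  # raises KeyError for an unknown prefix


def same_germline_set(study_a, study_b):
    """Two studies used the same set exactly when their normalized IDs are equal."""
    return to_iri(study_a["germline_set_ref"]) == to_iri(study_b["germline_set_ref"])


# Hypothetical study records carrying only the field of interest.
study_a = {"germline_set_ref": "OGRDB:Human/IGH_VDJ/8/airr_ex"}
study_b = {"germline_set_ref": "https://ogrdb.airr-community.org/api/germline/set/Human/IGH_VDJ/8/airr_ex"}
study_c = {"germline_set_ref": "OGRDB:Human/IGH_VDJ/7/airr_ex"}

print(same_germline_set(study_a, study_b))
print(same_germline_set(study_a, study_c))
```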

> The thing that worries me is that it bakes OGRDB into the definition. We don’t intend OGRDB to be the only source of germline sets. It would be great if others, for example IMGT and MiLabs, started to support the MiAIRR standard for germline sets. However, if they do this, would we need to add CURIEs to the schema definition for their systems?

Yes, if they want to conform to the FAIR principles, which they should. It is easy for us to add CURIEs for their systems. The "I"nteroperability of FAIR is where the standard schemas and formats become important. But interoperability doesn't imply that everybody has to agree and conform to the same schema; there are alternative ways to be interoperable.

> To do so would be much more effort on our part than we save by adding a URL shortcut. And I fear it might also discourage them, and send a general message that, for the MiAIRR standard, OGRDB is the single repository for reference sets. Is there a way to use a CURIE optionally in a URI? If so, this might be a way forward, although I don’t see much benefit in the CURIE here myself. The alternative would be to change the field description to ‘URL of the germline set in the repository from which it can be downloaded’ and use the URL above as an example.

I'm not sure why you think this would be more effort on our part. It's really effort on their part. The push toward FAIR-ness is pretty much unstoppable at this point, and if they want to be used and relevant then they need to conform to the FAIR principles. Note that this is not a blanket statement that "thou must use the AIRR standard". And while "standard" can be used as a bludgeon to keep people in line, it can also just mean that a group agrees to the same set of general principles. While the field does not (technically) require that germline sets be in the AIRR standard format, there is that implication, because this is the AIRR Community and these are our AIRR Standards.

williamdlees commented 4 months ago

On 3 Mar 2024, at 21:10, Scott Christley wrote:

williamdlees commented 4 months ago

Scott, I take your point about the URI being a permanent identifier, but to my mind that has nothing to do with CURIEs. Do we have to use a CURIE to comply with FAIR? If not, I would much rather not in this instance, on the grounds that (1) we save no appreciable effort by using a CURIE here, and (2) it does take effort to check out the schema, make a change to it, coordinate with other changes, and do all the other things we end up doing when making a change to the standard.

Thanks for your help

William

schristley commented 4 months ago

> Do we have to use a CURIE to comply with FAIR?

Nope, it is just a useful shorthand. In fact, IEDB isn't using CURIEs, you can see in their export table that they provide IRIs.

There are some small advantages to CURIEs: 1) a CURIE is shorter and thus uses less space in a database. That is likely not relevant for germline sets, but imagine rearrangements, where you could be talking about GBs of extra data (I don't know if IEDB stores the complete IRI in the database; it may actually just store IDs and construct the IRI when generating the export, which makes this point moot). 2) If there's ever a crazy reason for https://ogrdb.airr-community.org to be moved, changing the CURIE pointer is a lot easier than rewriting all of the IRIs. But these really are small points. And on the flip side, there's an advantage to just having the IRI because you don't need to do the CURIE resolution.
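Both small advantages can be illustrated with a short Python sketch. The prefix map and the relocated host `ogrdb.example.org` are hypothetical; the point is only that stored CURIEs stay untouched when the map entry changes, whereas stored IRIs would each need rewriting.

```python
def compact(iri, curie_map):
    """Shorten an IRI to a CURIE using the longest matching base URL."""
    best = None
    for prefix, base in curie_map.items():
        if iri.startswith(base) and (best is None or len(base) > len(curie_map[best])):
            best = prefix
    if best is None:
        return iri  # no known prefix: keep the full IRI
    return f"{best}:{iri[len(curie_map[best]):]}"


def expand(curie, curie_map):
    """Resolve a CURIE back to an IRI; this is the extra step a plain IRI avoids."""
    prefix, _, local = curie.partition(":")
    return curie_map[prefix] + local


# Hypothetical prefix map entry for the current host.
curie_map = {"OGRDB": "https://ogrdb.airr-community.org/"}
iri = "https://ogrdb.airr-community.org/api/germline/set/Human/IGH_VDJ/8/airr_ex"

stored = compact(iri, curie_map)
print(stored, len(iri) - len(stored))

# If the service moved (hypothetical new host), only the map entry changes;
# the stored CURIE is left as it is and resolves to the new location.
moved_map = {"OGRDB": "https://ogrdb.example.org/"}
print(expand(stored, moved_map))
```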