airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Use CURIEs to link Germline to Repertoire/Rearrangement #553

Open bcorrie opened 2 years ago

bcorrie commented 2 years ago

Should resolve Task 3 in #157

See https://github.com/airr-community/airr-standards/issues/157#issuecomment-921872075 for discussion.

schristley commented 2 years ago

That task item is old. It says the germline database link is at the rearrangement level, but this is no longer true: it was moved into DataProcessing a while ago. Unless this is referring to something else...

bcorrie commented 2 years ago

For me this is addressing something else. The issue that this is capturing is formalizing that we will use CURIEs to refer to Germline objects, either Germline sets or Germline Genes - currently we have no formal indication as to what format fields like germline_set_id and germline_database take on and the formats we do have aren't really "computable". A generic string designation is "unsatisfactory" and CURIEs solve this problem 8-)

From https://github.com/airr-community/airr-standards/issues/157#issuecomment-921872075 (with some edits) I suggested we could use CURIE nomenclature for Germline IDs as follows:

bcorrie commented 2 years ago

@williamdlees a question regarding connecting OGRDB Germline Sets to the AIRR Schema. I was just looking at OGRDB's germline sets, and was wondering if I understood how things were working.

On OGRDB the ID of the Germline Set is referenced in the URL as such: https://ogrdb.airr-community.org/germline_set/3

On the History tab for this germline set it says this is G00003.

If I wanted to refer to this Germline Set in the AIRR Schema how would I go about doing that? There seem to be two places where this might occur:

- Genotype.documented_alleles.germline_set_ref
- GermlineSet.germline_set_id

If we used the AIRR CURIE schema, we could have an OGRDB CURIEMap as follows:

  OGRDB_GERMLINESET:
    type: catalog
    default:
      map: OGRDB
      provider: OGRDB
    map:
      OGRDB:
        iri_prefix: "https://ogrdb.airr-community.org/germline_set/"

If I then set Genotype.documented_alleles.germline_set_ref = "OGRDB_GERMLINESET:3" then that CURIE would resolve to:

https://ogrdb.airr-community.org/germline_set/3
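A minimal sketch of that resolution step in Python, assuming a CURIE is split at its first colon and the prefix is looked up in a plain dictionary. `CURIE_MAP` and `resolve_curie` are hypothetical names used for illustration only, not part of the AIRR library; the prefix and IRI are taken from the map above.

```python
# Hypothetical, simplified stand-in for the CURIEMap entry above.
CURIE_MAP = {
    "OGRDB_GERMLINESET": "https://ogrdb.airr-community.org/germline_set/",
}


def resolve_curie(curie: str, curie_map: dict = CURIE_MAP) -> str:
    """Resolve a CURIE by replacing its prefix with the mapped IRI prefix."""
    # Split at the first colon only; everything after it is the reference.
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in curie_map:
        raise ValueError(f"Unknown or missing CURIE prefix in {curie!r}")
    return curie_map[prefix] + reference


print(resolve_curie("OGRDB_GERMLINESET:3"))
```

With the map above, the call resolves to the same UI URL for germline set 3.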

Now this is different than what you have in the AIRR Spec description, as the description for Genotype.documented_alleles.germline_set_ref says:

        germline_set_ref:
            type: string
            description: Unique identifier of the germline set and version, in standardized form (Repo:Label:Version)
            example: OGRDB:Human_IGH:2021.11
            x-airr:
                nullable: false

What are your thoughts on using the CURIE mechanism above to resolve this field? If you look at the versions tab on OGRDB for this Germline Set it has all the above information:

[BALB/c IGH](https://ogrdb.airr-community.org/germline_set/3) | Mouse | BALB/c | IGH | 1 | 2022-02-28

williamdlees commented 2 years ago

Pasting this here as the mail reply didn't make it into the thread.

It’s probably best to get the set from the REST API at https://ogrdb.airr-community.org/api/ rather than the UI. Sorry, I could publish this a bit better, I will put some details on the Germline Sets page for a start.

The germline set will always have an identifier G followed by a number and the identifier will not change between versions.

From the API you’d retrieve the set as, for example, https://ogrdb.airr-community.org/api/germline/set/G00003/1 . It sounds as though this would map quite nicely – maybe OGRDB_GERMLINESET:G00003:1 ??

If that’s ok I can change the examples

bcorrie commented 7 months ago

@williamdlees has this been resolved? I am triaging AIRR v2.0 issues 8-)

bcorrie commented 7 months ago

Currently it doesn't look like germline_set_ref is a CURIE mappable entity as there are two levels to the reference (e.g. OGRDB:Human_IGH:2021.11) so maybe not. But if that is the case for v2.0 we should remove this issue from v2.0 and perhaps close it if that field can't be mapped with a CURIE.

williamdlees commented 7 months ago

See my note from 2022(!) in the thread. I am no expert in CURIEs, but if representing them in the way I suggest is compatible with the way you outline further up the thread, there’s no work involved and it’s very do-able.

bcorrie commented 7 months ago

@williamdlees I am almost 100% sure that CURIEs only have a single IRI tag followed by a single identifier. So something like "OGRDB_GERMLINESET:G00003:1" mapping to "https://ogrdb.airr-community.org/api/germline/set/G00003/1" would not be valid CURIE processing/parsing.

Don't get me wrong, the ID "OGRDB_GERMLINESET:G00003:1" is easily parsed as an ID, but it does not fit the CURIE format. If that was a CURIE and the IRI tag "OGRDB_GERMLINESET" was mapped to "https://ogrdb.airr-community.org/api/germline/set/" then this would resolve to:

https://ogrdb.airr-community.org/api/germline/set/G00003:1

I think it is fine to have the ID as you have it defined if that fits your needs. It just isn't CURIE parseable, and it can't go into the CURIEMap object in the spec.
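To make the mismatch concrete, here is a minimal Python sketch, assuming plain prefix substitution with a split at the first colon. The `expand` helper is hypothetical and written only for this illustration; the prefix, ID, and URLs are the ones discussed above.

```python
API_PREFIX = "https://ogrdb.airr-community.org/api/germline/set/"


def expand(curie: str, prefix_name: str, iri_prefix: str) -> str:
    """Expand a CURIE by plain prefix substitution (hypothetical helper)."""
    # Everything after the first colon is treated as one opaque reference.
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix != prefix_name:
        raise ValueError(f"{curie!r} does not use prefix {prefix_name!r}")
    return iri_prefix + reference


expanded = expand("OGRDB_GERMLINESET:G00003:1", "OGRDB_GERMLINESET", API_PREFIX)
wanted = "https://ogrdb.airr-community.org/api/germline/set/G00003/1"

print(expanded)            # the second colon stays in the reference
print(expanded == wanted)  # so the result is not the API URL for version 1
```

The second colon is carried into the IRI unchanged, so the expansion does not reach the `/G00003/1` endpoint.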

So we could consider this resolved as is. We have decided that CURIEs don't fit the needs of germline set IDs. Therefore we don't need to change your ID definition and we don't need to update the CURIEMap. I think that is the simplest path forward. This can always be revisited later...

schristley commented 7 months ago

And this is somewhat of an aside, but as part of the AKC work, the OGRDB API needs to be reviewed and updated to bring it more in compliance as well as add missing functionality. It might be more efficient to do all that together instead of piecemeal.

Nevertheless, my opinion is that germline_set_ref should be a CURIE mappable entity.

williamdlees commented 7 months ago

It’s not a question of fitting my needs, the choice of : as a delimiter between the germline set and version was arbitrary. Is there a convention for that delimiter in the curie world? If so I am happy to follow it, otherwise we can just choose something that won’t crash the syntax, maybe . or /.  I’m happy to make the change.

bussec commented 7 months ago

@williamdlees The relevant documentation can be found here:

In a nutshell: You can have a : as part of the reference, but not of the prefix. To avoid potential confusion (and overly simplified parsing routines), it would be best to avoid having more than a single colon. . and / should be safe.

bcorrie commented 7 months ago

> In a nutshell: You can have a : as part of the reference, but not of the prefix. To avoid potential confusion (and overly simplified parsing routines), it would be best to avoid having more than a single colon. . and / should be safe.

@bussec is that correct? Would '/' not cause problems? CURIEs rely on IRIs and '/' is a special character in IRI space. If you have a '/' in the CURIE reference it would be interpreted as a '/' in the IRI and interpreted as an IRI path, no?

Now I suppose if you had OGRDB_GERMLINESET:G00003/1 and "OGRDB_GERMLINESET" was mapped to "https://ogrdb.airr-community.org/api/germline/set/" then this would resolve to an IRI as:

https://ogrdb.airr-community.org/api/germline/set/G00003/1

That is what @williamdlees is looking for, and it would work I suppose, but encoding an IRI path in the CURIE reference doesn't seem to be how CURIEs were intended to be used?

bcorrie commented 7 months ago

> It’s not a question of fitting my needs, the choice of : as a delimiter between the germline set and version was arbitrary. Is there a convention for that delimiter in the curie world?

@williamdlees What I am unclear on is whether there are two name/ID spaces, each with its own set of identifiers ("germline set" and "version"), or whether there can be a single name/ID space ("versioned germline set"). With one name space you could have:

OGRDB_GERMLINESET:G00003-1 with "OGRDB_GERMLINESET=https://ogrdb.airr-community.org/api/germline/set/"

Would give you: https://ogrdb.airr-community.org/api/germline/set/G00003-1

Or

OGRDB_GERMLINESET:G00003-1 with "OGRDB_GERMLINESET=https://ogrdb.airr-community.org/api/germline/set?"

Which would give you a get query: https://ogrdb.airr-community.org/api/germline/set?G00003-1

The query approach might be the better one, as then you can parse the ID in whatever way you want. You could encode whatever you wanted in the ID and the query would parse it and return the correct information for that ID.

Note if you needed both API interfaces, you could make it such that:

https://ogrdb.airr-community.org/api/germline/set?G00003-1

https://ogrdb.airr-community.org/api/germline/set/G00003/1

gave the same information, the first being the one that was used for CURIE resolution.
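As a rough sketch of how a single "versioned germline set" ID could serve both interfaces, the Python below parses an ID such as G00003-1 and builds the two URLs. It assumes the ID is "G" plus digits, a hyphen, and an integer version; that pattern, and the names `parse_versioned_set_id`, `query_url`, and `path_url`, are assumptions made for illustration and are not part of the OGRDB API.

```python
import re

API_BASE = "https://ogrdb.airr-community.org/api/germline/set"

# Assumed ID shape: "G" + digits, a hyphen, then an integer version.
VERSIONED_ID = re.compile(r"(G\d+)-(\d+)")


def parse_versioned_set_id(reference: str) -> tuple:
    """Split a versioned ID such as 'G00003-1' into (set_id, version)."""
    match = VERSIONED_ID.fullmatch(reference)
    if match is None:
        raise ValueError(f"Not a versioned germline set ID: {reference!r}")
    return match.group(1), int(match.group(2))


def query_url(reference: str) -> str:
    """URL produced by CURIE expansion with the '...set?' prefix."""
    return f"{API_BASE}?{reference}"


def path_url(reference: str) -> str:
    """Equivalent path-style URL, built after parsing the ID."""
    set_id, version = parse_versioned_set_id(reference)
    return f"{API_BASE}/{set_id}/{version}"


print(query_url("G00003-1"))
print(path_url("G00003-1"))
```

The CURIE expansion never has to understand the ID; only the service behind the query URL needs a parser like the one sketched here.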

williamdlees commented 6 months ago

Thanks Brian. Sorry for the delay in replying. I have looked at this from time to time and failed to come to a decision on which approach to take, because it doesn’t make much odds from a coding point of view. I think the one you suggest is probably the cleanest:

OGRDB_GERMLINESET:G00003-1 with "OGRDB_GERMLINESET=https://ogrdb.airr-community.org/api/germline/set?"

Which would give you a get query: https://ogrdb.airr-community.org/api/germline/set?G00003-1

bussec commented 6 months ago

@bcorrie After some meditation on the sacred scripture of RFC3987 and the epiphany that their notation is indentation-sensitive, one correction and one comment to my previous statement:

bcorrie commented 6 months ago

@williamdlees I think that would work. I think all we would need to do is update the descriptions for the three instances of germline_set_ref in the spec, is that correct?

javh commented 2 days ago

@williamdlees, can we remove this from the AIRR v2.0 milestone and move it to the AKC milestone? Do we need this functionality in the AIRR schema for v2.0? Do you think we can resolve questions about prefix and target URI?

williamdlees commented 2 days ago

I'm happy to work with whatever milestone makes sense to the group.

You may be aware that, with substantial input from Scott, we have drafted and implemented a revised API for OGRDB which is OpenAPI 3 compatible. Details here: https://ogrdb.airr-community.org/api_v2/swagger/.

I'm afraid I don't feel confident to draft a CURIE definition that will pass muster with the group, as the history on this thread shows, but if someone else can propose one that is satisfactory, and complies with our API schema, I am very happy to implement it in OGRDB.