Should we use a single gene ID or use CURIEs to support multiple gene IDs? (Does anyone need entrez id for genes?)

Relequestual commented 9 years ago

Entrez ID is just numeric. Other forms of ID are prefixed with a sort of identifier. Some other gene identifiers are also just numeric, although Entrez ID is currently the only purely numeric one that's accepted. Unless someone specifically requires it, I think we should remove it.

If we do need it, then we should either have prefixes for the ID, or require the attribute name to be the type of ID.

cc @MatchmakerExchange/committers

mellybelly commented 9 years ago

Elsewhere in the schemas we have ExternalIdentifier, defined here: https://github.com/ga4gh/schemas/blob/4220e94d7e87fae0aa45281c7b560a6362b38293/src/main/resources/avro/common.avdl

I think we should use a CURIE.

Relequestual commented 9 years ago

I agree, but the intention of this specific issue was to sidestep that question for now and re-address it in the next version iteration. If none of the current implementers require Entrez ID, then we can remove it for now.

As it stands, others have started including additional attribute names for alternate ids, prefixed with an underscore.

If you think supporting n gene ID types would be beneficial, then by all means create an issue for it. I would be genuinely intrigued to hear justification for such, as I can't think of any. =]

Relequestual commented 9 years ago

Misclicked. Didn't intend to close the issue.

fschiettecatte commented 9 years ago

The problem with removing Entrez Gene ID is that there isn’t a 1-1 mapping between the three identifiers we use, namely Entrez Gene ID, Ensembl Gene ID and Gene Name. Gene Name is doubly messy because names can change and can be reused. Additionally there may be an Entrez Gene ID for which there is no corresponding Ensembl Gene ID/Gene Name, and vice versa. That being said, these could be edge cases: I identified 35 issues out of 39,439 entries in the mapping table we use in GeneMatcher.

Relequestual commented 9 years ago

Decipher considers HGNC as the data authority on genes. We don't recognise any genes which do not have a HGNC ID. We have mappings from other IDs to HGNC ID.

We also keep old names for HGNC IDs, which for us handles those problem cases.

fschiettecatte commented 9 years ago

Ok, I’ll toss out a strawman suggestion, how about switching to simply using the HGNC ID in the next release of MME?

Personally I am not bound to any particular ID, and would be happy to use a single one if it gives us coverage and immutability.

Relequestual commented 9 years ago

I'd be happy with that. We should bring this up as an agenda item for the call on Friday.

Would work for 1.0.1 as the change would be backwards compatible, but I expect we will want to add more changes and just go straight to 1.1 rather than having several incremental releases.

fschiettecatte commented 9 years ago

That being said, Decipher is the only resource I know of that uses the HGNC ID. Everything else I have had to deal with while building OMIM uses either the Entrez Gene ID, the Ensembl Gene ID, or the Gene Name (which is not smart). For example, NCBI uses the Entrez Gene ID pretty exclusively for identifying genes in their data. So we may be stuck with a multi-ID world.

jawahar1 commented 9 years ago

We thought using HGNC would be unambiguous since they're supposed to be the official gene nomenclature provider. However, since there are so many names from so many different sources all referring to the same gene, it may be best to go with multiple IDs for now. To avoid confusion, maybe the identifier should have a key giving the source of the identifier.

mellybelly commented 9 years ago

You can easily reconcile identifiers from these sources. Can we just allow any of these three (i.e. Entrez, Ensembl, and HGNC) and then use a CURIE?

buske commented 9 years ago

I agree we should use CURIEs (regardless of whether we support one or multiple identifiers).

I really like your strawman, @fschiettecatte. I would much prefer having a single ID format on the wire wherever reasonable because it simplifies everything. If we support N formats on the wire, every service must be able to convert between all N wire formats and their internal format. If we support 1 format on the wire, every service only has to convert between that wire format and their internal format, and only when the two are different. If the service wants to allow the end user to specify genes in multiple formats, that's their prerogative, and they'll have to be able to convert between those formats and some internal format anyway to enable matchmaking. Because we are in a primarily clinical domain, where candidate genes are often specified simply by a gene name, I would suggest this wire format for "gene concepts" be the HGNC ID.

fschiettecatte commented 9 years ago

@buske it is a strawman, and I did take it apart in a subsequent comment. I don't think we can live with a single ID because there isn't a 1-1 correspondence between ID spaces. I think it makes sense to phase out gene symbol in favor of HGNC ID in the future.

buske commented 9 years ago

@fschiettecatte I saw that, but I didn't quite consider that as a deal-breaker per se. These incongruities already have to be handled by each site, and this handling is currently done independently and ad hoc.

But fair enough. I just realized how similar this conversation is to https://github.com/MatchmakerExchange/mme-apis/issues/62. Happy to just stick to adding CURIEs for now.

Relequestual commented 9 years ago

I have some new information for our consideration!

Previously we had a heated discussion about the fact that the HPO is updated "whenever" and there are no stable releases. I think we broadly agreed that we would prefer stable. This is the same case with HGNC gene names. These may be updated nightly or whenever.

Ensembl gene IDs however are only updated with the 2.5 / 3 monthly release cycle of Ensembl. However, we also support genes which potentially have no Ensembl ID.

There are some problems with mapping between HGNC and Ensembl, especially as you can't determine the assembly the gene is found in with HGNC. For some genes it is not possible to do a "round trip" translation between Ensembl and HGNC. Some you would have to confirm with the synonyms.

There is also a situation where there are two genes, with different Ensembl IDs, but the same HGNC IDs, because the location is the same (but the strand is opposite).

The implications are confusing for sure. I'm not really sure of a best solution. I guess most of the time, calculations and such on a variant are "best guess" efforts, including our matching algorithms.

I think I'd have to go with CURIEs as potentially the best solution. I notice that in @fschiettecatte's implementation for GeneMatcher, they also provide the alternate IDs where available.

Realistically, because of the non-existent stable release cycles of some sources, we're always going to end up with a situation where either the gene is "wrong" or not recognised by the receiving database. I feel that sending multiple IDs is a way to attempt to work round this issue.

Because Ensembl's release cycle seems the most stable, I would put forward using that as the required ID format, but allowing it to have a value of "NA" and providing additional IDs, using CURIEs, where an Ensembl ID is not available. Ensembl still supports GRCh37, which is an important factor. Making Ensembl ID the "first port of call" feels like a sensible approach to covering the majority of cases, making responses as quick as possible, most of the time.

I realise this suggestion may sound less than ideal, but we're faced with a less than ideal situation, and less than ideal ways of attaining an authoritative full list of genes.

fschiettecatte commented 9 years ago

Thanks @Relequestual, that explains very well what I was driving at. The mapping issues exist between all IDs, and this is compounded by HGNC having both curated and non-curated mappings. A further funky situation exists with Entrez and Ensembl mappings, where EBI asserts a mapping but NCBI does not.

We use Ensembl IDs within PhenoDB, and use multiple IDs within GeneMatcher. GeneMatcher accepts Gene Names, Ensembl IDs and Entrez Gene IDs. And there is a weekly process which keeps the IDs up to date so we always have current data.

fcunningham commented 9 years ago

If you have a list of specific genes that are missing from Ensembl then do send them to helpdesk@ensembl.org. Mapping HGNC ids to Ensembl genes, well to any resource, is a tricky work in progress but if you have data that can help this then send it along. thanks.

simonbrent commented 9 years ago

Hi @fcunningham ,

The problem is not precisely that Ensembl is missing genes. To aid understanding fully, I will detail the process I went through when we initially uncovered this issue.

We have a variant in Decipher with a VEP annotation which gave a consequence in gene CXorf59, on GRCh37.
At some point after this annotation was done, HGNC changed the name of CXorf59 to CHDC2.
On GRCh38, three genes were merged together: CXorf22, CHDC2, CXorf30 (listed here from lowest to highest start) to become CFAP47. You can see this gene in HGNC here: http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=HGNC:26708
This page then has a link to Ensembl GRCh38: http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000165164;r=X:35919734-36385319
HGNC doesn't have entries for GRCh37-only genes like CHDC2, since they are now synonyms for the new genes on GRCh38, but in Decipher we want the GRCh37 gene name, and its position
We therefore follow the link to Ensembl GRCh38, and see another link to "View this locus in the GRCh37 archive". This link takes us here: http://grch37.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000165164;r=X:35937851-36008269
We are now looking at gene CXorf22, rather than CHDC2, since Ensembl only provided one link, despite CFAP47 being made up of three genes from GRCh37
We compare the position of this gene we have found (CXorf22) to the position of the original variant, and find that the variant is not in this gene
Our website gets confused and things go a bit wrong

This, I feel, is less of an issue of mapping between Ensembl and HGNC as it is of mapping between Ensembl GRCh38 and Ensembl GRCh37.

There are also 7 pairs of genes in Ensembl GRCh37 which map to the same HGNC name (i.e., 14 stable ids for only 7 HGNC display labels), all of which look rather dubious to me - 12 of them are miRNA genes, either close together or co-located but on different strands, but one pair - UGT2A1 - are co-located on the same strand, along with a third gene - UGT2A2, which to me look like they could all be one gene (see here: http://grch37.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000173610;r=4:70454135-70518965 the bottom transcript is in a different gene with the same name to the highlighted ones at the top).

However, none of this really addresses our issue, which is this: to date we have been using HGNC as a starting point for finding genes to put into our database, then somehow (I didn't write the script that does this and have never seen it) going from there to Ensembl GRCh37 to find positions, and then falling back to going to UCSC if no Ensembl link is found.

This is obviously a flawed approach, since UCSC's genes all appear to be in GRCh38 coordinates, and so don't match the rest of our database.

I am therefore planning to change our process to start with Ensembl GRCh37, find all the genes with HGNC names, and grab those names plus Ensembl's positions. As Ensembl appears to be the only reliable source of GRCh37 left on the internet, this seems like the only way to know for sure that you've got GRCh37 positions for your GRCh37 gene names. The obvious downside is that we end up with legacy gene names, which HGNC has now replaced, but it seems more important to me to be internally consistent than to have names which relate to genes which may not be in the same place as the genes we have.

I would suggest that anyone else who wants to support GRCh37 genes and cares about their positions needs to follow a similar process, where they start with a resource that declares the gene positions and has an HGNC name attached, and then verify these names against the HGNC dataset (checking symbol, alias_name and prev_symbol) to ensure that these names are actually in HGNC.
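The verification step described above could be sketched roughly as follows. This is only an illustration: the record layout (a list of dicts with `hgnc_id`, `symbol`, `alias_name` and `prev_symbol` keys) is an assumption rather than the actual HGNC download format, the function name `find_in_hgnc` is hypothetical, and the sample record is built from the names mentioned in this comment rather than copied from HGNC.

```python
from typing import Optional

# Fields to search, in the order listed in the comment above.
HGNC_FIELDS = ("symbol", "alias_name", "prev_symbol")


def find_in_hgnc(name: str, records: list[dict]) -> Optional[tuple[str, str]]:
    """Return (hgnc_id, matched_field) for the first record containing name, else None."""
    for field in HGNC_FIELDS:
        for record in records:
            value = record.get(field, [])
            # A field may hold a single string or a list of strings.
            values = [value] if isinstance(value, str) else value
            if name in values:
                return record["hgnc_id"], field
    return None


# Illustrative record only (assumed layout, not real HGNC data).
records = [
    {
        "hgnc_id": "HGNC:26708",
        "symbol": "CFAP47",
        "alias_name": [],
        "prev_symbol": ["CXorf59", "CHDC2"],
    }
]

print(find_in_hgnc("CHDC2", records))
```

Returning the matched field as well as the ID lets the caller tell a current symbol apart from a legacy name that HGNC has since replaced.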

pdl commented 9 years ago

@relequestual asks "Which gene IDs should be used to specify a gene?"

Well, what is a gene?

A gene is a portion of DNA which is thought to have one or more functions and to which a name has been given. Sometimes multiple names, even by a single authority. Our understanding of where a gene is, how big it is and what its function is changes over time, especially across reference assemblies. One name may be replaced with multiple terms, or vice versa, over time. Whatever the scientific justification (and apologies if I've abused the science to make a point), there's an uncomfortable amount of uncertainty for us as developers working with genes, and we need to plan for multiple different suboptimal scenarios.

Let's step back.

Clients have terms in one given vocabulary but are seeking information from servers which may internally use another vocabulary. The coverage of that vocabulary may differ, and a term in one vocabulary may be partially mapped, incorrectly mapped or mapped to multiple terms in the other vocabulary.

Under what circumstances should a request with a term of this sort be processed?

The answer in some cases will depend on the nature of the request and the system, but there are some things we can say for sure. The scenarios as I see them are as follows

There is an accurate understanding between the client and the server about what the term represents (this is the best case scenario), in which case the request should be processed. Obviously, we should maximise the chances of this scenario.
The client refers to a term which the server has no prior knowledge of. In this case, processing typically can not and should not be attempted, and an error should be returned instead.
The client refers to a term and the server has multiple records which match this term. The server may refuse to process, especially if the operation is costly, irreversible, etc., and may indicate how the client might resolve the ambiguity (e.g. by providing options; but this might be better served by a different request). Alternatively, processing may take place but the server should in the response indicate how it resolved the ambiguity (again, it might also indicate alternatives, but this is not strictly necessary).
Both client and server believe a term to have a single meaning but they differ on what this meaning is. This is the worst-case scenario.

It is essential that the client on receiving the response can determine which of these four scenarios it represents. Because the server cannot distinguish between the first and the fourth case, the server must make reference in its response to a term which is accurate and unambiguous in relation to the information it is returning. The client then has as much information as it needs in order to decide whether the response is sufficient to meet its needs.

Therefore either:

A server must only process requests which use terms which are guaranteed to be unambiguously equivalent to items in its own controlled vocabulary, or
A server may process requests which use terms which are ambiguous, provided it has a way of confirming unambiguously how it processed the request

In either case it is necessary to be able to refer to unambiguous meanings from a controlled vocabulary.

Because there are multiple controlled vocabularies in existence, for an interoperable solution, an unambiguous term needs to consist of two parts: a unique identifier for the controlled vocabulary, and the unique identifier within the vocabulary. Not all vocabularies will be suitable for using as identifiers because their terms are not necessarily unambiguous identifiers (which I suspect might be the case for HGNC, at least unless it starts versioning).

So yes, CURIEs look like a good start.
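As a rough illustration of the two-part term described above, a CURIE can be split into a vocabulary prefix and a local identifier. The names `Curie` and `parse_curie` are hypothetical, and the prefixes in the usage lines are assumptions; nothing here is part of the MME specification.

```python
from typing import NamedTuple


class Curie(NamedTuple):
    prefix: str    # identifies the controlled vocabulary
    local_id: str  # identifier within that vocabulary


def parse_curie(text: str) -> Curie:
    """Split 'prefix:local_id' on the first colon; reject bare identifiers."""
    prefix, sep, local_id = text.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {text!r}")
    return Curie(prefix, local_id)


print(parse_curie("HGNC:26708"))
print(parse_curie("Ensembl:ENSG00000165164"))
```

A bare numeric ID such as an unprefixed Entrez Gene ID is rejected by this sketch, which is the ambiguity the prefix is meant to remove.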

buske commented 9 years ago

Thank you, @pdl. I think you phrased the broader concerns and context nicely. One of the things I struggle with within the MME is that, although the clients of the API are other servers, it seems desirable to tailor the sort of data being transferred and specified to the sort of data the end users are providing (clinicians and researchers). Unlike the main GA4GH APIs, in most cases the data the matchmaking services have is not generated from automated methods, and are therefore frequently underspecified things, like gene symbols.

Where in the pipeline do we attempt to resolve this ambiguity? Do we put the burden on the services and require everything on the wire to be unambiguous (and just leave out data that can't be uniquely resolved)? Or do we corrupt the wire by making it accurately capture the data (and the uncertainty in the data) that we do have? The latter seems ideal in theory, but in practice?

@fschiettecatte, GeneMatcher has done an excellent job of being interoperable with multiple ID systems. My temptation is to try to standardize for simplicity's sake, even if it means losing expressiveness, but I should probably quell those authoritarian leanings.

@Relequestual, I like the idea of having Ensembl IDs be a soft requirement. This should get us a single wire format for 99% of the cases. In cases where a gene cannot be correctly labeled with an Ensembl ID, we can just use the same field with a CURIE. Not sure what you meant by NAs though, since I don't think that's necessary.

pdl commented 9 years ago

Do we put the burden on the services and require everything on the wire to be unambiguous (and just leave out data that can't be uniquely resolved)?

I suspect this is not actually an option unless we all decide we will all use one authority internally as well as externally and refuse to carry out cross-assembly matches. (And judging by @fschiettecatte's comments, I think this would be a problem for at least GeneMatcher and DECIPHER if not others as well).

I say this is not an option because there are three potential sources of ambiguity and at least one of them cannot be known until the server tries to process the request:

Input: The user has input text which could correspond to more than one term in different vocabularies.
Polysemy: A single term is defined by the same vocabulary to have multiple different meanings. (e.g. recycled gene symbols).
Mapping: A term in one vocabulary does not have a one-to-one relationship with a term in another vocabulary.

I would favour treating input ambiguities as the problem of the client (because the client knows more about the user/patient and is in a better position to resolve with UX), mapping ambiguities as the problem of the server (because the server knows more about the data it's mapping to, which might not all be in one vocabulary, and is in a better position to guess) and polysemy ambiguities as effectively part of the mapping problem (because by definition they can't be resolved without mapping anyway).

That is, clinicians could enter HGNC or whatever and the depositing site should be able to determine and if necessary confirm the term/meaning (and convert to their own internal representation if they are sufficiently confident), then send a request as a fully qualified term (i.e. {"authority":"hgnc", "id":"ARID1B"}), then servers should do the right thing, and include in their response the fully qualified term which matched (which is likely to be Ensembl or UCSC, at least for those who store variant data).

I would add the caveats that:

each implementation will support at least one vocabulary
any mapping done into that vocabulary is on a strictly 'best effort' basis and there is no expectation that the mapping will be complete
there should be a distinction made in responses between cases where the term is not recognised and where the terms were understood but for which no data exists.
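A minimal sketch of how a server might report these outcomes, assuming a simple in-memory mapping. The `Outcome` names, the `resolve` function, the response keys and the placeholder gene names are all hypothetical and not part of the MME API. The point is only that the response echoes the term the server actually resolved to, so the client can spot a mismatch.

```python
from enum import Enum


class Outcome(Enum):
    UNRECOGNISED = "unrecognised"  # server has no knowledge of the term
    AMBIGUOUS = "ambiguous"        # term maps to several internal records
    NO_DATA = "no_data"            # term understood, but nothing stored for it
    MATCHED = "matched"            # term understood and data found


def resolve(term: str, mapping: dict, stored: dict) -> dict:
    """mapping: term -> list of internal IDs; stored: internal ID -> list of records."""
    internal_ids = mapping.get(term, [])
    if not internal_ids:
        return {"outcome": Outcome.UNRECOGNISED, "term": term}
    if len(internal_ids) > 1:
        return {"outcome": Outcome.AMBIGUOUS, "term": term, "candidates": internal_ids}
    internal_id = internal_ids[0]
    records = stored.get(internal_id, [])
    outcome = Outcome.MATCHED if records else Outcome.NO_DATA
    # Echo the internal term so the client can check what was actually matched.
    return {"outcome": outcome, "term": term, "resolved_as": internal_id, "records": records}


# Placeholder vocabulary and data, for illustration only.
mapping = {
    "HGNC:FOO123": ["FOO123A", "FOO123B"],
    "HGNC:BAR456": ["BAR456"],
    "HGNC:BAZ789": ["BAZ789"],
}
stored = {"BAR456": ["patient-1"]}

for term in ("HGNC:BAR456", "HGNC:BAZ789", "HGNC:FOO123", "HGNC:QUX000"):
    print(term, resolve(term, mapping, stored)["outcome"].value)
```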

Finally when I said

unambiguous in relation to the information it is returning

I think this needs clarifying. I think this will probably mean Ensembl Gene ID in DECIPHER (because even though people enter an HGNC code, we will ensure that the position matches an Ensembl ID) but could mean that GeneMatcher responds with an HGNC symbol OR Ensembl ID OR Entrez Gene ID, depending on how the data was deposited. I perhaps should have said

in the fully qualified form which most accurately represents the information being returned.

buske commented 9 years ago

I'm terribly sorry if I'm being dense, but I feel like I'm missing why having a single wire format forces anyone to adopt any authority or prevents mapping cross-build. To me, it simplifies everyone's implementation, regardless of internal data formats, and makes correct and valid mappings more likely.

Here's an example:

A -----> B <-----> C    (entities)
    X    Y    S    Z    (vocabulary)

No matter what, to receive data from A, B must map from the vocabulary A uses to provide data (X) to B's internal vocabulary (Y) and resolve any input ambiguity there.

Now, if we place little restriction what vocabularies are supported on the B-C wire, B will directly send the request to C (using vocabulary Y). C then has to convert from Y (B's internal format) to Z (C's internal format), and respond with details about any ambiguities or errors encountered therein.

Now, if we restrict what is on the B-C wire to only support a particular vocabulary S, B must resolve mapping from Y to S before performing the request (if Y and S are different). As S is known and fixed beforehand, this mapping can even be attempted at the point of initial user entry, to report any unresolvable ambiguities or solicit additional information to resolve them. To perform matching, C must only convert from S to Z, and only if they are different. Thus, we've resolved as much of the complexity as possible as early on as possible, and reduced the complexity of mapping ambiguities and errors that must be communicated back across the service (C to B), since both S and Z are fixed and known to C. In essence, API implementers only need to handle the complexity and resolve ambiguities involved in mapping from their internal vocabulary to S and back, rather than trying to support and correctly map between any arbitrary vocabulary and their internal vocabulary (where they are much more likely to miss the gotchas of any particular vocabulary)

I expect that I've missed some crucial detail here, so apologies in advance.

Also, your point about clarifying in each response whether terms were recognized and not matched, or unrecognized and ignored, is a good one. We have a general issue to that effect here: https://github.com/MatchmakerExchange/mme-apis/issues/84

pdl commented 9 years ago

Thanks for pointing me to that issue.

I agree that it is very handy to be able to say "you only need to map to S, and only if it is different", so long as we can find an S that can fill that role. But "you may only use S" is a problem if the mapping Y-S-Z is worse than Y-Z (which to me seems likely).

The biggest problem I can see is that if there is a gene in S FOO123 which is represented as two genes FOO123A and FOO123B in both Y and Z, when B maps FOO123B to FOO123, C may map it to FOO123A and fail to return its (relevant) results for FOO123B but return some (irrelevant) results for FOO123A. If we prohibit use of any vocabulary but S then it is impossible for B to get the right results from C, even though B is capable of sharing the information in a format which is unambiguous to C.

Similarly, if there is a gene which has no mapping in S, then even if it is present in both Y and Z, B cannot use MME to query C for this data at all.

There is a reverse problem in that if there are two genes in S, BAR456A and BAR456B which Y represents as BAR456, when B maps to S, which should it pick? I'm not sure how practical it will be to get users to resolve ambiguities mapping to both Y and S.
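The FOO123 case can be made concrete with a toy sketch. The lookup tables are invented purely to mirror the example, and the function names `via_s` and `direct` are hypothetical.

```python
# Y and Z both distinguish FOO123A and FOO123B; S only knows FOO123.
Y_TO_S = {"FOO123A": "FOO123", "FOO123B": "FOO123"}
# C has to pick one Z term for FOO123; in this sketch it picks FOO123A.
S_TO_Z = {"FOO123": "FOO123A"}


def via_s(term_y: str) -> str:
    """Map B's term to S for the wire, then to C's vocabulary Z."""
    return S_TO_Z[Y_TO_S[term_y]]


def direct(term_y: str) -> str:
    """Send B's term as-is; Y and Z share identifiers in this example."""
    return term_y


print(via_s("FOO123B"))   # FOO123A: the distinction is lost on the wire
print(direct("FOO123B"))  # FOO123B
```

Once B has collapsed FOO123B to FOO123, C cannot recover which of the two genes was meant.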

There seem to be three options:

Only one vocabulary is permitted
There is one vocabulary which all servers MUST respond to, but they MAY accept others
There is no restriction on vocabulary

Of these, I think you are suggesting the first. I would suggest the second or third, depending on whether there exists a vocabulary which is sufficiently stable, well-defined, and free from polysemy such that a developer can verify that an implementation is compliant and have confidence that it will remain compliant.

NB: Unless we go with the first, I would recommend a method by which a server responds with its list of supported vocabularies in preference order. This would allow clients to pick the mapping most likely to succeed.

Relequestual commented 9 years ago

Compound this by the fact that some vocabularies update "whenever" while some have a release cycle of 2-3 months. "Whenever" would require people to update nightly, while a longer release cycle will obviously not contain the "latest and greatest" new gene names for some time.

I'm not sure we would be able to mandate one approach over another, because each system has different requirements, and chooses one of those options based on those requirements.

fschiettecatte commented 9 years ago

Reviewing this prior to the call this morning, I think it is the best option from @pdl:

There is one vocabulary which all servers MUST respond to, but they MAY accept others.

However I am not keen on adding a method to request vocabularies, the client should provide all the IDs it can and the server can then choose which one to use.
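One way to picture this: the client sends every ID it has, and the server walks its own preference list. This is a sketch only. The function name `choose_id`, the prefix names and the preference order are assumptions, not agreed parts of the protocol.

```python
from typing import Optional


def choose_id(curies: list[str], preferred_prefixes: list[str]) -> Optional[tuple[str, str]]:
    """Return (prefix, local_id) for the first server-preferred vocabulary the client supplied."""
    by_prefix: dict[str, str] = {}
    for curie in curies:
        prefix, _, local_id = curie.partition(":")
        # Keep the first ID seen for each vocabulary.
        by_prefix.setdefault(prefix, local_id)
    for prefix in preferred_prefixes:
        if prefix in by_prefix:
            return prefix, by_prefix[prefix]
    return None


client_ids = ["Ensembl:ENSG00000165164", "HGNC:26708"]
print(choose_id(client_ids, ["HGNC", "Ensembl"]))
print(choose_id(client_ids, ["NCBIGene"]))
```

A `None` result corresponds to the case where the client supplied no ID in a vocabulary the server understands.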

Relequestual commented 9 years ago

On technical call (now), agreed that Ensembl gene ID would be the mandated MUST support vocabulary.

fschiettecatte commented 9 years ago

Here is a sample for a CURIEd Gene ID. The assembly must be included for this option as Ensembl gene IDs are assembly sensitive:

"gene" : {
    "id"        : "Ensembl:<Ensembl gene ID>",
    "assembly"  : "NCBI36"|"GRCh37.p13"|"GRCh38.p1"|…
},

This is a sample for a more 'relaxed' CURIEd Gene ID. The Ensembl gene ID is required and the additional IDs can be provided, along with assembly:

"genes" : [
    {
        "id"        : "Ensembl:<Ensembl gene ID>",
        "assembly"  : "NCBI36"|"GRCh37.p13"|"GRCh38.p1"|…
    },
    {
        "id"        : "Entrez:<Entrez gene ID>",
        "assembly"  : "NCBI36"|"GRCh37.p13"|"GRCh38.p1"|…
    },
    {
        "id"        : "HGNC:<HGNC gene ID>"
    },
    …
],

CURIEs:

Ensembl:
Entrez:
HGNC:
Symbol:

These could (should?) be shortened.

Relequestual commented 9 years ago

I don't think they need to be sorted. We should define these CURIEs in the repo. I suggest we follow @cmungall 's approach of defining a jsonld context file like https://github.com/cmungall/biocontext/blob/master/registry/monarch_context.jsonld even if it's just for reference at current.

Relequestual commented 9 years ago

Suggest moving the assembly attribute to the genomic_features level. There would be no reason to send / receive a genomic_feature object containing genes from different assemblies!

Moving it to the level above variants would cover any genes. If multiple assembly data is held (as it soon will be), I think the assumption was to return whatever assembly was given; however, multiple may be given! So when multiple assembly data is held and given in a request, I would expect a new genomic_feature object for each different assembly.
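A rough sketch of what pulling the assembly up a level could look like when building a payload. The key names (`assembly`, `genes`, `id`) and the function name `group_by_assembly` are illustrative guesses, not the agreed schema, and pairing these particular IDs with these assemblies is arbitrary.

```python
from collections import defaultdict


def group_by_assembly(genes: list[dict]) -> list[dict]:
    """Build one genomic-feature-like object per assembly, holding only that assembly's genes."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for gene in genes:
        grouped[gene["assembly"]].append(gene["id"])
    return [
        {"assembly": assembly, "genes": [{"id": gene_id} for gene_id in ids]}
        for assembly, ids in grouped.items()
    ]


genes = [
    {"id": "Ensembl:ENSG00000165164", "assembly": "GRCh37.p13"},
    {"id": "Ensembl:ENSG00000173610", "assembly": "GRCh37.p13"},
    {"id": "Ensembl:ENSG00000165164", "assembly": "GRCh38.p1"},
]

for feature in group_by_assembly(genes):
    print(feature["assembly"], [g["id"] for g in feature["genes"]])
```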

fschiettecatte commented 9 years ago

+1 on pulling the assembly up a level.

fschiettecatte commented 9 years ago

+1 on using @cmungall's definitions. A couple of points: we should rename 'Entrez' to 'NCBIGene', and in his nomenclature 'HGNC' refers to the ID, not the gene name. I did not find anything in his list that referred to gene symbol.

Relequestual commented 9 years ago

I wasn't suggesting we should use his list (as I noted that not everything we might want is there, like HGNC). I guess in theory we should actually be using HGNC IDs as opposed to HGNC Gene Names. But as we're moving to Ensembl for our gene IDs, it's a breaking change regardless.

fschiettecatte commented 9 years ago

Sorry, still early here. I think his nomenclature is fine, I like standardization, he has obviously put in a lot of work into it and I am lazy :) So I am ok with it.

Thinking about the HGNC IDs vs. HGNC Gene Names, makes sense. Gene Names are good for humans, but I prefer IDs in the protocol.

Relequestual commented 9 years ago

While I wasn't suggesting it, I do think it's a good idea! I think there's a case to put this idea forward to have an official GA4GH directory of CURIEs, which everyone can use and agree on. @cmungall seems uniquely suited for this. Thoughts?

fschiettecatte commented 7 years ago

Following the Baltimore MME meeting, the use of CURIEs was deferred to 2.0.

For 1.1 the use of Ensembl Gene IDs is strongly encouraged.

For 2.0 the use of Ensembl Gene IDs will be mandatory.

ga4gh / mme-apis

Should we use a single gene ID or use CURIEs to support multiple gene IDs? (Does anyone need entrez id for genes?) #113