Separate genes from proteins?

cbizon commented 6 years ago

Currently we don't differentiate between genes and proteins; we just lump them all under gene. However, most GO annotation is at the protein level. When we have a GO annotation and we ask QuickGO for related things, we get back a bunch of protein (UniProt) ids. The hope was that synonymizing those would just flip them all back to HGNC and everything would be cleanly handled behind the curtain.

But QuickGO, for instance, returns many different UniProt ids corresponding to different protein isoforms of the gene, and only some of those resolve to HGNC ids. A lot of the alternate isoforms have longer identifiers (like A0A0(regular id)) and these don't ever convert to HGNC.

The end result is a bunch of goofy Uniprot nodes that are masquerading as genes and just hanging around for no particular reason. If these were handled more nicely as proteins at least it would be clear in the graph what is going on.

cbizon commented 6 years ago

Note that it looks like biolink gene->function will now handle gene identifiers seamlessly.

cbizon commented 5 years ago

Related to #248, #393, #359. I have been thinking about this one because of several related issues, listed above. Just for clarity, and in case anybody wants to chime in, I'm going to dump some thoughts here.

What really happens biologically is that there is a section of DNA that gets transcribed into RNA, modified by splicing (in different ways, even for the same section of DNA), and then these transcripts get turned into proteins via translation. In turn, the protein can be modified post-translation. Sometimes, the protein is never created, and just the RNA has a function (or doesn't). So you can generate all different kinds of proteins (and stuff) off of the same section of DNA.

A gene is defined, frankly, somewhat arbitrarily as a collection of these transcripts that all kind of go together, but note that it's complicated because the same gene can have transcripts that start and stop transcription at different (but nearby) places. See https://genome.cshlp.org/content/genome/17/6/682.full.html for a consideration of what gene means these days.

So we tend to think about a single gene (which again, is kind of an abstract concept) being capable of producing multiple physical chemical entities, some of which are proteins. And it's these "gene products" that actually do stuff.

Now, what makes implementing this understanding difficult is that knowledge sources are inconsistent and (sometimes) not careful about this distinction. There are times when knowledge exists at a protein level, and times when it exists at a gene level, but different knowledge sources may mix them up, or consider them 'the same thing', making it somewhat difficult to know what is actually being asserted.

So the question is, given the uncertainty on what is being asserted, how should we encode this area? There are two basic ways: (gene)-[:has_product]->(gene_product, chemical_substance) or (gene, gene_product, chemical_substance).

In the first, we differentiate between genes and their products (which are also chemicals). In the second, we unify genes and gene_products, and therefore chemicals as well (for some chemicals). Our current schema is neither, but (gene,gene_product), (chemical_substance). That is, we don't let genes be chemicals. That makes the connections between e.g. INS (the insulin gene), human insulin (the chemical), and INSR difficult to code. We have: (INS, gene and gene_product), (insulin, chemical_substance) in that our INS node contains both HGNC and UniProt id's, but it's bad because one of those id's should be the same as insulin the chemical substance.

If we go with (INS, gene)-[:has_product]-(insulin, chemical_substance, gene_product) then we make querying more difficult, because you need to know whether knowledge is tied to the gene or the product. For instance, if you are looking for interactions, should you look at gene-gene interactions or protein-protein interactions? What about relationships to disease? What about binding? We think about a chemical binding to a gene, but any measurement is actually made in an assay based on a particular protein.

If we go with (INS&insulin, gene, gene_product, chemical) then we'll be unifying at the gene level every protein that can be made from one gene. There are cases where we know information about one or the other of these proteins that is true just of that protein, and unifying them all together will be incorrect. This is trivially true of structural information about the protein, but there are also differences in the clinical importance of given proteins, binding, etc.

The upshot is that the more careful version allows more complex and correct statements, but that complexity is felt on the query side.

cbizon commented 5 years ago

There are some wrinkles here, mostly around UniProt. UniProt is a collection of protein (product) identifiers. It's made up of 2 parts: Swiss-Prot, which is hand curated, and TrEMBL, which is not. Within Swiss-Prot, the main identifier is supposed to be unique at the gene level. That is, each gene should have a single Swiss-Prot protein id. Isoforms are sub-ids (so if the Swiss-Prot id was Q12345, you would have Q12345-1, Q12345-2, etc.).
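As a rough illustration of the id structure described above, the sketch below splits a UniProt id into its base accession and optional isoform suffix. The regular expression follows the accession format published in the UniProt help pages; the function name `split_accession` and the handling of a `UniProtKB:` prefix are assumptions made for this example. Note that the pattern cannot tell Swiss-Prot from TrEMBL; that needs the reviewed status from UniProt itself.

```python
import re

# Accession pattern as documented by UniProt, plus an optional "-N" isoform suffix.
_ACCESSION = re.compile(
    r"(?P<base>[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"(?:-(?P<isoform>[0-9]+))?"
)


def split_accession(identifier):
    """Return (base_accession, isoform_number or None) for a UniProt id.

    Accepts an optional CURIE-style prefix such as "UniProtKB:" (an
    assumption of this sketch). Raises ValueError if the remainder does
    not look like a UniProt accession.
    """
    local_id = identifier.split(":", 1)[-1]
    match = _ACCESSION.fullmatch(local_id)
    if match is None:
        raise ValueError(f"not a UniProt accession: {identifier!r}")
    isoform = match.group("isoform")
    return match.group("base"), int(isoform) if isoform else None
```

With something like this, `Q12345-2` rolls up to `Q12345`, so isoform-level ids can be folded back onto the entry-level id before any attempt to synonymize with HGNC.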

So what does a UniProt id represent? Well, if it's a Swiss-Prot id, then it should be a gene-level annotation, even though it's expressly meant to designate a protein. BUT! There are a number of cases where two or more Swiss-Prot ids map to the same HGNC id. AKAP7 is one example (O43687 and Q9P0M2). And when GO annotates these, it gives different annotations to these two UniProt ids (same gene though).

It's not clear to me that the distinction between the two at the level of GO is really intentional or meaningful though.

So because Uniprots almost map to genes, and because the UniProt id doesn't map to an individual sequence, we might go with putting gene/gene-product/chemical all in the same node. But it isn't really right, and there are cases where it's not right at all. Conceptually it's all wrong, even if it's somewhat practical.

cbizon commented 5 years ago

One of the main uses of UniProts is in GO annotations. It's a bit confusing because there are a few sources. There's the main GO page (amigo) but there's not a service there AFAIK. It only uses a single Swiss-Prot id per gene (which I would argue makes using proteins instead of genes fairly pointless, but there you go). Then there's QuickGO, which seems broader, as it includes annotations of TrEMBL entries and some of the multiple Swiss-Prot ids. However, with the right set of filters on a query you can make it reproduce the amigo version:

https://www.ebi.ac.uk/QuickGO/services/annotation/search?proteome=gcrpCan&geneProductType=protein&geneProductSubset=Swiss-Prot&goId=GO%3A1902261&taxonId=9606
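For reference, the same filtered query can be assembled programmatically. This is a minimal sketch that only builds the URL shown above from its parameters and makes no request; the function name `quickgo_swissprot_url` is invented for this example, and the parameter names are simply copied from that URL.

```python
from urllib.parse import urlencode

QUICKGO_SEARCH = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"


def quickgo_swissprot_url(go_id, taxon_id=9606):
    """Build the filtered QuickGO annotation search URL for one GO term."""
    # Parameter names and values mirror the URL quoted in this thread.
    params = {
        "proteome": "gcrpCan",
        "geneProductType": "protein",
        "geneProductSubset": "Swiss-Prot",
        "goId": go_id,
        "taxonId": taxon_id,
    }
    return f"{QUICKGO_SEARCH}?{urlencode(params)}"


print(quickgo_swissprot_url("GO:1902261"))
```

Keeping the filters in one place makes it easy to drop `geneProductSubset` later if TrEMBL entries are wanted.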

And then there's biolink api, which seems like it wraps the amigo version. It only understands the reduced uniprot set. The gene->function returns all go terms (even cellular components). And AFAICT, there isn't a function for going from component->gene.

cbizon commented 5 years ago

Got an email back from UniProt about Swiss-Prot IDs. Basically, the rule that one gene == one UniProt is mostly true, but there are a few cases where they think the sequences are so different that it no longer makes sense. It didn't sound like there was a clear dividing line. I'm not sure that downstream users (e.g. GO annotations) respect this distinction, though it is represented in HGNC's stuff correctly.

cbizon commented 5 years ago

One more issue relates to the IUPHAR peptides. IUPHAR has a list of peptides, along with links to the gene that makes them and the things that they have activity against. Some of those peptides have SMILES, so they get synonymized with other chemicals by UniChem.

Some have no identification other than a name, and we can link those to whatever we want by hand.

There's a bunch that have 1) An amino acid sequence and 2) A UniProt ID. But here's the fun part: the AA sequence is not any of the sequences that make up the UniProt (at least not always). It's usually or often a piece of the protein sequence. So the protein I guess gets chopped up further to make the peptide. I think in this case, you'd want the IUPHAR peptide to still be a product of the gene, but separate from the UniProt but maybe with an association to the uniprot of derives_into or something like that.
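To make the peptide-versus-protein relationship concrete, here is a small sketch that checks whether a peptide sequence is an exact piece of a longer protein sequence and picks a relation accordingly. The function names and the `same_as` / `unrelated` labels are hypothetical; `derives_into` is borrowed from the suggestion above. It only handles exact substring matches, so modified residues would need more care.

```python
def locate_fragment(peptide, protein):
    """Return the 1-based (start, end) of peptide within protein, or None."""
    if not peptide:
        return None
    index = protein.find(peptide)
    if index < 0:
        return None
    return index + 1, index + len(peptide)


def protein_to_peptide_relation(protein, peptide):
    """Pick a (hypothetical) relation from a UniProt sequence to a peptide."""
    if peptide == protein:
        return "same_as"
    if locate_fragment(peptide, protein) is not None:
        # The peptide is a piece of the protein: keep them as separate nodes.
        return "derives_into"
    return "unrelated"


# Made-up sequences, for illustration only.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
peptide = "AKQRQ"
print(locate_fragment(peptide, protein))
print(protein_to_peptide_relation(protein, peptide))
```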

cbizon commented 5 years ago

The final upshot here is that even though it is somewhat rare that we care, and even though many or most services elide it, there is a difference between genes and gene products. If we lump them together, we will at some point create errors. The complexity we create is probably tolerable on the query side because the schema will generally keep specific information at either a gene or protein level. For example, functions sit at the protein level and diseases at the gene level.

Because it is the main source of products, this will mostly be based on UniProt, and we'll focus on Swiss-Prot for initial loading, and for functional annotation.

With this structure we can represent IUPHAR peptides as independent entities where appropriate, and in the future, if we want nodes for particular isoforms from uniprot, entries from trembl, etc, we have the framework to include them.

cbizon commented 5 years ago

Steps:

[x] Add gene_product type to builder
[x] Fix cache synonymizer & regular synonymizer to treat genes and gene products independently and let gene products synonymize with chemicals
[ ] Update gene synonymizer tests & add gene product synonymizer tests
[ ] gene product annotator
[x] Update writer/builder to be more expressive with node labels (i.e. every gene product should also have the chemical label, every disease should also have the disease or phenotype label) This will also take care of #109 and #272
[x] Load synonymizer with swiss-prot and iuphar for gene products
[ ] Modify clients/routing to move some services from gene to gene product
[ ] Make gene -> gene product services from iuphar, uniprot, hgnc (need all? Some?)
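The "more expressive node labels" step in the list above could look roughly like the following. This is a sketch under assumptions: the label strings and the `IMPLIED_LABELS` table are hypothetical, not the builder's actual type names.

```python
# Hypothetical type names; each type maps to the broader labels it implies.
IMPLIED_LABELS = {
    "gene_product": ["chemical_substance"],
    "disease": ["disease_or_phenotypic_feature"],
}


def expand_labels(node_type):
    """Return node_type plus every label it implies, most specific first."""
    labels = []
    pending = [node_type]
    while pending:
        label = pending.pop(0)
        if label not in labels:
            labels.append(label)
            # Follow implications transitively in case a broader label
            # itself implies something else.
            pending.extend(IMPLIED_LABELS.get(label, []))
    return labels


print(expand_labels("gene_product"))
print(expand_labels("gene"))
```

Driving the labels from one table means the writer never has to special-case individual node types.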

cbizon commented 5 years ago

UniProt contains a bunch of other identifiers for proteins. It does, as noted above, mix genes/products, so here are some notes about what's there and what we think is valid to merge:

Allergome: Families, skip for now
BioCyc: Links as enzyme, skip for now
BioGrid: Get
BioMuta: Mutation db, skip
CCDS: isoform dependent, skip
ChEMBL: ChEMBL target id. Get. (different prefix?)
chitars: Some gene thing, skip
ComplexPortal: Complexes, skip
CPTAC: protein/peptide assays. Probably good, but the mapping is not perfect, so skip for now
CRC64: ? skip
DIP: Protein interactions, get
DisProt: Protein disorder, skip
DMDM: Mutations, gene level? Skip
DNASU: Plasmid repo, skip
DrugBank: Links to chemical interactors, not a synonymous id. Skip
eggNOG: phylogenetics, skip
EMBL: isoform level, skip
EMBL-CDS: isoform level, skip
Ensembl: isoform level, skip
Ensembl_PRO: isoform level, skip
Ensembl_TRS: isoform level, skip
ESTHER: protein family level, skip
GeneCards, GeneDB, GeneID, Gene_Name, Gene_ORFName, GeneReviews, Gene_Synonym: Gene level, skip
GeneTree, GeneWiki, GenomeRNAi: Gene level, skip
GI: isoform level, skip
GlyConnect: potentially interesting, check uniqueness
GuidetoPHARMACOLOGY: At least some go to families, skip
HGNC: Gene level, skip
HOGENOM: phylogenomic, skip
HPA: Gene level, skip
KEGG: This one is important! I think we get it, and make sure we don't put it into gene (KEGG doesn't distinguish, yay!)
KO: phylo, skip
MEROPS: gene families, skip
MIM: models, skip
MINT: PPI, get
NCBI_TaxID: taxon, skip
neXtProt: Get
OMA: phylo, skip
Orphanet: disease links, skip
OrthoDB: phylo, skip
PATRIC: rare, skip
PDB: isoform/complex level, skip
PeroxiBase: families?
PharmGKB: Gene level, skip
ProteomicsDB: Isoform level, skip
Reactome: Link to pathways, skip
REBASE: families, skip
RefSeq: skip
RefSeq_NT: skip
STRING: Get
SwissLipids: Lipid reactions, but they're just coming from RHEA, skip
TCDB: Family, skip
TreeFam: Family, skip
UCSC: isoform level, skip
UniParc: isoform level, skip
UniPathway: Dead, skip
UniProtKB-ID: get
UniRef100: Get
UniRef50: Get
UniRef90: Get

cbizon commented 5 years ago

For synonymization of proteins it would be very convenient if the identifiers we liked were 1:1 with UniProt Ids. There are two ways that this can fail. The identifiers can be 1:N or N:1 with uniprot (or N:M, but that's implicit).

Identifiers that are 1:1: MINT, neXtProt, STRING, UniProtKB-ID, PRO (not listed, but uses UniProt ids)

One UniProt ID, multiple other IDs: BioGrid, ChEMBL, DIP, GlyConnect, KEGG

One other ID, multiple UniProt IDs: BioGrid, ChEMBL, GlyConnect, KEGG, UniRef100, UniRef90, UniRef50

cbizon commented 5 years ago

And here's one really interesting thing that can happen, and that really justifies UniProt as a protein rather than gene database. Different genes can produce the same protein. Maybe there are multiple copies of a gene, and they've all been given different ids, but they all produce the same amino acid sequence. For example, UniProt has A1L429, which maps to four different HGNC identifiers (GAGE12B, GAGE12C, GAGE12D, GAGE12E).

KEGG, for instance, has different identifiers, matching the genes, but they all have the same amino acid sequence, so they are all mapped to the same UniProt ID. There are 216 such things in KEGG.
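The 1:1 / 1:N / N:1 checks discussed in this thread can be sketched as a small helper over pairs of ids. The function name `mapping_cardinality` is made up for this example; the AKAP7 and GAGE12 pairs are the ones quoted in this thread, written here as gene symbol to UniProt accession.

```python
from collections import defaultdict


def mapping_cardinality(pairs):
    """Classify (left_id, right_id) pairs as '1:1', '1:N', 'N:1' or 'N:M'.

    '1:N' means some left id maps to several right ids; 'N:1' means some
    right id is shared by several left ids.
    """
    right_by_left = defaultdict(set)
    left_by_right = defaultdict(set)
    for left, right in pairs:
        right_by_left[left].add(right)
        left_by_right[right].add(left)
    one_to_many = any(len(rights) > 1 for rights in right_by_left.values())
    many_to_one = any(len(lefts) > 1 for lefts in left_by_right.values())
    if one_to_many and many_to_one:
        return "N:M"
    if one_to_many:
        return "1:N"
    if many_to_one:
        return "N:1"
    return "1:1"


# Gene symbol -> UniProt accession pairs taken from the examples in this thread.
akap7 = [("AKAP7", "O43687"), ("AKAP7", "Q9P0M2")]
gage12 = [(gene, "A1L429") for gene in ("GAGE12B", "GAGE12C", "GAGE12D", "GAGE12E")]

print(mapping_cardinality(akap7))           # 1:N
print(mapping_cardinality(gage12))          # N:1
print(mapping_cardinality(akap7 + gage12))  # N:M
```

Running a check like this per identifier source is one way to decide which sources are safe to use for synonymization.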

Note that KEGG id's are the hsa id's not the compound ids. So really, I don't think that we're interested in mapping to them anyway, but this will likely persist for other identifier systems.

cbizon commented 5 years ago

BioGrid is similar to KEGG, in that its identifiers appear to be gene level, rather than protein level (which, as an aside, seems kind of baffling for a PPI db but whatever). There's also an additional wrinkle, which is that the UniProt mappings seem odd:

P0C0L4 is C4A, and P0C0L5 is C4B (complement 4a and 4b). They're different genes, and they have different HGNC identifiers, and different BioGrid entries (107181 and 107182). BioGrid appears to link to the correct UniProt ids from its side, but for some reason, UniProt links to both 107181 and 107182 from P0C0L4, so that now 107182 is linked to both of these UniProt ids from the UniProt side.

Upshot: No BioGrid, at least using uniprot's mappings. And when we do bring it in, it will be at gene-level.

cbizon commented 5 years ago

ChEMBL ids are for targets, but the association here is not an equivalence, but a part-of relation. That is, if the UniProt id is a component of a complex, and the complex is a ChEMBL target, then there will be a link. Not a good value for synonymization.

cbizon commented 5 years ago

Looking at GlyConnect, it's pretty clear that 1) I don't know exactly what the id's represent, but 2) they're not 1:1 with uniprot ids. Drop.

cbizon commented 5 years ago

DIP is 1:N, that is UniProt IDs can map to more than one DIP. It looks like what happens is that DIP is behind UniProt in terms of identifiers. As IDs get merged in UniProt, each overall uniprot entry can have multiple accessions, though only one is primary. If you use a secondary one in the browser, it just maps to the first one. But there are entries in DIP that include both the old and new identifiers. From uniprot's side, that means that the DIP entries are both mapping back to the same primary uniprot.
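One way to see the effect described above is to resolve every accession to its primary before grouping cross-references. This is a minimal sketch; the function names, the accessions, and the DIP-style ids are all placeholders invented for illustration, and the secondary-to-primary table would have to come from UniProt.

```python
def resolve_to_primary(accession, secondary_to_primary):
    """Map a possibly retired UniProt accession to its primary accession."""
    return secondary_to_primary.get(accession, accession)


def group_by_primary(xrefs, secondary_to_primary):
    """Group external ids by the primary UniProt accession they point at.

    xrefs is an iterable of (external_id, uniprot_accession) pairs.
    """
    grouped = {}
    for external_id, accession in xrefs:
        primary = resolve_to_primary(accession, secondary_to_primary)
        grouped.setdefault(primary, []).append(external_id)
    return grouped


# Placeholder data: "Q00002" stands for an old accession merged into "P00001".
secondary_to_primary = {"Q00002": "P00001"}
xrefs = [("DIP-A", "P00001"), ("DIP-B", "Q00002")]
print(group_by_primary(xrefs, secondary_to_primary))
```

Any primary accession that ends up with more than one external id is a candidate for the stale-identifier situation.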

I think for now we don't worry about DIP either, since we don't really know if we're ever going to want that database....

cbizon commented 5 years ago

The UniRefs are sequence clusters, so of course the id's aren't 1:1... Don't use them.

cbizon commented 5 years ago

One really unpleasant part of this is that different databases use genes and proteins for the same types of things. So for instance,

CTD has chemical/gene interactions
DrugBank has chemical/protein interactions

If we follow the sources, then we are really complicating the query because I don't know whether to look for (disease)-(gene)-(chemical) or (disease)-(gene)-(geneproduct)-(chemical) and I probably have to look for both. So it almost certainly needs to be handled in the rank/query.

cbizon commented 5 years ago

The fact that genes and proteins are treated so cavalierly by data sources has led me to rethink this a small amount. There are two things that really make this tricky.

Some knowledge sources (KS) code interactions as chem-gene and some as chem-protein.
UniProt ids are really at the gene level. There are sub-ids (PROs and -1, -2) that talk about particular sequences that you could argue are protein level.

If I look at the chemical concordance, the only entities that get both a UniProt and a ChEBI/ChEMBL/PubChem id are things that have a UniProtKB#PRO identifier, i.e. that talk about an actual sequence, like a peptide.

So UniProts much more naturally go with genes than with chems. It's funky because UniProts also line up with PR identifiers (which is a protein ontology!) but they are really genes.

So if we keep uniprots bundled with hgncs, do we call the resulting things genes, gene products, or gene_or_gene_products? The last is probably the most correct, but there's a lot of stuff that already calls that entity a gene, and it's not worth changing if we're not making any radical changes to the way these are encoded.

So for now

NCATS-Gamma / robokop

Separate genes from proteins? #77