Do we lose some knowledge sources that might connect ~54K genes to other bioentities?

chunyuma commented 3 years ago

Hi KG2team (@saramsey, @kvarforl, @ericawood) again,

I just found one more potential issue for KG2 (Sorry for reporting many KG2 issues recently). I discussed with @dkoslicki and decided to open this issue to report it to you.

I investigated this in KG2.5.2c, but it should also apply to KG2.5.2. I found that we have ~54K biolink:Gene nodes that are connected only to the 'Homo sapiens' CURIE (CHEMBL.TARGET:CHEMBL372) and to no other nodes.

The reason I did this investigation is that, after I excluded some node types in KG2c (please see below) to simplify the KG and get better explanations for the DTD model, I found only ~4K biolink:Gene CURIEs from HGNC, NCBIGene and Ensembl that also have gene sequence information. This is despite KG2 having plenty of genes with sequence information from these three sources (e.g., 'NCBIGene': 59060, 'ENSEMBL': 67929, 'HGNC': 40311).

In the figure below, the node types with black color are the ones that I excluded.

So then I investigated why there are so many genes lost after I excluded these node types. I suspect that they might be only connected to the node types that I excluded.

Here is my investigation result:

As you can see, all these lost genes (53734) are connected to biolink:OrganismTaxon, and within biolink:OrganismTaxon (please see below), almost all of them are only connected to Homo sapiens. This means that if this Homo sapiens CURIE is excluded, then all these genes become isolated nodes.

@dkoslicki thinks that perhaps some knowledge sources got dropped somewhere that would connect those genes to other bioentities? Could you please help us take a look at this issue? Thank you so much!

saramsey commented 3 years ago

Hi @chunyuma, thank you for bringing this to my attention. Can you please give me an example of a single gene that you feel may have had an associated edge dropped? Just one gene is sufficient for me to get started.

saramsey commented 3 years ago

Also, can you please paste in here the Cypher query that you used to obtain your list of 54k genes?

saramsey commented 3 years ago

This Cypher query

match (n:`biolink:Gene`)-[:`biolink:in_taxon`]->(m) with n, size((n)-[]->()) as degree where degree=1 return count(*)

run on kg2canonicalized2.rtx.ai gives me 56,993 nodes. I wonder if that is not the KG2.5.2c server.

saramsey commented 3 years ago

Ah, maybe this is the Cypher query that you ran:

match (n:`biolink:Gene`)-[:`biolink:in_taxon`]->(m {name: 'Homo sapiens'}) with n, size((n)-[]->()) as degree where degree=1 return count(*)

It gives 53,976 results. Is that the query that you ran?

Also, @chunyuma, for future KG2/KG2c related bug reports, please state exactly which Neo4j endpoint you are using (i.e., endpoint hostname). Thanks.

saramsey commented 3 years ago

Even when I restate the above cypher query more precisely as

match (n:`biolink:Gene`)-[:`biolink:in_taxon`]->(m {id: 'CHEMBL.TARGET:CHEMBL372'}) with n, size((n)-[]->()) as degree where degree=1 return count(*)

and when I run it on kg2c-5-2.rtx.ai, I still get 53,976 nodes, not 53,734 as reported above. I think it would be helpful to see the original Cypher query that resulted in 53,734 nodes.
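For readers without access to the endpoint, the logic of the queries above can be sketched in plain Python on a toy edge list. This is only an illustration under stated assumptions: every node ID except CHEMBL.TARGET:CHEMBL372 is made up, and the helper name `taxon_only_genes` is hypothetical. Like `size((n)-[]->())` in the Cypher, it looks at outgoing edges only, so a node with additional incoming edges would still be reported.

```python
from collections import defaultdict

HUMAN = "CHEMBL.TARGET:CHEMBL372"  # 'Homo sapiens' CURIE quoted in this thread

# Toy directed edges (subject, predicate, object); hypothetical data.
edges = [
    ("GENE:A", "biolink:in_taxon", HUMAN),
    ("GENE:B", "biolink:in_taxon", HUMAN),
    ("GENE:B", "biolink:participates_in", "PATHWAY:1"),
    ("GENE:C", "biolink:in_taxon", "TAXON:mouse"),
]


def taxon_only_genes(edges, taxon_id):
    """Return subjects whose single outgoing edge is in_taxon -> taxon_id."""
    out_edges = defaultdict(list)
    for subj, pred, obj in edges:
        out_edges[subj].append((pred, obj))
    return sorted(
        subj
        for subj, outs in out_edges.items()
        if len(outs) == 1 and outs[0] == ("biolink:in_taxon", taxon_id)
    )


result = taxon_only_genes(edges, HUMAN)
print(result)
```

Here GENE:B is excluded because it has a second outgoing edge, and GENE:C because its taxon is not the human node.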

dkoslicki commented 3 years ago

In any case, that number of degree 1 genes, and then only connected to the "human" node seems concerning. Do we really know about pathways involving ~4k genes? I had a feeling that we used to be more gene-centric, so was wondering if a knowledge source got dropped that connected up these degree 1 genes.

But I'll let @chunyuma provide the details about his cypher command, endpoint, and examples

chunyuma commented 3 years ago

Hi @saramsey,

> Also, @chunyuma, for future KG2/KG2c related bug reports, please state exactly which Neo4j endpoint you are using (i.e., endpoint hostname). Thanks.

I'm sorry that I didn't state earlier which Neo4j endpoints I'm using. I'm using kg2endpoint-kg2-5-2.rtx.ai for KG2 and kg2c-5-2.rtx.ai for KG2c.

> Also, can you please paste in here the Cypher query that you used to obtain your list of 54k genes?

As for the Cypher query I used, I actually used both a Cypher query and Python code to get those values. Here is the Python code with the Cypher query:

check_types = ['biolink:OrganismTaxon','biolink:IndividualOrganism','biolink:NamedThing','biolink:MolecularEntity','biolink:AnatomicalEntity','biolink:InformationContentEntity','biolink:GenomicEntity','biolink:ClinicalIntervention','biolink:PhysicalEntity','biolink:Device','biolink:BiologicalProcess','biolink:Phenomenon','biolink:Activity','biolink:QuantityValue','biolink:GeographicLocation','biolink:RelationshipType','biolink:Agent','biolink:MaterialSample','biolink:BiologicalEntity','biolink:PopulationOfIndividualOrganisms','biolink:ClinicalModifier','biolink:OntologyClass','biolink:EnvironmentalFeature','biolink:PhenotypicQuality','biolink:EnvironmentalProcess','biolink:ExposureEvent','biolink:LifeStage','biolink:BiologicalProcessOrActivity','biolink:OrganismalEntity','biolink:FrequencyValue']
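# `conn` and `gene_id_list` are not defined in this snippet;
# gene_id_list['0'] presumably holds the CURIEs of the genes under investigation.
node_type_list = []
for node_type in check_types:
    # count1: distinct nodes of this category adjacent to any gene in the list
    count1 = int(conn.query(f"match (n)-[]-(m) where n.category='{node_type}' and m.id in {list(gene_id_list['0'])} return count(distinct n.id)")[0])
    # count2: distinct genes in the list adjacent to any node of this category
    count2 = int(conn.query(f"match (n)-[]-(m) where n.category='{node_type}' and m.id in {list(gene_id_list['0'])} return count(distinct m.id)")[0])
    node_type_list.append((node_type,count1,count2))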
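As an aside, the same counts could be requested with Cypher parameters instead of pasting the gene list into the query string. The sketch below only builds the query text and a parameter dictionary. The function name `build_count_query` is hypothetical, and whether the `conn.query` wrapper used above accepts parameters is an assumption I have not checked (the official Neo4j Python driver does accept a parameter dictionary in `Session.run`).

```python
def build_count_query(node_type, gene_ids):
    """Return (cypher, params) for counting neighbors of one category.

    n_count    : distinct nodes of the category adjacent to any listed gene
    gene_count : distinct listed genes adjacent to any node of the category
    """
    cypher = (
        "MATCH (n)-[]-(m) "
        "WHERE n.category = $category AND m.id IN $gene_ids "
        "RETURN count(DISTINCT n.id) AS n_count, "
        "count(DISTINCT m.id) AS gene_count"
    )
    params = {"category": node_type, "gene_ids": list(gene_ids)}
    return cypher, params


cypher, params = build_count_query(
    "biolink:OrganismTaxon",
    ["ENSEMBL:ENSG00000285055", "NCBIGene:105370826"],
)
print(cypher)
print(params)
```

Keeping the gene list out of the query text avoids very long query strings and lets both counts come back from a single match.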

And here is a list of genes from those ~54K genes that are connected only to the few biolink:OrganismTaxon CURIEs I showed above (sorry, I can't paste all of them here, so I'll just paste some of them for you to review):

'ENSEMBL:ENSG00000285055',
 'ENSEMBL:ENSG00000282118',
 'NCBIGene:105370826',
 'ENSEMBL:ENSG00000272240',
 'NCBIGene:107985698',
 'ENSEMBL:ENSG00000259682',
 'NCBIGene:107161230',
 'NCBIGene:729494',
 'NCBIGene:100420946',
 'NCBIGene:111429608',
 'NCBIGene:105377218',
 'ENSEMBL:ENSG00000225818',
 'NCBIGene:100270753',
 'ENSEMBL:ENSG00000282401',
 'NCBIGene:105371551',
 'ENSEMBL:ENSG00000276205',
 'ENSEMBL:ENSG00000279749',
 'ENSEMBL:ENSG00000236796',
 'NCBIGene:107986473',
 'ENSEMBL:ENSG00000257327',
 'NCBIGene:106146148',
 'NCBIGene:100130002',
 'NCBIGene:112268400',
 'NCBIGene:387491',
 'NCBIGene:105379389',
 'NCBIGene:116225297',
 'NCBIGene:112872298',
 'ENSEMBL:ENSG00000205018',
 'NCBIGene:728851',
 'NCBIGene:106481413',
 'NCBIGene:101928370',
 'ENSEMBL:ENSG00000237525',
 'NCBIGene:110120974',
 'ENSEMBL:ENSG00000273553',
 'ENSEMBL:ENSG00000248399',
 'ENSEMBL:ENSG00000259103',
 'ENSEMBL:ENSG00000216966',
 'ENSEMBL:ENSG00000267714',
 'ENSEMBL:ENSG00000286049',
 'NCBIGene:100418923',
 'ENSEMBL:ENSG00000233221',
 'NCBIGene:112695107',
 'ENSEMBL:ENSG00000280341',
 'ENSEMBL:ENSG00000282875',
 'NCBIGene:105371093',
 'ENSEMBL:ENSG00000272991',
 'NCBIGene:105376304',
 'NCBIGene:645693',
 'NCBIGene:107986993',
 'NCBIGene:440131',
 'ENSEMBL:ENSG00000227397',
 'NCBIGene:106480378',
 'ENSEMBL:ENSG00000230552',
 'NCBIGene:112163663',
 'ENSEMBL:ENSG00000277290',
 'NCBIGene:105379361',
 'NCBIGene:105372362',
 'NCBIGene:107985915',
 'NCBIGene:106481269',
 'ENSEMBL:ENSG00000236118',
 'ENSEMBL:ENSG00000234921',
 'NCBIGene:100874197',
 'NCBIGene:105378385',
 'ENSEMBL:ENSG00000228143',
 'NCBIGene:105374495',
 'ENSEMBL:ENSG00000282843',
 'NCBIGene:107988033',
 'NCBIGene:100271336',
 'ENSEMBL:ENSG00000285709',
 'ENSEMBL:ENSG00000287519',
 'ENSEMBL:ENSG00000287985',
 'NCBIGene:109623480',
 'NCBIGene:112540016',
 'NCBIGene:106479247',
 'ENSEMBL:ENSG00000288097',
 'ENSEMBL:ENSG00000267160',
 'NCBIGene:110121119',
 'ENSEMBL:ENSG00000277007',
 'ENSEMBL:ENSG00000253796',
 'NCBIGene:112268067',
 'NCBIGene:102724813',
 'ENSEMBL:ENSG00000233625',
 'ENSEMBL:ENSG00000287268',
 'NCBIGene:342808',
 'NCBIGene:105376348',
 'NCBIGene:100132683',
 'NCBIGene:107986683',
 'NCBIGene:105376473',
 'NCBIGene:100289265',
 'NCBIGene:100128493',
 'ENSEMBL:ENSG00000274447',
 'NCBIGene:105369686',
 'NCBIGene:113875020',
 'ENSEMBL:ENSG00000250240',
 'NCBIGene:105374003',
 'NCBIGene:105375616',
 'NCBIGene:344866',
 'NCBIGene:111562370',
 'ENSEMBL:ENSG00000235277',
 'NCBIGene:780812',
 'NCBIGene:285501',
 'NCBIGene:105370804',
 'ENSEMBL:ENSG00000270202',
 'ENSEMBL:ENSG00000248918',
 'NCBIGene:100169767',
 'ENSEMBL:ENSG00000231591',
 'ENSEMBL:ENSG00000243491',
 'NCBIGene:112997569',
 'ENSEMBL:ENSG00000283657',
 'ENSEMBL:ENSG00000286782',
 'NCBIGene:340512',
 'NCBIGene:152709',
 'ENSEMBL:ENSG00000279766',
 'NCBIGene:111413041',
 'ENSEMBL:ENSG00000275437',
 'NCBIGene:107986932',
 'NCBIGene:341689',
 'NCBIGene:650983',

Here is one example that I queried on kg2c-5-2.rtx.ai for ENSEMBL:ENSG00000285055:

> Can you please give me an example of a single gene that you feel may have had an associated edge dropped? Just one gene is sufficient for me to get started.

Regarding an example that I feel may have had an associated edge dropped, I might need to do more investigation. But based on DRKG, we should have many genes that connected to gene itself, compound, pathway and etc.

saramsey commented 3 years ago

> we should have many genes that connected to gene itself,

Why would one expect there to be many cases of a biolink:Gene connected to itself? I mean, I can see that for biolink:Protein nodes that are homodimers. But genes?

saramsey commented 3 years ago

> But based on DRKG, we should have many genes that connected to gene itself, compound, pathway and etc.

Alright, it is certainly fair to speculate. I'd like just one example, please, in order to get started. Thank you.

chunyuma commented 3 years ago

Hi @saramsey, I meant gene-gene relationship based on the below info from DRKG:

Entity-type pair	Drugbank	GNBR	Hetionet	STRING	IntAct	DGIdb	Bibliography	Total interactions
(Gene, Gene)	-	66,722	474,526	1,496,708	254,346	-	58,629	2,350,931
(Compound, Gene)	24,801	80,803	51,429	-	1,805	26,290	25,666	210,794
(Disease, Gene)	-	95,399	27,977	-	-	-	461	123,837
(Atc, Compound)	15,750	-	-	-	-	-	-	15,750
(Compound, Compound)	1,379,271	-	6,486	-	-	-	-	1,385,757
(Compound, Disease)	4,968	77,782	1,145	-	-	-	-	83,895
(Gene, Tax)	-	14,663	-	-	-	-	-	14,663
(Biological Process, Gene)	-	-	559,504	-	-	-	-	559,504
(Disease, Symptom)	-	-	3,357	-	-	-	-	3,357
(Anatomy, Disease)	-	-	3,602	-	-	-	-	3,602
(Disease, Disease)	-	-	543	-	-	-	-	543
(Anatomy, Gene)	-	-	726,495	-	-	-	-	726,495
(Gene, Molecular Function)	-	-	97,222	-	-	-	-	97,222
(Compound, Pharmacologic Class)	-	-	1,029	-	-	-	-	1,029
(Cellular Component, Gene)	-	-	73,566	-	-	-	-	73,566
(Gene, Pathway)	-	-	84,372	-	-	-	-	84,372
(Compound, Side Effect)	-	-	138,944	-	-	-	-	138,944
Total	1,424,790	335,369	2,250,197	1,496,708	256,151	26,290	84,756	5,874,261

saramsey commented 3 years ago

> Do we really know about pathways involving ~4k genes?

Your point is well taken. Biologically speaking, it has to be higher than that. There may well be a bug. Please bear with me, KG2c is new to me and I am still unfamiliar with its construction.

saramsey commented 3 years ago

> Hi @saramsey, I meant drug-drug relationship based on the below info from DRKG:

Wait, why are we talking about drug-drug relationships?

chunyuma commented 3 years ago

> > Hi @saramsey, I meant drug-drug relationship based on the below info from DRKG:

> Wait, why are we talking about drug-drug relationships?

Oh, I'm sorry @saramsey, I made a typo again. There is a lot going on today, which is causing me to make a lot of typos. It should be gene-gene relationships rather than drug-drug relationships. Sorry for the confusion. We should have many gene-gene relationships.

Entity-type pair	Drugbank	GNBR	Hetionet	STRING	IntAct	DGIdb	Bibliography	Total interactions
(Gene, Gene)	-	66,722	474,526	1,496,708	254,346	-	58,629	2,350,931

saramsey commented 3 years ago

I think the following Cypher query (against kg2endpoint-kg2-5-2.rtx.ai) shows that there are 1.35M distinct ordered pairs of nodes of category biolink:Protein that are directly connected by a relationship:

match (n:`biolink:Protein`)-[r]->(m:`biolink:Protein`) with n.id as foo, m.id as bar, count(*) as ignored return count(*);

saramsey commented 3 years ago

How many protein nodes are there that participate in protein-protein edges in KG2.5.2? I'd estimate about 273.5k, based on this cypher query:

match (n:`biolink:Protein`)-[r]-(m:`biolink:Protein`) with n.id as foo return count(*);
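A note on counting with undirected patterns: a pattern like (n)-[r]-(m) matches each relationship once per orientation, so the number of rows, the number of distinct nodes, and the number of distinct ordered pairs can all differ. The toy sketch below (hypothetical protein IDs, plain Python rather than Cypher) shows the three quantities side by side. I have not checked which of them the figures quoted above correspond to.

```python
# Toy directed protein-protein edges; the IDs are hypothetical.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]

# An undirected match yields every edge in both orientations.
rows = [(a, b) for a, b in edges] + [(b, a) for a, b in edges]

row_count = len(rows)                       # what count(*) over the rows gives
distinct_nodes = len({n for n, _ in rows})  # what count(distinct n.id) gives
distinct_ordered_pairs = len(set(edges))    # distinct (n, m) pairs, directed

print(row_count, distinct_nodes, distinct_ordered_pairs)
```

With three edges among three proteins, the row count is twice the edge count while the distinct-node count stays at three.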

saramsey commented 3 years ago

Here is a pair of uniprot proteins that have a relationship between them in KG2.5.2, selected somewhat arbitrarily by this Cypher query:

match (n:`biolink:Protein` {provided_by: 'identifiers_org_registry:uniprot'})-[r]-(m:`biolink:Protein` {provided_by: 'identifiers_org_registry:uniprot'}) return n.id, m.id limit 1;

the pair is:

saramsey commented 3 years ago

That pair of Uniprot proteins is connected by a biolink:regulates relation, as shown here:

saramsey commented 3 years ago

Let's check if that relationship made it into KG2c-2.5.2. Sure enough, here it is:

saramsey commented 3 years ago

Note that in KG2c-2.5.2, both node UniProtKB:Q6UXQ4 and node UniProtKB:Q12948 are of category biolink:Protein not category biolink:Gene:

saramsey commented 3 years ago

but both UniProtKB:Q6UXQ4 and UniProtKB:Q12948 should be picked up if you reference them using biolink:Gene in the label, as shown here:

dkoslicki commented 3 years ago

Interesting! Given we conflate genes with proteins, is it possibly the case that the synonymizer is not conflating them and so they lose the connections they should (?) inherit from the protein-[]-({something else}) nodes? Perhaps @amykglen or @edeutsch could chime in about these degree 1 genes in KG2C

saramsey commented 3 years ago

I just ran a query on KG2c-2.5.2 and it shows 1.849M relationships between biolink:Gene and biolink:Gene labeled nodes:

You can try the Cypher out for yourself:

match (n:`biolink:Gene`)-[r]->(m:`biolink:Gene`) return count(*)

saramsey commented 3 years ago

> there are only ~4K biolink:Gene curies which are from HGNC, NCBIGene and Ensembl and also have gene sequence information

OK, sure, if you can give me an example of a CURIE ID of a node that you feel (based on UniProtKB) should have an amino acid sequence attached to it (but does not have it in KG2c), I will be glad to figure out where we went wrong.

saramsey commented 3 years ago

Also, I note that there are 130,519 nodes in KG2c-2.5.2 with label biolink:Gene that are connected to Homo sapiens,

match (n:`biolink:Gene`)-[r]->(m:`biolink:OrganismTaxon` {name: 'Homo sapiens'}) with n.id as foo return count(*)

so is it so surprising that only 54k (about 41%) of them are only connected to Homo sapiens?
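The percentage can be checked from the two counts reported earlier in this thread (53,976 from the degree-1 query and 130,519 from the label-based query). This is just the arithmetic, with variable names chosen for illustration:

```python
taxon_only = 53_976       # degree-1 genes connected to Homo sapiens (reported above)
labelled_gene = 130_519   # biolink:Gene-labelled nodes connected to Homo sapiens

fraction = taxon_only / labelled_gene
print(f"{fraction:.1%}")
```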

saramsey commented 3 years ago

So, in KG2c-2.5.2, I'm seeing 8,000 nodes with label biolink:Gene that are connected to a node of label biolink:Pathway, so we are well beyond 4,000:

match (n:`biolink:Gene`)-[r]->(m:`biolink:Pathway`) return count(distinct n.id)

saramsey commented 3 years ago

I checked the first gene that @chunyuma provided on his list of only-connected-to-human-taxon genes, ENSEMBL:ENSG00000285055, and sure enough, it is only connected to "Homo sapiens" in KG2.5.2:

saramsey commented 3 years ago

Checking the third node that @chunyuma provided, NCBIGene:105370826, and same thing; in KG2.5.2, it is, in fact, only connected to "Homo sapiens":

saramsey commented 3 years ago

Checking another one, selected at random from @chunyuma's list, NCBIGene:780812:

It has two neighbors in KG2.5.2, but note that one of them is connected by biolink:same_as, so those two yellow nodes should be identified in KG2c, right? Again, I'm not seeing a problem here.
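To illustrate why a biolink:same_as neighbor should disappear once synonymous nodes are merged, here is a minimal union-find sketch. It is not the KG2c canonicalization code: the synonym ID is hypothetical, the function name `merge_same_as` is made up, and the choice of representative node is arbitrary here, whereas KG2c has its own rules for picking a preferred CURIE.

```python
HUMAN = "CHEMBL.TARGET:CHEMBL372"


def merge_same_as(edges):
    """Map every node ID to a representative of its same_as cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for subj, pred, obj in edges:
        root_s, root_o = find(subj), find(obj)
        if pred == "biolink:same_as":
            parent[root_s] = root_o
    return {node: find(node) for node in list(parent)}


# Toy edges; the synonym node is hypothetical.
edges = [
    ("NCBIGene:780812", "biolink:same_as", "GENE:hypothetical_synonym"),
    ("NCBIGene:780812", "biolink:in_taxon", HUMAN),
]

rep = merge_same_as(edges)
# Rewrite endpoints to representatives and drop edges that became self-loops.
merged_edges = {(rep[s], p, rep[o]) for s, p, o in edges if rep[s] != rep[o]}
print(merged_edges)
```

After merging, the two gene nodes collapse into one and the only remaining edge is the in_taxon edge to the human node.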

chunyuma commented 3 years ago

> so is it so surprising that only 54k (about 41%) of them are only connected to Homo sapiens?

OK, @saramsey, sorry for the confusion about these 54K nodes. But one thing I want to point out is that in the KG2c endpoint, @edeutsch and @amykglen designed what is called expanded_category, which includes all supercategories. So to find the number of nodes of category biolink:Gene that are connected to Homo sapiens, we should use:

match (n)-[r]->(m:`biolink:OrganismTaxon` {name: 'Homo sapiens'}) where n.category='biolink:Gene' with n.id as foo return count(*)

It is actually around 59,090 (~54K + ~4K).

Otherwise, the count will include other nodes that carry the biolink:Gene label only through the expanded categories.
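The difference between filtering on the node label and filtering on the category property can be sketched with a toy node list. All IDs and label sets below are hypothetical. The one detail taken from this thread is that protein nodes can also be matched through the biolink:Gene label.

```python
# `category` is the single assigned type; `labels` also carries the
# expanded categories. All IDs and label sets are hypothetical.
nodes = [
    {"id": "GENE:1", "category": "biolink:Gene",
     "labels": {"biolink:Gene", "biolink:NamedThing"}},
    {"id": "PROT:1", "category": "biolink:Protein",
     "labels": {"biolink:Protein", "biolink:Gene", "biolink:NamedThing"}},
    {"id": "DIS:1", "category": "biolink:Disease",
     "labels": {"biolink:Disease", "biolink:NamedThing"}},
]

# Analogue of match (n:`biolink:Gene`)
by_label = [n["id"] for n in nodes if "biolink:Gene" in n["labels"]]
# Analogue of where n.category = 'biolink:Gene'
by_category = [n["id"] for n in nodes if n["category"] == "biolink:Gene"]

print(by_label, by_category)
```

The label-based filter picks up the protein node as well, which is one way the two counts in this thread could diverge.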

@dkoslicki, based on my investigation, many of these ~54K biolink:Gene nodes are pseudogenes or genes that only have transcripts and are not translated into proteins. I think this might explain why they don't connect to biolink:Protein. But I'm not sure whether these genes should connect to some pathways.

Here are a few examples:

chunyuma commented 3 years ago

@saramsey, one thing that makes me curious is that, since most of these ~54K genes should be noncoding genes, they should connect to some NoncodingRNAProduct CURIEs, right? I think KG2.5.2 already has NoncodingRNAProduct CURIEs.

Take ENSG00000205018 as an example:

It should connect to a lncRNA ENST00000378347.2:

saramsey commented 3 years ago

OK, so using this Cypher query against KG2.5.2,

with ['ENSEMBL:ENSG00000285055',
 'ENSEMBL:ENSG00000282118',
 'NCBIGene:105370826',
 'ENSEMBL:ENSG00000272240',
 'NCBIGene:107985698',
 'ENSEMBL:ENSG00000259682',
 'NCBIGene:107161230',
 'NCBIGene:729494',
 'NCBIGene:100420946',
 'NCBIGene:111429608',
 'NCBIGene:105377218',
 'ENSEMBL:ENSG00000225818',
 'NCBIGene:100270753',
 'ENSEMBL:ENSG00000282401',
 'NCBIGene:105371551',
 'ENSEMBL:ENSG00000276205',
 'ENSEMBL:ENSG00000279749',
 'ENSEMBL:ENSG00000236796',
 'NCBIGene:107986473',
 'ENSEMBL:ENSG00000257327',
 'NCBIGene:106146148',
 'NCBIGene:100130002',
 'NCBIGene:112268400',
 'NCBIGene:387491',
 'NCBIGene:105379389',
 'NCBIGene:116225297',
 'NCBIGene:112872298',
 'ENSEMBL:ENSG00000205018',
 'NCBIGene:728851',
 'NCBIGene:106481413',
 'NCBIGene:101928370',
 'ENSEMBL:ENSG00000237525',
 'NCBIGene:110120974',
 'ENSEMBL:ENSG00000273553',
 'ENSEMBL:ENSG00000248399',
 'ENSEMBL:ENSG00000259103',
 'ENSEMBL:ENSG00000216966',
 'ENSEMBL:ENSG00000267714',
 'ENSEMBL:ENSG00000286049',
 'NCBIGene:100418923',
 'ENSEMBL:ENSG00000233221',
 'NCBIGene:112695107',
 'ENSEMBL:ENSG00000280341',
 'ENSEMBL:ENSG00000282875',
 'NCBIGene:105371093',
 'ENSEMBL:ENSG00000272991',
 'NCBIGene:105376304',
 'NCBIGene:645693',
 'NCBIGene:107986993',
 'NCBIGene:440131',
 'ENSEMBL:ENSG00000227397',
 'NCBIGene:106480378',
 'ENSEMBL:ENSG00000230552',
 'NCBIGene:112163663',
 'ENSEMBL:ENSG00000277290',
 'NCBIGene:105379361',
 'NCBIGene:105372362',
 'NCBIGene:107985915',
 'NCBIGene:106481269',
 'ENSEMBL:ENSG00000236118',
 'ENSEMBL:ENSG00000234921',
 'NCBIGene:100874197',
 'NCBIGene:105378385',
 'ENSEMBL:ENSG00000228143',
 'NCBIGene:105374495',
 'ENSEMBL:ENSG00000282843',
 'NCBIGene:107988033',
 'NCBIGene:100271336',
 'ENSEMBL:ENSG00000285709',
 'ENSEMBL:ENSG00000287519',
 'ENSEMBL:ENSG00000287985',
 'NCBIGene:109623480',
 'NCBIGene:112540016',
 'NCBIGene:106479247',
 'ENSEMBL:ENSG00000288097',
 'ENSEMBL:ENSG00000267160',
 'NCBIGene:110121119',
 'ENSEMBL:ENSG00000277007',
 'ENSEMBL:ENSG00000253796',
 'NCBIGene:112268067',
 'NCBIGene:102724813',
 'ENSEMBL:ENSG00000233625',
 'ENSEMBL:ENSG00000287268',
 'NCBIGene:342808',
 'NCBIGene:105376348',
 'NCBIGene:100132683',
 'NCBIGene:107986683',
 'NCBIGene:105376473',
 'NCBIGene:100289265',
 'NCBIGene:100128493',
 'ENSEMBL:ENSG00000274447',
 'NCBIGene:105369686',
 'NCBIGene:113875020',
 'ENSEMBL:ENSG00000250240',
 'NCBIGene:105374003',
 'NCBIGene:105375616',
 'NCBIGene:344866',
 'NCBIGene:111562370',
 'ENSEMBL:ENSG00000235277',
 'NCBIGene:780812',
 'NCBIGene:285501',
 'NCBIGene:105370804',
 'ENSEMBL:ENSG00000270202',
 'ENSEMBL:ENSG00000248918',
 'NCBIGene:100169767',
 'ENSEMBL:ENSG00000231591',
 'ENSEMBL:ENSG00000243491',
 'NCBIGene:112997569',
 'ENSEMBL:ENSG00000283657',
 'ENSEMBL:ENSG00000286782',
 'NCBIGene:340512',
 'NCBIGene:152709',
 'ENSEMBL:ENSG00000279766',
 'NCBIGene:111413041',
 'ENSEMBL:ENSG00000275437',
 'NCBIGene:107986932',
 'NCBIGene:341689',
 'NCBIGene:650983'] as cids 
  match (n:`biolink:Gene`) with n, size((n)-->()) as degree where n.id in cids and degree > 1 with n.id as pnids  

 match (m:`biolink:Gene`)-[r]->(z) where m.id in pnids and type(r) <> 'biolink:same_as' and type(r) <> 'biolink:in_taxon'
 return m.id, type(r)

I checked every one of the node IDs that you pasted above. They all are either degree 1 (connected only to "Homo sapiens") or if they are degree 2 or above, it is entirely due to "same_as" connections to other genes that should be consolidated to a single gene in KG2c. Thus, from KG2.5.2 standpoint, your list checks out. They all should be solely connected to "Homo sapiens" in KG2c, at least as far as the canonicalization and synonymization code is concerned.
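The check described above (do any of the listed genes have an outgoing edge other than same_as or in_taxon?) can be sketched in plain Python. This is an illustration on toy data, not the query that was run: the helper name `informative_edges` and every ID other than the two taken from the list above are hypothetical.

```python
IGNORED = {"biolink:same_as", "biolink:in_taxon"}


def informative_edges(edges, candidate_ids):
    """Return (subject, predicate) pairs for candidates, skipping ignored predicates."""
    wanted = set(candidate_ids)
    return [(s, p) for s, p, o in edges if s in wanted and p not in IGNORED]


# Toy edges; only the two gene CURIEs come from the list above.
edges = [
    ("ENSEMBL:ENSG00000285055", "biolink:in_taxon", "CHEMBL.TARGET:CHEMBL372"),
    ("NCBIGene:780812", "biolink:in_taxon", "CHEMBL.TARGET:CHEMBL372"),
    ("NCBIGene:780812", "biolink:same_as", "GENE:hypothetical_synonym"),
    ("GENE:other", "biolink:participates_in", "PATHWAY:1"),
]

listed = ["ENSEMBL:ENSG00000285055", "NCBIGene:780812"]
found = informative_edges(edges, listed)
print(found)
```

An empty result for the listed IDs corresponds to the conclusion above: nothing beyond taxon and synonym edges.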

saramsey commented 3 years ago

OK @chunyuma, thank you for bringing the node ENSEMBL:ENSG00000205018 to my attention. Indeed, it is connected only to "Homo sapiens" in KG2:

Chunyu wrote:

> Take ENSG00000205018 as an example: It should connect to a lncRNA ENST00000378347.2:

And indeed, it has a transcript ENST00000378347.2 annotated in Ensembl (and has had one since Ensembl release 76). But contrary to your claim, ENST00000378347.2 is not a lncRNA, it is a protein-coding gene. See for yourself:

chunyuma commented 3 years ago

Thanks, @saramsey, for looking into this.

I think one possible reason that these ~54K genes are only connected to Homo sapiens, although they have sequence information from HGNC, NCBIGene and Ensembl, is that currently KG2 doesn't have their noncoding RNA info yet (at least not from HGNC, NCBIGene and Ensembl):

match (n:`biolink:NoncodingRNAProduct`) return distinct n.provided_by

Therefore, although some of them might have a regulatory function, the regulatory relationship should come from the corresponding noncoding RNA.

saramsey commented 3 years ago

More to the point, for Ensembl genes, we don't currently map those genes to Ensembl proteins in our ETL process, as you can see from this script: https://github.com/RTXteam/RTX/blob/master/code/kg2/ensembl_json_to_kg_json.py

If you would like for the KG2 build system to do that, please open a separate issue (tagged kg2) and request that we modify the Ensembl ETL script in order to map Ensembl Genes to Ensembl Proteins. I think it may be doable, but I need to check. Actually @ericawood can check.

saramsey commented 3 years ago

Yes, KG2 is light on lncRNA information, from the Ensembl standpoint. It is best to open a request for a specific source, i.e., if for your DTD work you need Gene->RNA for Ensembl, open a ticket for that.

chunyuma commented 3 years ago

> And indeed, it has a transcript ENST00000378347.2 annotated in Ensembl (and has had one since Ensembl release 76). But contrary to your claim, ENST00000378347.2 is not a lncRNA, it is a protein-coding gene. See for yourself:

It is weird. Based on the existing record of this gene on ensembl here, it is a non-protein-coding gene. Note that this is based on Ensembl release 103.

chunyuma commented 3 years ago

> Yes, KG2 is light on lncRNA information, from the Ensembl standpoint. It is best to open a request for a specific source, i.e., if for your DTD work you need Gene->RNA for Ensembl, open a ticket for that.

Thanks @saramsey, I will open another issue for the request of doing ETL for Gene->RNA for Ensembl.

saramsey commented 3 years ago

> > And indeed, it has a transcript ENST00000378347.2 annotated in Ensembl (and has had one since Ensembl release 76). But contrary to your claim, ENST00000378347.2 is not a lncRNA, it is a protein-coding gene. See for yourself:

> It is weird. Based on the existing record of this gene on ensembl here, it is a non-protein-coding gene. Note that this is based on Ensembl release 103.

OK, we are both (sort of) right, LOL. But you are more correct. Looking more closely, it was classified as a protein-coding gene in Ensembl release 76. In the current Ensembl, it is a lncRNA. Sorry, my mistake.

chunyuma commented 3 years ago

> > > And indeed, it has a transcript ENST00000378347.2 annotated in Ensembl (and has had one since Ensembl release 76). But contrary to your claim, ENST00000378347.2 is not a lncRNA, it is a protein-coding gene. See for yourself:

> > It is weird. Based on the existing record of this gene on ensembl here, it is a non-protein-coding gene. Note that this is based on Ensembl release 103.

> OK, we are both (sort of) right, LOL. But you are more correct. Looking more closely, it was classified as a protein-coding gene in Ensembl release 76. In the current Ensembl, it is a lncRNA. Sorry, my mistake.

So @saramsey, do you think we need to update the release version of Ensembl that we're currently using on KG2?

saramsey commented 3 years ago

It is not surprising that most pseudogenes have little in the way of functional annotations. Although the point is debatable, many biologists regard pseudogenes as non-functional in general. It's sort of in their definition. Sort of. Gets into details about "processed" vs. "unprocessed" pseudogenes, etc.

But of course, lots of noncoding transcripts are thought to be functional. Antisense, lncRNAs, etc. Lots of them.

saramsey commented 3 years ago

> > > > And indeed, it has a transcript ENST00000378347.2 annotated in Ensembl (and has had one since Ensembl release 76). But contrary to your claim, ENST00000378347.2 is not a lncRNA, it is a protein-coding gene. See for yourself:

> > > It is weird. Based on the existing record of this gene on ensembl here, it is a non-protein-coding gene. Note that this is based on Ensembl release 103.

> > OK, we are both (sort of) right, LOL. But you are more correct. Looking more closely, it was classified as a protein-coding gene in Ensembl release 76. In the current Ensembl, it is a lncRNA. Sorry, my mistake.

> So @saramsey, do you think we need to update the release version of Ensembl that we're currently using on KG2?

already done (in code that is; you won't see the new build for a couple of weeks probably); see #1374

chunyuma commented 3 years ago

Ah, thanks @saramsey. No problem! Thanks so much for KG2 team's hard work on fixing these issues that I reported.

saramsey commented 3 years ago

Note that for DTD stuff, while there may be some "gold" in noncoding genes as therapeutic targets, they are not the "classical" drug targets for medicinal chemists, as I understand it.

lncRNAs, however, could themselves (or homologs) be therapeutic; see https://www.sciencedirect.com/science/article/pii/S2211383520307346

So yes, we do want KG2 to have information connecting lncRNAs to pathways, phenotypes, interacting partners, etc.

saramsey commented 3 years ago

Also, at the risk of undercutting my general point above, there is emerging literature on the ASO-druggability of lncRNAs; for example:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5941496/

saramsey commented 3 years ago

> KG2 doesn't have their noncoding RNA info yet (at least not from HGNC, NCBIGene and Ensembl):

Note, we have an ETL of miRBase that should help with the microRNAs; see #1247. I believe this should be in KG2.6.0.

chunyuma commented 3 years ago

Thanks @saramsey. In DTD, we're actually still interested in how drugs treat diseases, but we hope the model can use some connections between drugs and genes to help explain the model. So if we can include RNA information, it might be helpful.

saramsey commented 3 years ago

As a side effect of this issue, another NCBIGene issue (which is super minor, but still maybe worth mentioning) has come to light: #1379

saramsey commented 3 years ago

In KG2.5.2, HGNC:39380 does indeed seem like it would be a "only connected to Homo sapiens" gene in KG2c:

match (n:`biolink:Gene` {id: 'HGNC:39380'}) return n;

It's a pseudogene so I guess I don't expect pathway, protein, or phenotype links.

saramsey commented 3 years ago

So @chunyuma are there any other individual genes you want me to check in KG2? If not, I will close this issue out tomorrow I suppose. Happy to keep it open if there is more detective work for me to do.

chunyuma commented 3 years ago

> As a side effect of this issue, another NCBIGene issue (which is super minor, but still maybe worth mentioning) has come to light: #1379

Hi @saramsey, before we close this issue, could you tell me what this NCBIGene issue is? I don't quite understand the issue in #1379. Thanks!

RTXteam / RTX

Do we lose some knowledge sources that might connect ~54K genes to other bioentities? #1376