RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Do we lose some knowledge sources that might connect ~54K genes to other bioentities? #1376

Closed chunyuma closed 3 years ago

chunyuma commented 3 years ago

Hi KG2team (@saramsey, @kvarforl, @ericawood) again,

I just found one more potential issue for KG2 (Sorry for reporting many KG2 issues recently). I discussed with @dkoslicki and decided to open this issue to report it to you.

I investigated this based on KG2.5.2c, but it should also apply to KG2.5.2. I found that ~54K biolink:Gene nodes are connected only to the 'Homo sapiens' curie (CHEMBL.TARGET:CHEMBL372) and to no other nodes.

The reason I did this investigation is that, after I excluded some node types from KG2c (please see below) to simplify the KG and get better explanations for the DTD model, I found only ~4K biolink:Gene curies that come from HGNC, NCBIGene, or Ensembl and also have gene sequence information. This is despite KG2 having plenty of genes with gene sequence information from these three sources (e.g., 'NCBIGene': 59060, 'ENSEMBL': 67929, 'HGNC': 40311).

In the figure below, the node types shown in black are the ones that I excluded.

[Image: Screen_Shot_2021-04-08_at_7 11 40_PM]

I then investigated why so many genes were lost after I excluded these node types. I suspected that they might be connected only to the node types that I excluded.

Here is my investigation result:

[Image: Screen Shot 2021-04-15 at 1 25 11 AM]

As you can see, all of these lost genes (53734) are connected to biolink:OrganismTaxon, and within biolink:OrganismTaxon (please see below), almost all of them are connected only to Homo sapiens. This means that if the Homo sapiens curie is excluded, all of these genes become isolated nodes.

[Image: Screen Shot 2021-04-15 at 1 25 20 AM]
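The check described above (which gene nodes become isolated once certain node types are excluded) can be sketched roughly as follows. This is only a minimal illustration on a toy graph, not the query that was actually run against KG2c: the function name `find_isolated_after_exclusion`, the edge/category data layout, and the example curies other than CHEMBL.TARGET:CHEMBL372 are assumptions made up for the sketch.

```python
from collections import defaultdict


def find_isolated_after_exclusion(edges, categories, excluded_categories,
                                  target_category="biolink:Gene"):
    """Return target-category nodes whose every neighbor is in an excluded category.

    edges: iterable of (subject_curie, object_curie) pairs (hypothetical layout).
    categories: dict mapping curie -> Biolink category string.
    """
    # Build an undirected neighbor map from the edge list.
    neighbors = defaultdict(set)
    for subj, obj in edges:
        neighbors[subj].add(obj)
        neighbors[obj].add(subj)

    isolated = set()
    for node, category in categories.items():
        if category != target_category:
            continue
        nbrs = neighbors.get(node, set())
        # A node is "lost" if it has neighbors, but all of them would be removed.
        if nbrs and all(categories.get(n) in excluded_categories for n in nbrs):
            isolated.add(node)
    return isolated


# Toy data; the gene and protein curies below are illustrative placeholders.
categories = {
    "NCBIGene:1": "biolink:Gene",
    "NCBIGene:2": "biolink:Gene",
    "CHEMBL.TARGET:CHEMBL372": "biolink:OrganismTaxon",
    "UniProtKB:P00001": "biolink:Protein",
}
edges = [
    ("NCBIGene:1", "CHEMBL.TARGET:CHEMBL372"),
    ("NCBIGene:2", "CHEMBL.TARGET:CHEMBL372"),
    ("NCBIGene:2", "UniProtKB:P00001"),
]

lost = find_isolated_after_exclusion(edges, categories, {"biolink:OrganismTaxon"})
print(sorted(lost))
```

In this toy graph, only the gene whose sole neighbor is the Homo sapiens node is reported; the gene that also links to a protein survives the exclusion.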

@dkoslicki thinks that perhaps some knowledge sources that would connect those genes to other bioentities got dropped somewhere. Could you please help us take a look at this issue? Thank you so much!

saramsey commented 3 years ago

As a side effect of this issue, another NCBIGene issue (which is super minor, but still maybe worth mentioning) has come to light: #1379

Hi @saramsey, before we close this issue, could you tell me what this NCBIGene issue is? I don't quite understand the issue in #1379. Thanks!

The script ncbigene_tsv_to_kg_json.py incorrectly assigns the category biolink:MicroRNA to NCBI microRNA gene nodes. Instead, it should assign the category biolink:Gene to those nodes. Fixing that bug also necessitates changing the predicate used to map a cross-reference between an NCBI microRNA gene and a miRBase record: instead of biolink:same_as, we will have to use biolink:has_gene_product. That is the second part of issue #1379.
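To make the two-part change concrete, here is a rough sketch of the intended behavior. It is not the code of ncbigene_tsv_to_kg_json.py: the helper names `make_gene_node` and `make_mirbase_xref_edge`, the dict layout, and the placeholder curies are all hypothetical and only illustrate the category and predicate described above.

```python
# Category for NCBI gene nodes, including microRNA genes
# (previously biolink:MicroRNA for the latter).
GENE_CATEGORY = "biolink:Gene"

# Predicate for a gene -> miRBase cross-reference (previously biolink:same_as).
MIRBASE_XREF_PREDICATE = "biolink:has_gene_product"


def make_gene_node(gene_curie, name):
    """Build a node dict for an NCBI gene (hypothetical layout)."""
    return {"id": gene_curie, "name": name, "category": GENE_CATEGORY}


def make_mirbase_xref_edge(gene_curie, mirbase_curie):
    """Build an edge dict linking a microRNA gene to its miRBase record."""
    return {
        "subject": gene_curie,
        "predicate": MIRBASE_XREF_PREDICATE,
        "object": mirbase_curie,
    }


# Placeholder identifiers, not real records.
node = make_gene_node("NCBIGene:0000", "example microRNA gene")
edge = make_mirbase_xref_edge("NCBIGene:0000", "miRBase:MI0000000")
print(node["category"], edge["predicate"])
```

The point of the predicate change is that once the NCBI node is a gene rather than a microRNA, it is no longer the same kind of entity as the miRBase record, so an equivalence predicate no longer fits.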

chunyuma commented 3 years ago

Ah, I see, thanks @saramsey! I think we can close this issue now, so I'll close it.