Closed chunyuma closed 3 years ago
As a side effect of this issue, another NCBIGene issue (which is super minor, but still maybe worth mentioning) has come to light: #1379
Hi @saramsey, before we close this issue, could you tell me what this NCBIGene issue is? I don't quite understand the issue in #1379. Thanks!
The script ncbigene_tsv_to_kg_json.py incorrectly assigns the category biolink:MicroRNA to NCBI microRNA gene nodes. Instead, it should assign the category biolink:Gene to those nodes. Fixing that bug also necessitates changing the predicate used to map a cross-reference between an NCBI microRNA gene and a miRBase record; instead of biolink:same_as, we will have to use biolink:has_gene_product. That is the second part of issue #1379.
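For illustration only, here is a minimal sketch of the two-part change described above. The function names, the placeholder curies, and the subject/predicate/object dictionary layout are assumptions made for this example; they are not taken from ncbigene_tsv_to_kg_json.py.

```python
# Hypothetical sketch: names and record layout are illustrative assumptions,
# not the actual code in ncbigene_tsv_to_kg_json.py.
GENE_CATEGORY = "biolink:Gene"
MIRBASE_XREF_PREDICATE = "biolink:has_gene_product"  # instead of biolink:same_as


def make_gene_node(gene_curie: str, name: str) -> dict:
    """Build a node for an NCBI gene record.

    microRNA genes get biolink:Gene like every other gene record,
    rather than biolink:MicroRNA.
    """
    return {"id": gene_curie, "name": name, "category": GENE_CATEGORY}


def make_mirbase_xref_edge(gene_curie: str, mirbase_curie: str) -> dict:
    """Link a microRNA gene to its miRBase record as gene -> gene product."""
    return {
        "subject": gene_curie,
        "predicate": MIRBASE_XREF_PREDICATE,
        "object": mirbase_curie,
    }


# Placeholder identifiers, not real records.
node = make_gene_node("NCBIGene:0000", "example microRNA gene")
edge = make_mirbase_xref_edge("NCBIGene:0000", "miRBase:MI0000000")
```

The point of the sketch is that once the gene node is a biolink:Gene and the miRBase record describes a product of that gene, the two are no longer the same kind of thing, so biolink:same_as no longer fits.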
Ah, I see, thanks @saramsey! I think this issue can be closed, so I'll close it.
Hi KG2 team (@saramsey, @kvarforl, @ericawood) again,
I just found one more potential issue with KG2 (sorry for reporting so many KG2 issues recently). I discussed it with @dkoslicki, and we decided to open this issue to report it to you.
I investigated this based on KG2.5.2c, but it should also apply to KG2.5.2. I found that we have ~54K biolink:Gene nodes that are connected only to the 'Homo sapiens' curie (CHEMBL.TARGET:CHEMBL372) and not to any other node.
I looked into this because, after I excluded some node types from KG2c (please see below) to simplify the KG and get better explanations for the DTD model, only ~4K biolink:Gene curies from HGNC, NCBIGene, and Ensembl that also have gene sequence information remained. KG2 actually has plenty of genes with gene sequence information from these three sources (e.g., 'NCBIGene': 59060, 'ENSEMBL': 67929, 'HGNC': 40311). In the figure below, the node types shown in black are the ones that I excluded.
So then I investigated why there are so many genes lost after I excluded these node types. I suspect that they might be only connected to the node types that I excluded.
Here is my investigation result:
As you can see, all of these lost genes (53734) are connected to biolink:OrganismTaxon, and within biolink:OrganismTaxon (please see below), almost all of them are connected only to Homo sapiens. This means that if this Homo sapiens curie is excluded, all of these genes become isolated nodes.
@dkoslicki thinks that perhaps some knowledge sources that would connect those genes to other bioentities got dropped somewhere. Could you please help us take a look at this issue? Thank you so much!