RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Conversion Script Errors #296

Closed ecwood closed 11 months ago

ecwood commented 1 year ago

Inspired by https://github.com/RTXteam/RTX-KG2/issues/291#issuecomment-1604448179 and @acevedol's observation that the conversion scripts also fail, it seems important to check the state of each conversion script as well:

Conversion Script | Working?
chembl_mysql_to_kg_json.py | This seems to be working as expected.
dgidb_tsv_to_kg_json.py | This seems to be working as expected.
disgenet_tsv_to_kg_json.py | This seems to be working as expected.
drugbank_xml_to_kg_json.py | This seems to be working as expected.
drugcentral_json_to_kg_json.py | This seems to be working as expected. (Previously: waiting for #295 to check.)
ensembl_json_to_kg_json.py | This seems to be working as expected.
go_gpa_to_kg_json.py | This seems to be working as expected.
hmdb_xml_to_kg_json.py | This seems to be working as expected.
intact_tsv_to_kg_json.py | This seems to be working as expected.
jensenlab_tsv_to_kg_json.py | This seems to be working as expected.
kegg_json_to_kg_json.py | This seems to be working as expected.
mirbase_dat_to_kg_json.py | This seems to be working as expected.
multi_ont_to_json_kg.py | Fails, see #300
ncbigene_tsv_to_kg_json.py | This seems to be working as expected.
reactome_mysql_to_kg_json.py | Fails, see #297
repodb_csv_to_kg_json.py | Fails, see #298
semmeddb_tuple_list_json_to_kg_json.py | Fails, see #310. (Previously: waiting for #294 to check.)
smpdb_csv_to_kg_json.py | This seems to be working as expected.
unichem_tsv_to_edges_json.py | This seems to be working as expected. (Previously: waiting for #293 to check.)
uniprotkb_dat_to_json.py | Fails, see #299

ecwood commented 1 year ago

After #304 showed us that, despite the program completing, there can still be issues with the data, I think we also need to compare all of the kg2-*.json file sizes with those from KG2.8.3 to make sure they are at least roughly the same.
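A size comparison like this could be scripted roughly as follows. This is only a sketch, not part of the KG2 build system: it assumes the sizes are available as human-readable strings in the `du -h` style (for example `34M` or `2.4G`), and the function names and the 25% tolerance are hypothetical choices made for illustration.

```python
import re

# Binary multipliers for the suffixes used by `du -h` style output.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a human-readable size such as '2.4G' or '1011M' to bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT])?B?\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return float(number) * (_UNITS[unit.upper()] if unit else 1)


def relative_change(old, new):
    """Fractional change from old to new, e.g. 0.10 for a 10% increase."""
    old_bytes = parse_size(old)
    return (parse_size(new) - old_bytes) / old_bytes


def flag_outliers(sizes, tolerance=0.25):
    """Return {script: change} for files whose size moved by more than `tolerance`.

    `sizes` maps a script name to an (old_size, new_size) pair of strings.
    The default tolerance is an arbitrary assumption, not a project threshold.
    """
    flagged = {}
    for script, (old, new) in sizes.items():
        change = relative_change(old, new)
        if abs(change) > tolerance:
            flagged[script] = round(change, 3)
    return flagged


# Sample (KG2.8.3, current) pairs in the style reported in this thread.
sizes = {
    "dgidb_tsv_to_kg_json.py": ("34M", "36M"),
    "kegg_json_to_kg_json.py": ("124M", "195M"),
    "uniprotkb_dat_to_json.py": ("173M", "128M"),
}
flagged = flag_outliers(sizes)
for script, change in flagged.items():
    print(f"{script}: {change:+.1%}")
```

A flagged file is not necessarily wrong, since a source upgrade can legitimately change the output size; the check only narrows down which outputs deserve a closer look.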

ecwood commented 1 year ago

Here's a new table for checking if the output is actually reasonable, compared to KG2.8.3:

Conversion Script | Output Size Correct? | Output Report Reasonable?
chembl_mysql_to_kg_json.py | KG2.8.3: 2.4G; Current: 2.3G | No, see https://github.com/RTXteam/RTX-KG2/issues/311#issuecomment-1624427284
dgidb_tsv_to_kg_json.py | KG2.8.3: 34M; Current: 36M | Reports are identical in everything except _report_datetime.
disgenet_tsv_to_kg_json.py | KG2.8.3: 265M; Current: 279M | Reports are identical in everything except _report_datetime.
drugbank_xml_to_kg_json.py | KG2.8.3: 1.8G; Current: 1.9G | Since DrugBank was upgraded, they are different. There is a new DrugBank predicate, DRUGBANK:carrier, that needs to be mapped. There is one fewer instance each of DRUGBANK:inducer, DRUGBANK:neutralizer, and DRUGBANK:stimulator. There are 63 fewer instances of DRUGBANK:target. All other changes resulted in added edges and nodes. One issue is that the version nodes look the same, since the XML versioning doesn't include the minor release. We may want to bring this up with DrugBank or devise our own method for versioning.
drugcentral_json_to_kg_json.py | KG2.8.3: 195M; Current: 206M | Reports are identical in everything except _report_datetime.
ensembl_json_to_kg_json.py | KG2.8.3: 1.4G; Current: 1.5G | Reports are identical in everything except _report_datetime. Note: we are using Ensembl Genes v106. The latest version is Ensembl Genes v109.
go_gpa_to_kg_json.py | KG2.8.3: 293M; Current: 310M | There are two fewer instances of GO:acts_upstream_of_or_within, five fewer instances of GO:colocalizes_with, ten fewer instances of GO:contributes_to, and 850 fewer instances of GO:located_in. All other predicates saw increases in their number of edges. The overall number of edges increased. There was a new release on 6/11/2023.
hmdb_xml_to_kg_json.py | KG2.8.3: 1.5G; Current: 1.6M | Reports are identical in everything except _report_datetime.
intact_tsv_to_kg_json.py | KG2.8.3: 304M; Current: 328M | There are 1835 fewer instances of MI:0407. All other predicates saw increases in their number of edges. The overall number of edges increased. There was a new release on 6/3/2023 (monthly releases). The releases are not versioned, at least in KG2.
jensenlab_tsv_to_kg_json.py | KG2.8.3: 986M; Current: 1011M | There are 1847 fewer instances of JensenLab:associated_with, the only predicate in this dataset. The overall number of edges decreased. JensenLab releases weekly.
kegg_json_to_kg_json.py | KG2.8.3: 124M; Current: 195M. As of #304, the output size is reasonable. | As of https://github.com/RTXteam/RTX-KG2/issues/304#issuecomment-1614260267, this is reasonable.
mirbase_dat_to_kg_json.py | KG2.8.3: 3.3M; Current: 3.4M | Reports are identical in everything except _report_datetime.
multi_ont_to_json_kg.py | KG2.8.3: 12G; Current: |
ncbigene_tsv_to_kg_json.py | KG2.8.3: 214M; Current: 273M | There are four fewer instances of biolink:related_to. There are 101 fewer instances of nucleic_acid_entity. All other predicates and categories saw increases in their number of edges and nodes, respectively. Overall, the number of nodes and edges increased.
reactome_mysql_to_kg_json.py | KG2.8.3: 182M; Current: 191M | There are 562 fewer instances of REACT:has_member, 21 fewer instances of REACT:linked_to_disease, and 490 fewer instances of biolink:same_as. There are 540 fewer instances of protein. All other predicates and categories saw increases in their number of edges and nodes, respectively. Overall, the number of nodes decreased and the number of edges increased. This corresponds with an upgrade from Reactome v83 to Reactome v85.
repodb_csv_to_kg_json.py | KG2.8.3: 5.7M; Current: 6.0M | Reports are identical in everything except _report_datetime.
semmeddb_tuple_list_json_to_kg_json.py | KG2.8.3: 38G; Current: 43G |
smpdb_csv_to_kg_json.py | KG2.8.3: 19G; Current: 20G | Reports are identical in everything except _report_datetime. SMPDB hasn't been updated since 2018/2019 and PathWhiz hasn't been updated since 2020.
unichem_tsv_to_edges_json.py | KG2.8.3: 57M; Current: 80M | The number of edges increased with the version update. One problem, however, is that the report script used to pick up the source node but doesn't anymore. It's not vital, since there's no version in it anymore, but it seems worth checking out.
uniprotkb_dat_to_json.py | KG2.8.3: 173M; Current: 128M | The predicate biolink:causes was added. Overall, the number of nodes and edges increased. This corresponds with the upgrade from UniProtKB v2022_05 to UniProtKB v2023_02.
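
The report comparisons above (identical except for `_report_datetime`, or differing in specific predicate counts) could be automated along these lines. This is a rough sketch under assumptions: it treats each report as a nested JSON object already loaded into a dict, the key `edges_by_predicate` is a hypothetical name rather than the actual report schema, and the counts in the example are invented for illustration.

```python
def strip_volatile(report, volatile_keys=("_report_datetime",)):
    """Return a copy of a report without top-level keys that differ on every run."""
    return {k: v for k, v in report.items() if k not in volatile_keys}


def count_changes(old, new, path=""):
    """Yield (path, old_value, new_value) for every leaf that differs.

    A key missing on one side is reported with None for that side, so a newly
    added predicate shows up as (path, None, count).
    """
    for key in sorted(set(old) | set(new)):
        here = f"{path}/{key}" if path else str(key)
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) or isinstance(b, dict):
            # Recurse into nested sections, treating a missing side as empty.
            yield from count_changes(
                a if isinstance(a, dict) else {},
                b if isinstance(b, dict) else {},
                here,
            )
        elif a != b:
            yield here, a, b


# Invented example reports; key names and counts are assumptions.
old_report = {
    "_report_datetime": "2023-05-01",
    "edges_by_predicate": {"DRUGBANK:target": 100, "DRUGBANK:inducer": 5},
}
new_report = {
    "_report_datetime": "2023-07-01",
    "edges_by_predicate": {
        "DRUGBANK:target": 37,
        "DRUGBANK:inducer": 5,
        "DRUGBANK:carrier": 3,
    },
}

changes = list(count_changes(strip_volatile(old_report), strip_volatile(new_report)))
for path, before, after in changes:
    print(f"{path}: {before} -> {after}")
```

With real report files, the two dicts would come from `json.load` on each file; an empty `changes` list then corresponds to "identical in everything except _report_datetime".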
ecwood commented 11 months ago

I am closing this issue because the code worked in KG2.8.4pre's build.