Closed ecwood closed 11 months ago
#304 showed us that the data can still have issues even when the program completes, so I think we also need to compare the sizes of all of the kg2-*.json files with those from KG2.8.3 to make sure they are at least roughly the same.
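A rough sketch of how that size check could be automated. The two directory arguments, the 25% tolerance, and the function name `compare_sizes` are assumptions for illustration, not part of the KG2 build system:

```python
from pathlib import Path


def compare_sizes(baseline_dir, current_dir, pattern="kg2-*.json", tolerance=0.25):
    """Compare file sizes between two build directories.

    Returns {filename: (baseline_bytes, current_bytes, ok)}, where ok is True
    when the current size is within `tolerance` (a fraction) of the baseline.
    The tolerance value is an arbitrary assumption, not a KG2 convention.
    """
    baseline_dir, current_dir = Path(baseline_dir), Path(current_dir)
    results = {}
    for base_file in sorted(baseline_dir.glob(pattern)):
        base_size = base_file.stat().st_size
        cur_file = current_dir / base_file.name
        # A missing file in the current build is reported as None and flagged.
        cur_size = cur_file.stat().st_size if cur_file.exists() else None
        if cur_size is None or base_size == 0:
            ok = False
        else:
            ok = abs(cur_size - base_size) / base_size <= tolerance
        results[base_file.name] = (base_size, cur_size, ok)
    return results
```

Called with the KG2.8.3 output directory as the baseline and the current build directory, any entry whose third element is `False` would be worth a manual look. A check like this only catches gross size changes; it says nothing about whether the contents are correct.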
Here's a new table for checking whether the output is actually reasonable compared to KG2.8.3:
Conversion Script | Output Size Correct? | Output Report Reasonable? |
---|---|---|
chembl_mysql_to_kg_json.py | KG2.8.3: 2.4G; Current: 2.3G | No, see https://github.com/RTXteam/RTX-KG2/issues/311#issuecomment-1624427284 |
dgidb_tsv_to_kg_json.py | KG2.8.3: 34M; Current: 36M | Reports are identical in everything except _report_datetime |
disgenet_tsv_to_kg_json.py | KG2.8.3: 265M; Current: 279M | Reports are identical in everything except _report_datetime |
drugbank_xml_to_kg_json.py | KG2.8.3: 1.8G; Current: 1.9G | The reports differ because DrugBank was upgraded. There is a new DrugBank predicate, DRUGBANK:carrier, that needs to be mapped. There is one fewer instance each of DRUGBANK:inducer, DRUGBANK:neutralizer, and DRUGBANK:stimulator. There are 63 fewer instances of DRUGBANK:target. All other changes resulted in added edges and nodes. One issue is that the version nodes look the same, since the XML versioning doesn't include the minor release. We may want to bring this up with DrugBank or devise our own method for versioning. |
drugcentral_json_to_kg_json.py | KG2.8.3: 195M; Current: 206M | Reports are identical in everything except _report_datetime |
ensembl_json_to_kg_json.py | KG2.8.3: 1.4G; Current: 1.5G | Reports are identical in everything except _report_datetime. Note: we are using Ensembl Genes v106. The most recent version is Ensembl Genes v109. |
go_gpa_to_kg_json.py | KG2.8.3: 293M; Current: 310M | There are two fewer instances of GO:acts_upstream_of_or_within, five fewer instances of GO:colocalizes_with, ten fewer instances of GO:contributes_to, and 850 fewer instances of GO:located_in. All other predicates saw increases in their number of edges. The overall number of edges increased. There was a new release on 6/11/2023. |
hmdb_xml_to_kg_json.py | KG2.8.3: 1.5G; Current: 1.6M | Reports are identical in everything except _report_datetime |
intact_tsv_to_kg_json.py | KG2.8.3: 304M; Current: 328M | There are 1835 fewer instances of MI:0407. All other predicates saw increases in their number of edges. The overall number of edges increased. There was a new release on 6/3/2023 (monthly releases). The releases are not versioned, at least in KG2. |
jensenlab_tsv_to_kg_json.py | KG2.8.3: 986M; Current: 1011M | There are 1847 fewer instances of JensenLab:associated_with, the only predicate in this dataset. The overall number of edges decreased. It releases weekly. |
kegg_json_to_kg_json.py | As of #304, the output size is reasonable. KG2.8.3: 124M; Current: 195M | As of https://github.com/RTXteam/RTX-KG2/issues/304#issuecomment-1614260267, this is reasonable. |
mirbase_dat_to_kg_json.py | KG2.8.3: 3.3M; Current: 3.4M | Reports are identical in everything except _report_datetime |
multi_ont_to_json_kg.py | KG2.8.3: 12G; Current: | |
ncbigene_tsv_to_kg_json.py | KG2.8.3: 214M; Current: 273M | There are four fewer instances of biolink:related_to. There are 101 fewer instances of nucleic_acid_entity. All other predicates and categories saw increases in their number of edges and nodes, respectively. Overall, the number of nodes and edges increased. |
reactome_mysql_to_kg_json.py | KG2.8.3: 182M; Current: 191M | There are 562 fewer instances of REACT:has_member, 21 fewer instances of REACT:linked_to_disease, and 490 fewer instances of biolink:same_as. There are 540 instances of protein. All other predicates and categories saw increases in their number of edges and nodes, respectively. Overall, the number of nodes decreased and the number of edges increased. This corresponds with an upgrade from Reactome v83 to Reactome v85. |
repodb_csv_to_kg_json.py | KG2.8.3: 5.7M; Current: 6.0M | Reports are identical in everything except _report_datetime |
semmeddb_tuple_list_json_to_kg_json.py | KG2.8.3: 38G; Current: 43G | |
smpdb_csv_to_kg_json.py | KG2.8.3: 19G; Current: 20G | Reports are identical in everything except _report_datetime. SMPDB hasn't been updated since 2018/2019 and PathWhiz hasn't been updated since 2020. |
unichem_tsv_to_edges_json.py | KG2.8.3: 57M; Current: 80M | The number of edges increased with the version update. One problem, however, is that the report script used to pick up the source node but no longer does. It's not vital, since there's no version in it anymore, but it seems worth checking out. |
uniprotkb_dat_to_json.py | KG2.8.3: 173M; Current: 128M | The predicate biolink:causes was added. Overall, the number of nodes and edges increased. This corresponds with the upgrade from UniProtKB v2022_05 to UniProtKB v2023_02. |
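The "identical in everything except _report_datetime" comparison above could be scripted along these lines. This is only a sketch: it assumes the reports are JSON objects, and the key `number_of_edges_by_predicate` in the usage example is a hypothetical stand-in for whatever the report actually contains; only `_report_datetime` comes from the table.

```python
import json

# Keys that are expected to differ between builds and should be skipped.
IGNORED_KEYS = {"_report_datetime"}


def diff_reports(old, new, ignored=IGNORED_KEYS):
    """Return {path: (old_value, new_value)} for every differing leaf value.

    Nested dictionaries are walked recursively; a key present in only one
    report shows up with None on the other side.
    """
    diffs = {}

    def walk(a, b, path):
        if isinstance(a, dict) and isinstance(b, dict):
            for key in sorted(set(a) | set(b)):
                if key in ignored:
                    continue
                walk(a.get(key), b.get(key), path + (str(key),))
        elif a != b:
            diffs["/".join(path)] = (a, b)

    walk(old, new, ())
    return diffs


def diff_report_files(old_path, new_path):
    """Load two JSON report files and diff them."""
    with open(old_path) as old_file, open(new_path) as new_file:
        return diff_reports(json.load(old_file), json.load(new_file))
```

For example, `diff_reports({"_report_datetime": "a", "number_of_edges_by_predicate": {"p": 5}}, {"_report_datetime": "b", "number_of_edges_by_predicate": {"p": 4}})` reports only the change in `p`; an empty result would correspond to the "identical" rows in the table.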
I am closing this issue because the code worked in the KG2.8.4pre build.
Inspired by https://github.com/RTXteam/RTX-KG2/issues/291#issuecomment-1604448179 and @acevedol's observation that the conversion scripts also fail, it seems like it is also important to look at the state of conversion scripts:
Waiting for #295 to check. This seems to be working as expected.
Waiting for #294 to check. Fails, see #310.
Waiting for #293 to check. This seems to be working as expected.