Closed saramsey closed 4 years ago
Report from @ericawood:
Hi Steve, After more digging, here is what I have discovered:
- As far as I can tell, the JSON files have the same content
- The go-plus.owl files are the same every time
- The go-plus.json and umls-go.json files are in a different order each time
- The resulting KG JSON files appear to have the exact same edges (though not necessarily in the same order)
- The resulting KG JSON files appear to have the same nodes, though the synonym fields are in different orders, which triggers my JSON equivalency validation code to assume that each file is missing a node that is actually the same node. The wording here is very difficult, so here is a screenshot:
Here's a snapshot of the diff log between go-plus.json files that were generated by the same code one right after the other:
Here's a chunk of the two go-plus.json files side by side:
@ericawood I think I've fixed the synonym issue with f84eecc
As for the ordering of the JSON files generated by ROBOT from TTL files, more investigation required...
@ericawood can you test to see if the synonym ordering is now consistent?
Ah, I think I need to make a second tweak to multi_ont_to_json_kg.py; see ac9a304
@saramsey I ran the test to see if synonym ordering is now consistent. I found that the publications list is also not consistent:
The synonyms issue appears to be resolved though.
@ericawood for the publications issue, can you please see if commits f128541 and 329e973 fix the issue?
@saramsey The publications issue was fixed. However, there is another issue. For multiple CHEBI nodes, two nodes with the same ID have different categories (and category labels) depending on the file. Here is an example:
Hi @ericawood, thanks for bringing the CHEBI example to my attention. I wonder if changing the order of ont-load-inventory.yaml so that chebi.owl is loaded before go-plus.owl would solve this issue. I would like to test this idea. Can you please describe for me exactly how to reproduce the CHEBI problem? i.e., which scripts I should run and which files I should inspect to see the problem? Thanks.
Hi @saramsey
Here are the steps:
1. cd ~
2. mkdir SnakemakeDebugging
   <- This is just a place to put the log files so that you can easily find them; feel free to name it something different.
3. cd SnakemakeDebugging
4. ~/kg2-venv/bin/python3 -u ~/kg2-code/multi_ont_to_json_kg.py --test ~/kg2-code/curies-to-categories.yaml ~/kg2-code/curies-to-urls-map.yaml ~/kg2-code/ont-load-inventory-test-go.yaml ~/kg2-build/kg2-ont-go1.json >~/kg2-build/build-kg2-ont-go-stderr8-11--1.log 2>&1
   <- A few notes here:
   (1) ~/kg2-build/kg2-ont-go1.json is the output file that you will be comparing (the 1 refers to it being the first one). If you choose to name it something different, make sure that it is easy to differentiate output files 1 and 2.
   (2) This is going to take a while. Expect it to take upwards of an hour.
   (3) In ~/kg2-build/build-kg2-ont-go-stderr8-11--1.log, the 8-11 refers to the date and the --1 refers to this being the first run. I like to keep the output from each run-through separate.
   (4) You might also choose to put kg2-ont-go1.json and build-kg2-ont-go-stderr8-11--1.log in SnakemakeDebugging. It is your choice; it won't break anything.
5. ~/kg2-venv/bin/python3 -u ~/kg2-code/multi_ont_to_json_kg.py --test ~/kg2-code/curies-to-categories.yaml ~/kg2-code/curies-to-urls-map.yaml ~/kg2-code/ont-load-inventory-test-go.yaml ~/kg2-build/kg2-ont-go2.json >~/kg2-build/build-kg2-ont-go-stderr8-11--2.log 2>&1
   <- See notes from step 4 and adjust accordingly for "round" 2
6. Ask Erica for access to check_missing.py (my custom script to produce the output above) and put it in SnakemakeDebugging
7. ~/kg2-venv/bin/python3 -u check_missing.py ~/kg2-build/kg2-ont-go1.json ~/kg2-build/kg2-ont-go2.json > g1g2difference.log 2>&1
   <- ~/kg2-build/kg2-ont-go1.json refers to the first file that you generated. It will be referred to as "A" in the log file. ~/kg2-build/kg2-ont-go2.json refers to the second file that you generated. It will be referred to as "B" in the log file. g1g2difference.log is the file that the data you are looking for will be in.
8. Open two terminals side by side and open g1g2difference.log in the editor of your choice in both. In one terminal, search for "A MISSING NODES". In the other, search for "B MISSING NODES". Unlike diff, each output shows you a node that is present in one of the two output files and not the other, such that order doesn't matter. "A MISSING NODES" refers to nodes that are in file "A" but not in file "B". "B MISSING NODES" refers to nodes that are in file "B" but not in file "A".
Please let me know if you have any questions!
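For readers without access to check_missing.py, the kind of order-insensitive node comparison described in the steps above might look roughly like the sketch below. This is not Erica's script. The function names are hypothetical, and the assumption that each KG JSON file has a top-level "nodes" list is mine. List-valued fields such as synonym are sorted before comparison, so a pure reordering is not reported as a missing node.

```python
import json


def canonical(value):
    """Return a hashable form of a JSON value that ignores list order."""
    if isinstance(value, list):
        # Sort by repr so lists with mixed element types still sort.
        return tuple(sorted((canonical(v) for v in value), key=repr))
    if isinstance(value, dict):
        return tuple(sorted((k, canonical(v)) for k, v in value.items()))
    return value


def missing_nodes(nodes_a, nodes_b):
    """Nodes that are in A but have no order-insensitive match in B."""
    seen_b = {canonical(node) for node in nodes_b}
    return [node for node in nodes_a if canonical(node) not in seen_b]


def load_nodes(path):
    """Assumption: the KG JSON file has a top-level "nodes" list."""
    with open(path) as handle:
        return json.load(handle)["nodes"]


# Toy data: X:1 differs only in synonym order, X:2 differs in category.
nodes_a = [{"id": "X:1", "synonym": ["b", "a"]},
           {"id": "X:2", "category": "drug"}]
nodes_b = [{"id": "X:2", "category": "functional association"},
           {"id": "X:1", "synonym": ["a", "b"]}]
print(missing_nodes(nodes_a, nodes_b))  # only the X:2 node from A
```

Calling missing_nodes(nodes_b, nodes_a) gives the other direction, which corresponds to the "B MISSING NODES" section of the log.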
Thanks for the detailed instructions.
OK, small progress. I managed to make a simpler test-case. Using this file ont-load-inventory-issue999.yaml:
- # maps to CURIE prefix: biolink
  url: https://raw.githubusercontent.com/biolink/biolink-model/master/biolink-model.owl
  file: biolink-model.owl
  download: true
  title: Biolink meta-model
-
  url: http://purl.obolibrary.org/obo/chebi.owl
  file: chebi.owl
  download: true
  title: Chemical Entities of Biological Interest
- # maps to CURIE prefix: GO
  url: http://purl.obolibrary.org/obo/go/extensions/go-plus.owl
  file: go-plus.owl
  title: Gene Ontology
  download: true
running on my MBP,
python -u multi_ont_to_json_kg.py --test curies-to-categories.yaml curies-to-urls-map.yaml ont-load-inventory-issue999.yaml test999-a.json
python -u get_nodes_json_from_kg_json.py --test test999-a.json test999-a-nodes.json
python -u multi_ont_to_json_kg.py --test curies-to-categories.yaml curies-to-urls-map.yaml ont-load-inventory-issue999.yaml test999-b.json
python -u get_nodes_json_from_kg_json.py --test test999-b.json test999-b-nodes.json
diff test999-a-nodes.json test999-b-nodes.json
I'm seeing a huge list of differences:
---
> "category": "biolink:Metabolite",
> "category label": "metabolite",
2223900,2223901c2223900,2223901
< "category": "biolink:BiologicalEntity",
< "category label": "biological_entity",
---
> "category": "biolink:FunctionalAssociation",
> "category label": "functional_association",
2269022,2269023c2269022,2269023
< "category": "biolink:BiologicalEntity",
< "category label": "biological_entity",
---
> "category": "biolink:FunctionalAssociation",
> "category label": "functional_association",
2269056,2269057c2269056,2269057
< "category": "biolink:BiologicalEntity",
< "category label": "biological_entity",
---
> "category": "biolink:FunctionalAssociation",
> "category label": "functional_association",
2326161,2326162c2326161,2326162
< "category": "biolink:FunctionalAssociation",
< "category label": "functional_association",
---
> "category": "biolink:Drug",
> "category label": "drug",
2365338,2365339c2365338,2365339
< "category": "biolink:Drug",
< "category label": "drug",
---
> "category": "biolink:FunctionalAssociation",
> "category label": "functional_association",
2365459,2365460c2365459,2365460
< "category": "biolink:Drug",
< "category label": "drug",
---
> "category": "biolink:FunctionalAssociation",
> "category label": "functional_association",
2376926,2376927c2376926,2376927
< "category": "biolink:FunctionalAssociation",
< "category label": "functional_association",
---
> "category": "biolink:Drug",
> "category label": "drug",
2392376,2392377c2392376,2392377
< "category": "biolink:FunctionalAssociation",
< "category label": "functional_association",
---
> "category": "biolink:Drug",
> "category label": "drug",
Investigating CHEBI:75769, well lookee here, it has two parents in the ontology,
Notes:
- CHEBI:50733 is a direct descendant of CHEBI:23888, which is annotated as drug in curies-to-categories.yaml.
- CHEBI:27314 is a descendant of CHEBI:78295, which is annotated as biological entity in curies-to-categories.yaml.
Theory: the order of the items in the list for the outer loop is not consistent from run to run of multi_ont_to_json_kg.py, and it terminates on the first time through the loop due to the break on line 315:
Trying a fix now....
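To illustrate the theory above (this is not the code from multi_ont_to_json_kg.py, and not necessarily the fix that was committed): in Python, the iteration order of a set of strings can change between interpreter runs because of string hash randomization. A loop that stops at the first parent with a category can therefore pick a different category each run. A minimal sketch, with hypothetical function names and a toy mapping from each parent of CHEBI:75769 to the category it would inherit:

```python
# Hypothetical lookup: the category each direct parent would inherit
# from its annotated ancestor in curies-to-categories.yaml.
CATEGORY_BY_PARENT = {
    "CHEBI:50733": "drug",               # via CHEBI:23888
    "CHEBI:27314": "biological entity",  # via CHEBI:78295
}


def category_first_match(parents):
    """Return the category of the first parent that has one.

    If `parents` is a set, "first" depends on iteration order, which
    is not stable from one interpreter run to the next.
    """
    for parent in parents:
        category = CATEGORY_BY_PARENT.get(parent)
        if category is not None:
            return category  # plays the role of the `break`
    return None


def category_deterministic(parents):
    """Same rule, but iterate in sorted order so runs agree."""
    return category_first_match(sorted(parents))


parents = {"CHEBI:50733", "CHEBI:27314"}
print(category_first_match(parents))    # may vary between runs
print(category_deterministic(parents))  # biological entity
```

Sorting only makes the result reproducible. Which parent should win when two ancestors carry different categories is a separate modeling decision.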
Hi @ericawood can I close this issue out?
Hi @saramsey, can't quite close this out. I compared the Snakemake and Sequential builds to check that the two build systems line up. Rather than diff-ing the two JSON files, I decided to diff the two edges.tsv and two nodes.tsv files, as they are separated line by line rather than all on one line. Both were using the same version of the code (I didn't git pull in between builds) and when I ran git status, my local copy was up to date with the exception of a couple of changes I made to version.sh and build-kg2.sh to prevent the code from updating to the S3 bucket -- I'm not totally sure how that works because I pulled on Friday. Here is some of the output:
For nodes_header.tsv
category :LABEL creation_date deprecated description full_name id:ID iri name provided_by publications:string[] replaced_by synonym:string[] update_date category_label
nodes.tsv diff (snippet of 596M file):
6794143c6794143
< biolink:Gene gene False A protein coding gene LPPR1 in human. // COMMENTS: Category=external.; Type:protein-coding; Locus:9q31.1; NameStatus:official phospholipid phosphatase related 1 NCBIGene:54886 https://identifiers.org/ncbigene:54886 LPPR1 (human) OBO:pr.owl plasticity-related gene 3 protein; LPPR1; PLPPR1; PRG-3; lipid phosphate phosphatase-related protein type 1; phospholipid phosphatase-related protein type 1; plasticity related gene 3 2020-08-03 06:18:13 GMT gene^M
---
> biolink:Gene gene False A protein coding gene LPPR1 in human. // COMMENTS: Category=external.; Type:protein-coding; Locus:9q31.1; NameStatus:official phospholipid phosphatase related 1 NCBIGene:54886 https://identifiers.org/ncbigene:54886 LPPR1 (human) OBO:pr.owl PRG-3; LPPR1; PLPPR1; lipid phosphate phosphatase-related protein type 1; phospholipid phosphatase-related protein type 1; plasticity related gene 3; plasticity-related gene 3 protein 2020-08-03 06:18:13 GMT gene^M
6796047c6796047
< biolink:Gene gene False A protein coding gene LPPR4 in human. // COMMENTS: Category=external.; Type:protein-coding; Locus:1p21.3-p21.2; NameStatus:official phospholipid phosphatase related 4 NCBIGene:9890 https://identifiers.org/ncbigene:9890 LPPR4 (human) OBO:pr.owl plasticity related gene 1; LPPR4; LPR4; PHP1; PLPPR4; PRG-1; PRG1; brain-specific phosphatidic acid phosphatase-like protein 1; lipid phosphate phosphatase-related protein type 4; phospholipid phosphatase-related protein type 4; plasticity-related gene 1 protein 2020-08-03 06:18:13 GMT gene^M
---
> biolink:Gene gene False A protein coding gene LPPR4 in human. // COMMENTS: Category=external.; Type:protein-coding; Locus:1p21.3-p21.2; NameStatus:official phospholipid phosphatase related 4 NCBIGene:9890 https://identifiers.org/ncbigene:9890 LPPR4 (human) OBO:pr.owl PHP1; LPPR4; LPR4; PLPPR4; PRG-1; PRG1; brain-specific phosphatidic acid phosphatase-like protein 1; lipid phosphate phosphatase-related protein type 4; phospholipid phosphatase-related protein type 4; plasticity related gene 1; plasticity-related gene 1 protein 2020-08-03 06:18:13 GMT gene^M
For edges_header.tsv
edge_label negated :END_ID provided_by:string[] publications:string[] publications_info relation edge_label:TYPE simplified_relation :START_ID update_date simplified_edge_label subject object
edges.tsv diff (snippet of 32G file):
11143a11143
> xref False UMLS:C3703015 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:27390 Sat Aug 1 00:43:02 2020 close_match CPT:27390 UMLS:C3703015^M
11255d11254
< xref False UMLS:C3516672 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:88381 Sat Aug 1 00:43:02 2020 close_match CPT:88381 UMLS:C3516672^M
11256a11256
> xref False UMLS:C3516672 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:88381 Sat Aug 1 00:43:02 2020 close_match CPT:88381 UMLS:C3516672^M
11498d11497
< xref False UMLS:C3702541 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:0339T Sat Aug 1 00:43:02 2020 close_match CPT:0339T UMLS:C3702541^M
11500c11499
< xref False UMLS:C3517570 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:0206T Sat Aug 1 00:43:02 2020 close_match CPT:0206T UMLS:C3517570^M
---
> xref False UMLS:C3702541 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:0339T Sat Aug 1 00:43:02 2020 close_match CPT:0339T UMLS:C3702541^M
11501a11501
> xref False UMLS:C3517570 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:0206T Sat Aug 1 00:43:02 2020 close_match CPT:0206T UMLS:C3517570^M
11520d11519
< xref False UMLS:C3865608 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:27370 Sat Aug 1 00:43:02 2020 close_match CPT:27370 UMLS:C3865608^M
11521a11521
> xref False UMLS:C3865608 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:27370 Sat Aug 1 00:43:02 2020 close_match CPT:27370 UMLS:C3865608^M
11564d11563
< xref False UMLS:C0374537 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:99058 Sat Aug 1 00:43:02 2020 close_match CPT:99058 UMLS:C0374537^M
11565a11565
> xref False UMLS:C0374537 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:99058 Sat Aug 1 00:43:02 2020 close_match CPT:99058 UMLS:C0374537^M
11588d11587
< xref False UMLS:C3517573 umls_source:CPT {} oboFormat:xref close_match skos:closeMatch CPT:0209T Sat Aug 1 00:43:02 2020 close_match CPT:0209T UMLS:C3517573^M
I think that the nodes issue is part of this issue. I am unsure about the edges issue, since I don't remember that being significant before (but I also wasn't running diff; I was running a script that compared each element with the other dictionary to see if it was present anywhere).
Viewing the bottom of the nodes.tsv diff, I suspect this is also an issue with the NCBIGene ETL:
9135161,9135163c9135161,9135163
< biolink:Gene gene False CD209 promoter region; Type:biological-region; Locus:19p; NameStatus:unofficial CD209 promoter region NCBIGene:117307477 https://identifiers.org/ncbigene:117307477 CD209 promoter region identifiers_org_registry:ncbigene LOC117307477; DC-SIGN promoter 20200801 gene^M
< biolink:Gene gene False CLEC4M promoter region; Type:biological-region; Locus:19p; NameStatus:unofficial CLEC4M promoter region NCBIGene:117307478 https://identifiers.org/ncbigene:117307478 CLEC4M promoter region identifiers_org_registry:ncbigene LOC117307478; DC-SIGNR promoter 20200801 gene^M
< biolink:Gene gene False Type:pseudo; Locus:2q21.2; NameStatus:official CDRT15 pseudogene 4 NCBIGene:117314529 https://identifiers.org/ncbigene:117314529 CDRT15 pseudogene 4 identifiers_org_registry:ncbigene CDRT15P4; CMT1A duplicated region transcript 15 pseudogene 4 20200801 gene^M
---
> biolink:Gene gene False CD209 promoter region; Type:biological-region; Locus:19p; NameStatus:unofficial CD209 promoter region NCBIGene:117307477 https://identifiers.org/ncbigene:117307477 CD209 promoter region identifiers_org_registry:ncbigene DC-SIGN promoter; LOC117307477 20200801 gene^M
> biolink:Gene gene False CLEC4M promoter region; Type:biological-region; Locus:19p; NameStatus:unofficial CLEC4M promoter region NCBIGene:117307478 https://identifiers.org/ncbigene:117307478 CLEC4M promoter region identifiers_org_registry:ncbigene DC-SIGNR promoter; LOC117307478 20200801 gene^M
> biolink:Gene gene False Type:pseudo; Locus:2q21.2; NameStatus:official CDRT15 pseudogene 4 NCBIGene:117314529 https://identifiers.org/ncbigene:117314529 CDRT15 pseudogene 4 identifiers_org_registry:ncbigene CMT1A duplicated region transcript 15 pseudogene 4; CDRT15P4 20200801 gene^M
9135195c9135195
< biolink:Gene gene False Type:pseudo; Locus:Yq11.222; NameStatus:official elongin C pseudogene 26 NCBIGene:118097967 https://identifiers.org/ncbigene:118097967 elongin C pseudogene 26 identifiers_org_registry:ncbigene TCEB1P26; transcription elongation factor B (SIII), polypeptide 1 pseudogene 26; ELOCP26; transcription elongation factor B subunit 1 pseudogene 26; ELOC26 20200627 gene^M
---
> biolink:Gene gene False Type:pseudo; Locus:Yq11.222; NameStatus:official elongin C pseudogene 26 NCBIGene:118097967 https://identifiers.org/ncbigene:118097967 elongin C pseudogene 26 identifiers_org_registry:ncbigene transcription elongation factor B subunit 1 pseudogene 26; ELOC26; ELOCP26; TCEB1P26; transcription elongation factor B (SIII), polypeptide 1 pseudogene 26 20200627 gene^M
9135197c9135197
< biolink:Gene gene False GUCA1ANB-GUCA1A readthrough; Type:protein-coding; NameStatus:unofficial GUCA1ANB-GUCA1A readthrough NCBIGene:118142757 https://identifiers.org/ncbigene:118142757 GUCA1ANB-GUCA1A readthrough identifiers_org_registry:ncbigene GUCA1A; GUCA1ANB-GUCA1A; guanylyl cyclase-activating protein 1 20200722 gene^M
---
> biolink:Gene gene False GUCA1ANB-GUCA1A readthrough; Type:protein-coding; NameStatus:unofficial GUCA1ANB-GUCA1A readthrough NCBIGene:118142757 https://identifiers.org/ncbigene:118142757 GUCA1ANB-GUCA1A readthrough identifiers_org_registry:ncbigene guanylyl cyclase-activating protein 1; GUCA1ANB-GUCA1A; GUCA1A 20200722 gene^M
Here are some from ChEMBL:
7885691,7885694c7885691,7885694
< biolink:ChemicalSubstance chemical_substance False SID14735976; FULL_MW:265.40; MAX_FDA_APPROVAL_PHASE: 0 SID14735976 CHEMBL.COMPOUND:CHEMBL1368645 https://identifiers.org/chembl.compound:CHEMBL1368645 SID14735976 identifiers_org_registry:chembl InChI=1S/C16H27NO2/c1-13-11-14(2)16(15(3)12-13)19-10-6-4-5-7-17-8-9-18/h11-12,17-18H,4-10H2,1-3H3; SEJLAPHMCNYZHT-UHFFFAOYSA-N; Cc1cc(C)c(OCCCCCNCCO)c(C)c1; PUBCHEM_BIOASSAY:14735976; SID14735976 2018-12-10 chemical_substance^M
< biolink:ChemicalSubstance chemical_substance False SID24389608; FULL_MW:416.52; MAX_FDA_APPROVAL_PHASE: 0 SID24389608 CHEMBL.COMPOUND:CHEMBL1368646 https://identifiers.org/chembl.compound:CHEMBL1368646 SID24389608 identifiers_org_registry:chembl InChI=1S/C26H28N2O3/c1-18-7-12-24-21(17-31-26(24)19(18)2)14-25(29)28(16-23-6-5-13-30-23)15-20-8-10-22(11-9-20)27(3)4/h5-13,17H,14-16H2,1-4H3; YGUZVOHOFOCSDH-UHFFFAOYSA-N; CN(C)c1ccc(CN(Cc2occc2)C(=O)Cc3coc4c(C)c(C)ccc34)cc1; PUBCHEM_BIOASSAY:24389608; SID24389608 2018-12-10 chemical_substance^M
< biolink:ChemicalSubstance chemical_substance False SID22414438; FULL_MW:328.41; MAX_FDA_APPROVAL_PHASE: 0 SID22414438 CHEMBL.COMPOUND:CHEMBL1368647 https://identifiers.org/chembl.compound:CHEMBL1368647 SID22414438 identifiers_org_registry:chembl InChI=1S/C19H24N2O3/c1-19(2,3)16-12-17(21-24-16)20-18(22)13-8-10-15(11-9-13)23-14-6-4-5-7-14/h8-12,14H,4-7H2,1-3H3,(H,20,21,22); TUGIUFAVHHMDBA-UHFFFAOYSA-N; CC(C)(C)c1onc(NC(=O)c2ccc(OC3CCCC3)cc2)c1; PUBCHEM_BIOASSAY:22414438; SID22414438 2018-12-10 chemical_substance^M
< biolink:ChemicalSubstance chemical_substance False SID24817338; FULL_MW:267.24; MAX_FDA_APPROVAL_PHASE: 0 SID24817338 CHEMBL.COMPOUND:CHEMBL1368649 https://identifiers.org/chembl.compound:CHEMBL1368649 SID24817338 identifiers_org_registry:chembl InChI=1S/C14H9N3O3/c18-17(19)10-5-7-11(8-6-10)20-14-9-15-12-3-1-2-4-13(12)16-14/h1-9H; RZNQBDUFQHFIRM-UHFFFAOYSA-N; O-N+(=O)c1ccc(Oc2cnc3ccccc3n2)cc1; PUBCHEM_BIOASSAY:24817338; SID24817338 2018-12-10 chemical_substance^M
---
> biolink:ChemicalSubstance chemical_substance False SID14735976; FULL_MW:265.40; MAX_FDA_APPROVAL_PHASE: 0 SID14735976 CHEMBL.COMPOUND:CHEMBL1368645 https://identifiers.org/chembl.compound:CHEMBL1368645 SID14735976 identifiers_org_registry:chembl InChI=1S/C16H27NO2/c1-13-11-14(2)16(15(3)12-13)19-10-6-4-5-7-17-8-9-18/h11-12,17-18H,4-10H2,1-3H3; SEJLAPHMCNYZHT-UHFFFAOYSA-N; Cc1cc(C)c(OCCCCCNCCO)c(C)c1; SID14735976; PUBCHEM_BIOASSAY:14735976 2018-12-10 chemical_substance^M
> biolink:ChemicalSubstance chemical_substance False SID24389608; FULL_MW:416.52; MAX_FDA_APPROVAL_PHASE: 0 SID24389608 CHEMBL.COMPOUND:CHEMBL1368646 https://identifiers.org/chembl.compound:CHEMBL1368646 SID24389608 identifiers_org_registry:chembl InChI=1S/C26H28N2O3/c1-18-7-12-24-21(17-31-26(24)19(18)2)14-25(29)28(16-23-6-5-13-30-23)15-20-8-10-22(11-9-20)27(3)4/h5-13,17H,14-16H2,1-4H3; YGUZVOHOFOCSDH-UHFFFAOYSA-N; CN(C)c1ccc(CN(Cc2occc2)C(=O)Cc3coc4c(C)c(C)ccc34)cc1; SID24389608; PUBCHEM_BIOASSAY:24389608 2018-12-10 chemical_substance^M
> biolink:ChemicalSubstance chemical_substance False SID22414438; FULL_MW:328.41; MAX_FDA_APPROVAL_PHASE: 0 SID22414438 CHEMBL.COMPOUND:CHEMBL1368647 https://identifiers.org/chembl.compound:CHEMBL1368647 SID22414438 identifiers_org_registry:chembl InChI=1S/C19H24N2O3/c1-19(2,3)16-12-17(21-24-16)20-18(22)13-8-10-15(11-9-13)23-14-6-4-5-7-14/h8-12,14H,4-7H2,1-3H3,(H,20,21,22); TUGIUFAVHHMDBA-UHFFFAOYSA-N; CC(C)(C)c1onc(NC(=O)c2ccc(OC3CCCC3)cc2)c1; SID22414438; PUBCHEM_BIOASSAY:22414438 2018-12-10 chemical_substance^M
> biolink:ChemicalSubstance chemical_substance False SID24817338; FULL_MW:267.24; MAX_FDA_APPROVAL_PHASE: 0 SID24817338 CHEMBL.COMPOUND:CHEMBL1368649 https://identifiers.org/chembl.compound:CHEMBL1368649 SID24817338 identifiers_org_registry:chembl InChI=1S/C14H9N3O3/c18-17(19)10-5-7-11(8-6-10)20-14-9-15-12-3-1-2-4-13(12)16-14/h1-9H; RZNQBDUFQHFIRM-UHFFFAOYSA-N; O-N+(=O)c1ccc(Oc2cnc3ccccc3n2)cc1; SID24817338; PUBCHEM_BIOASSAY:24817338 2018-12-10 chemical_substance^M
Here are some from Ensembl:
7112989c7112989
< biolink:Gene gene False chromosome X open reading frame 51A [Source:HGNC Symbol;Acc:HGNC:30533] ENSEMBL:ENSG00000224440 https://identifiers.org/ensembl:ENSG00000224440 chromosome X open reading frame 51A [Source:HGNC Symbol;Acc:HGNC:30533] identifiers_org_registry:ensembl CXorf51A; HGNC:30533; 100129239; A0A1B0GTR3 2019-03 gene^M
---
> biolink:Gene gene False chromosome X open reading frame 51A [Source:HGNC Symbol;Acc:HGNC:30533] ENSEMBL:ENSG00000224440 https://identifiers.org/ensembl:ENSG00000224440 chromosome X open reading frame 51A [Source:HGNC Symbol;Acc:HGNC:30533] identifiers_org_registry:ensembl CXorf51A; A0A1B0GTR3; 100129239; HGNC:30533 2019-03 gene^M
7112996c7112996
< biolink:Gene gene False Abelson helper integration site 1 [Source:HGNC Symbol;Acc:HGNC:21575] ENSEMBL:ENSG00000135541 https://identifiers.org/ensembl:ENSG00000135541 Abelson helper integration site 1 [Source:HGNC Symbol;Acc:HGNC:21575] identifiers_org_registry:ensembl AHI1; E9PML3; R-HSA-5620912; Q8N157; 608894; R-HSA-1852241; R-HSA-5617833; 54806; E9PI51; HGNC:21575; 608629; Q9NQN3 2019-03 gene^M
---
> biolink:Gene gene False Abelson helper integration site 1 [Source:HGNC Symbol;Acc:HGNC:21575] ENSEMBL:ENSG00000135541 https://identifiers.org/ensembl:ENSG00000135541 Abelson helper integration site 1 [Source:HGNC Symbol;Acc:HGNC:21575] identifiers_org_registry:ensembl AHI1; Q9NQN3; R-HSA-5620912; R-HSA-5617833; 54806; Q8N157; 608894; HGNC:21575; E9PML3; R-HSA-1852241; 608629; E9PI51 2019-03 gene^M
7112999c7112999
< biolink:Gene gene False RNA, U1 small nuclear 52, pseudogene [Source:HGNC Symbol;Acc:HGNC:48394] ENSEMBL:ENSG00000206917 https://identifiers.org/ensembl:ENSG00000206917 RNA, U1 small nuclear 52, pseudogene [Source:HGNC Symbol;Acc:HGNC:48394] identifiers_org_registry:ensembl RNU1-52P; HGNC:48394; RF00003 2019-03 gene^M
---
> biolink:Gene gene False RNA, U1 small nuclear 52, pseudogene [Source:HGNC Symbol;Acc:HGNC:48394] ENSEMBL:ENSG00000206917 https://identifiers.org/ensembl:ENSG00000206917 RNA, U1 small nuclear 52, pseudogene [Source:HGNC Symbol;Acc:HGNC:48394] identifiers_org_registry:ensembl RNU1-52P; RF00003; HGNC:48394 2019-03 gene^M
Here are some from UniProtKB:
7112155,7112156c7112155,7112156
< biolink:Protein protein 31-OCT-2006 False -!- CAUTION: Product of a dubious gene prediction. {ECO:0000305}. Putative uncharacterized protein LOC642776 UniProtKB:Q9BTK2 https://identifiers.org/uniprot:Q9BTK2 Putative uncharacterized protein LOC642776 identifiers_org_registry:uniprot PMID:16710414; PMID:15489334; DOI:10.1101/gr.2596504; DOI:10.1038/nature04727 SEQUENCE 45 AA; 4916 MW; 6A3D21D5765D6950 CRC64;\nMEGGMAAYPV ATRESRCRRG RIGVQPSPER RSEVVGPFPL ARSLS\n 11-DEC-2019 protein^M
< biolink:Protein protein 26-FEB-2008 False Putative uncharacterized protein FLJ39060 UniProtKB:Q8N8P6 https://identifiers.org/uniprot:Q8N8P6 Putative uncharacterized protein FLJ39060 identifiers_org_registry:uniprot DOI:10.1038/ng1285; PMID:14702039; PMID:15772651; DOI:10.1038/nature03440 SEQUENCE 123 AA; 13896 MW; E5DED0143EDCD905 CRC64;\nMVDGRTRTII NDIFFTEPTP EMSSLPVRSH SSLSLNLVSL MVICRGIIKL VIHFRMYCPP\nRLKAKHIEPT LRPVPLKELR ISHWPNECIR HSASVPMATG ANGLETKDET KRNAEKCACS\nVFL\n 12-AUG-2020 protein^M
---
> biolink:Protein protein 31-OCT-2006 False -!- CAUTION: Product of a dubious gene prediction. {ECO:0000305}. Putative uncharacterized protein LOC642776 UniProtKB:Q9BTK2 https://identifiers.org/uniprot:Q9BTK2 Putative uncharacterized protein LOC642776 identifiers_org_registry:uniprot PMID:16710414; PMID:15489334; DOI:10.1038/nature04727; DOI:10.1101/gr.2596504 SEQUENCE 45 AA; 4916 MW; 6A3D21D5765D6950 CRC64;\nMEGGMAAYPV ATRESRCRRG RIGVQPSPER RSEVVGPFPL ARSLS\n 11-DEC-2019 protein^M
> biolink:Protein protein 26-FEB-2008 False Putative uncharacterized protein FLJ39060 UniProtKB:Q8N8P6 https://identifiers.org/uniprot:Q8N8P6 Putative uncharacterized protein FLJ39060 identifiers_org_registry:uniprot DOI:10.1038/nature03440; PMID:14702039; PMID:15772651; DOI:10.1038/ng1285 SEQUENCE 123 AA; 13896 MW; E5DED0143EDCD905 CRC64;\nMVDGRTRTII NDIFFTEPTP EMSSLPVRSH SSLSLNLVSL MVICRGIIKL VIHFRMYCPP\nRLKAKHIEPT LRPVPLKELR ISHWPNECIR HSASVPMATG ANGLETKDET KRNAEKCACS\nVFL\n 12-AUG-2020 protein^M
7112364c7112364
< biolink:Protein protein 21-MAR-2012 False -!- ALTERNATIVE PRODUCTS:Event=Alternative splicing; Named isoforms=3;Name=2; Synonyms=ZHX1-C8orf76;IsoId=Q96EF9-1; Sequence=Displayed;Name=1;IsoId=Q9UKY1-1; Sequence=External;Name=3;IsoId=Q96K31-1; Sequence=External;-!- MISCELLANEOUS: [Isoform 2]: Based on a readthrough transcript which mayproduce a ZHX1-C8orf76 fusion protein. Zinc fingers and homeoboxes protein 1, isoform 2 UniProtKB:Q96EF9 https://identifiers.org/uniprot:Q96EF9 ZHX1-C8orf76 identifiers_org_registry:uniprot PMID:16421571; PMID:15489334; DOI:10.1101/gr.2596504; DOI:10.1038/nature04406 ZHX1-C8orf76; SEQUENCE 292 AA; 33285 MW; 3F5E55E9C6F23D93 CRC64;\nMLRKLWQWFY EETESSDDVE VLTLKKFKGD LAYRRQEYQK ALQEYSSISE KLSSTNFAMK\nRDVQEGQARC LAHLGRHMEA LEIAANLENK ATNTDHLTTV LYLQLAICSS LQNLEKTIFC\nLQKLISLHPF NPWNWGKLAE AYLNLGPALS AALASSQKQH SFTSSDKTIK SFFPHSGKDC\nLLCFPETLPE SSLFSVEANS SNSQKNEKAL TNIQNCMAEK RETVLIETQL KACASFIRTR\nLLLQFTQPQQ TSFALERNLR TQQEIEDKMK GFSFKEDTLL LIAEVSVSLG FM\n; ZHX1-C8orf76 readthrough transcript protein 12-AUG-2020 protein^M
---
> biolink:Protein protein 21-MAR-2012 False -!- ALTERNATIVE PRODUCTS:Event=Alternative splicing; Named isoforms=3;Name=2; Synonyms=ZHX1-C8orf76;IsoId=Q96EF9-1; Sequence=Displayed;Name=1;IsoId=Q9UKY1-1; Sequence=External;Name=3;IsoId=Q96K31-1; Sequence=External;-!- MISCELLANEOUS: [Isoform 2]: Based on a readthrough transcript which mayproduce a ZHX1-C8orf76 fusion protein. Zinc fingers and homeoboxes protein 1, isoform 2 UniProtKB:Q96EF9 https://identifiers.org/uniprot:Q96EF9 ZHX1-C8orf76 identifiers_org_registry:uniprot PMID:15489334; PMID:16421571; DOI:10.1101/gr.2596504; DOI:10.1038/nature04406 ZHX1-C8orf76; SEQUENCE 292 AA; 33285 MW; 3F5E55E9C6F23D93 CRC64;\nMLRKLWQWFY EETESSDDVE VLTLKKFKGD LAYRRQEYQK ALQEYSSISE KLSSTNFAMK\nRDVQEGQARC LAHLGRHMEA LEIAANLENK ATNTDHLTTV LYLQLAICSS LQNLEKTIFC\nLQKLISLHPF NPWNWGKLAE AYLNLGPALS AALASSQKQH SFTSSDKTIK SFFPHSGKDC\nLLCFPETLPE SSLFSVEANS SNSQKNEKAL TNIQNCMAEK RETVLIETQL KACASFIRTR\nLLLQFTQPQQ TSFALERNLR TQQEIEDKMK GFSFKEDTLL LIAEVSVSLG FM\n; ZHX1-C8orf76 readthrough transcript protein 12-AUG-2020 protein^M
FYI, the report files for the two builds were identical (with the exception of the report generated time).
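One way to tell whether two TSV files differ only in row order (as the edges.tsv snippet above suggests, where the same xref rows show up as deleted in one place and added in another) is to compare them as multisets of lines. The following is a minimal sketch with hypothetical function names, not part of the KG2 code. It holds every line in memory, so for a file the size of the 32G edges.tsv, sorting both files externally and then running diff on the sorted copies would be more realistic.

```python
from collections import Counter


def read_rows(path):
    """Read a TSV file as a list of lines, dropping trailing CR/LF."""
    with open(path, newline="") as handle:
        return [line.rstrip("\r\n") for line in handle]


def unordered_diff(rows_a, rows_b):
    """Return (rows only in A, rows only in B), ignoring row order.

    Counter subtraction keeps duplicates honest: a row that appears
    twice in A and once in B is reported once as "only in A".
    """
    count_a, count_b = Counter(rows_a), Counter(rows_b)
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


# Toy rows standing in for edges.tsv lines (not real build output).
sequential = ["CPT:27390\tUMLS:C3703015", "CPT:88381\tUMLS:C3516672"]
snakemake = ["CPT:88381\tUMLS:C3516672", "CPT:27390\tUMLS:C3703015"]
print(unordered_diff(sequential, snakemake))  # ([], []): same rows, different order
```

If both returned lists are empty, the two files contain the same rows and the diff output reflects ordering only. This check works on whole rows, so it does not catch reordering inside a field such as synonym.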
Excellent detective work.
OK, so it looks like the synonym list ordering is variable from run to run, is that correct?
Yes, that was my conclusion as well.
Good idea to diff the nodes.tsv files. Which direction is the diff? i.e., is < the sequential build or is > the sequential build?
On kg2-steve in kg2-build:
Edges:
diff SequentialBuild/edges.tsv SnakemakeBuild/edges.tsv > edges_tsv_diff.log
Nodes:
diff SequentialBuild/nodes.tsv SnakemakeBuild/nodes.tsv > nodes_tsv_diff.log
Thanks, this is well organized.
For the moment, I'm focusing on the nodes issue.
Did you mean ParallelBuild instead of SnakemakeBuild?
Can you confirm -- you did a test build, is that correct?
Hi @saramsey,
1) The data is in SnakemakeBuild on kg2steve:
2) No, this was a partial build without test flags
Sorry, I was looking on kg2dev. My bad.
Switching over to kg2steve now....
On kg2steve, I am running
python ~/kg2-code/get_nodes_json_from_kg_json.py --test kg2-simplified.json kg2-simplified-nodes.json
so I can have a look at the NCBIGene:54886 node
I posit that this (and analogous code in other modules) in ncbigene_tsv_to_kg_json.py may be our smoking gun:
For NCBIGene:54886, the ordering of synonym entries in the file /home/ubuntu/kg2-build/SequentialBuild/kg2-simplified-nodes.json on kg2steve.rtx.ai is clearly incorrect (official gene symbol PLPPR1 is not first):
I propose the following remedy to ncbigene_tsv_to_kg_json.py:
I note that /home/ubuntu/kg2-build/kg2-ncbigene.json is also incorrect in the same way:
This increases my suspicion that the problem is in ncbigene_tsv_to_kg_json.py.
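The actual patch is not reproduced in this thread. As a hedged illustration of the general idea only (keep the official gene symbol first, and put the remaining synonyms in a fixed order so the list does not change between runs), something like the following would work. The function name and signature are hypothetical and are not taken from ncbigene_tsv_to_kg_json.py.

```python
def ordered_synonyms(official_symbol, other_synonyms):
    """Build a reproducible synonym list with the official symbol first.

    Duplicates are removed via a set, then the remainder is sorted so
    that the set's run-to-run iteration order cannot leak into the output.
    """
    rest = sorted(set(other_synonyms) - {official_symbol})
    if official_symbol is None:
        return rest
    return [official_symbol] + rest


# Toy input based on the NCBIGene:54886 example discussed above.
print(ordered_synonyms(
    "PLPPR1",
    ["PRG-3", "LPPR1", "PLPPR1", "plasticity related gene 3"],
))
# ['PLPPR1', 'LPPR1', 'PRG-3', 'plasticity related gene 3']
```

Python's default string sort puts uppercase letters before lowercase ones. That does not matter for reproducibility, but it does affect the order a reader sees.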
Looks like my proposed fix is helping:
Edges from UniProt:
30825334a30825332
> physically_interacts_with False CHEBI:57394 identifiers_org_registry:uniprot {} biolink:physically_interacts_with physically_interacts_with biolink:physically_interacts_with UniProtKB:Q9NXF8 12-AUG-2020 physically_interacts_with UniProtKB:Q9NXF8 CHEBI:57394^M
30825337c30825335,30825336
< physically_interacts_with False CHEBI:143199 identifiers_org_registry:uniprot {} biolink:physically_interacts_with physically_interacts_with biolink:physically_interacts_with UniProtKB:Q9NXF8 12-AUG-2020 physically_interacts_with UniProtKB:Q9NXF8 CHEBI:143199^M
---
> physically_interacts_with False CHEBI:74151 identifiers_org_registry:uniprot {} biolink:physically_interacts_with physically_interacts_with biolink:physically_interacts_with UniProtKB:Q9NXF8 12-AUG-2020 physically_interacts_with UniProtKB:Q9NXF8 CHEBI:74151^M
> physically_interacts_with False CHEBI:143200 identifiers_org_registry:uniprot {} biolink:physically_interacts_with physically_interacts_with biolink:physically_interacts_with UniProtKB:Q9NXF8 12-AUG-2020 physically_interacts_with UniProtKB:Q9NXF8 CHEBI:143200^M
30825338a30825338,30825339
> physically_interacts_with False CHEBI:143199 identifiers_org_registry:uniprot {} biolink:physically_interacts_with physically_interacts_with biolink:physically_interacts_with UniProtKB:Q9NXF8 12-AUG-2020 physically_interacts_with UniProtKB:Q9NXF8 CHEBI:143199^M
> physically_interacts_with False CHEBI:57287 identifiers_org_registry:uniprot {} biolink:physically_interacts_with physically_interacts_with biolink:physically_interacts_with UniProtKB:Q9ULC8 12-AUG-2020 physically_interacts_with UniProtKB:Q9ULC8 CHEBI:57287^M
Hi @ericawood just checking back on this... do you think the above commits helped?
Hi @saramsey, I still haven't tested yet (please see my Slack message). May I use kg2steve for testing this, as it has constant pickle files for the ontologies? My plan is to first do an alltest build using Snakemake. Then, a test build using build-kg2.sh. Following that, I will compare the two.
Hi @saramsey, This issue was fixed in the test build I ran between Snakemake and Sequential. Should I close it out?
Thank you to @ericawood for bringing this issue to my attention