RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

content of kg2-ont.json is changing from run to run #999

Closed saramsey closed 4 years ago

saramsey commented 4 years ago

Thank you to @ericawood for bringing this issue to my attention

saramsey commented 4 years ago

Report from @ericawood:

Hi Steve, After more digging, here is what I have discovered:

  • As far as I can tell, the JSON files have the same content
  • The go-plus.owl files are the same every time
  • The go-plus.json and umls-go.json files are in a different order each time
  • The resulting KG JSON files appear to have the exact same edges (though not necessarily in the same order)
  • The resulting KG JSON files appear to have the same nodes, though the synonym fields are in different orders, which triggers my JSON equivalency validation code to assume that each file is missing a node that is actually the same node. The wording here is very difficult, so here is a screenshot:

image
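
The last bullet describes a comparison that treats the same node as different when only the order of a list field changes. A minimal sketch of an order-insensitive comparison follows; the function name `canonical_node` and the example node are hypothetical, and it assumes each node is a plain JSON object:

```python
import json


def canonical_node(node: dict) -> str:
    """Serialize a node so that key order and list order no longer matter.

    Hypothetical helper: every list-valued field is sorted by the JSON
    text of its elements, then the whole node is dumped with sorted keys.
    """
    normalized = {
        key: sorted(value, key=lambda item: json.dumps(item, sort_keys=True))
        if isinstance(value, list)
        else value
        for key, value in node.items()
    }
    return json.dumps(normalized, sort_keys=True)


# Two copies of the same (made-up) node whose synonyms are in a different order.
node_a = {"id": "EX:0001", "synonym": ["beta", "alpha"]}
node_b = {"synonym": ["alpha", "beta"], "id": "EX:0001"}
print(canonical_node(node_a) == canonical_node(node_b))
```

Comparing the canonical strings instead of the raw dictionaries would avoid flagging a node as missing when only its synonym order changed.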

ecwood commented 4 years ago

Here's a snapshot of the diff log between go-plus.json files that were generated by the same code one right after the other: image

Here's a chunk of the two go-plus.json files side by side: image

saramsey commented 4 years ago

@ericawood I think I've fixed the synonym issue with f84eecc

As for the ordering of the JSON files generated by ROBOT from TTL files, more investigation required...
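
On the synonym issue: one common source of run-to-run ordering differences in Python is building a list by iterating over a `set` of strings, because string hashes are randomized per process unless `PYTHONHASHSEED` is fixed. Whether that is what f84eecc addresses is not shown in this thread; the sketch below only illustrates the general remedy, and `stable_unique` is a hypothetical name:

```python
def stable_unique(items):
    """Return the distinct items in sorted order.

    list(set(items)) can come out in a different order on each run;
    sorting makes the result the same on every run.
    """
    return sorted(set(items))


print(stable_unique(["PRG-3", "LPPR1", "PRG-3", "PLPPR1"]))
```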

saramsey commented 4 years ago

@ericawood can you test to see if the synonym ordering is now consistent?

saramsey commented 4 years ago

Ah, I think I need to make a second tweak to multi_ont_to_json_kg.py; see ac9a304

ecwood commented 4 years ago

@saramsey I ran the test to see if synonym ordering is now consistent. I found that the publications list is also not consistent: image

The synonyms issue appears to be resolved though.

saramsey commented 4 years ago

@ericawood for the publications issue, can you please see if commits f128541 and 329e973 fix the issue?

ecwood commented 4 years ago

@saramsey The publications issue was fixed. However, there is another issue. For multiple CHEBI nodes, two nodes with the same ID have different categories (and category labels) depending on the file. Here is an example:
image
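
A rough way to list such nodes programmatically, assuming each node is a dictionary with `id` and `category` keys (an assumption based on the field names visible in the output files); `category_mismatches` and the example IDs are hypothetical:

```python
def category_mismatches(nodes_a, nodes_b):
    """Map each shared node ID to (category in A, category in B) where they differ."""
    cats_a = {node["id"]: node["category"] for node in nodes_a}
    cats_b = {node["id"]: node["category"] for node in nodes_b}
    shared_ids = sorted(cats_a.keys() & cats_b.keys())
    return {
        node_id: (cats_a[node_id], cats_b[node_id])
        for node_id in shared_ids
        if cats_a[node_id] != cats_b[node_id]
    }


# Made-up nodes standing in for two runs of the same build.
run_1 = [{"id": "EX:1", "category": "biolink:Drug"},
         {"id": "EX:2", "category": "biolink:Gene"}]
run_2 = [{"id": "EX:2", "category": "biolink:Gene"},
         {"id": "EX:1", "category": "biolink:FunctionalAssociation"}]
print(category_mismatches(run_1, run_2))
```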

saramsey commented 4 years ago

Hi @ericawood thanks for bringing the CHEBI example to my attention. I wonder if changing the order of ont-load-inventory.yaml so that chebi.owl is loaded before go-plus.owl would solve this issue. I would like to test this idea. Can you please describe for me exactly how to reproduce the CHEBI problem? i.e., which scripts I should run and which files I should inspect to see the problem? Thanks.

ecwood commented 4 years ago

Hi @saramsey

> Can you please describe for me exactly how to reproduce the CHEBI problem? i.e., which scripts I should run and which files I should inspect to see the problem?

Here are the steps:

  1. cd ~

  2. mkdir SnakemakeDebugging <- This is just a place to put the log files so that you can easily find them, feel free to name it something different

  3. cd SnakemakeDebugging

  4. ~/kg2-venv/bin/python3 -u ~/kg2-code/multi_ont_to_json_kg.py --test ~/kg2-code/curies-to-categories.yaml ~/kg2-code/curies-to-urls-map.yaml ~/kg2-code/ont-load-inventory-test-go.yaml ~/kg2-build/kg2-ont-go1.json >~/kg2-build/build-kg2-ont-go-stderr8-11--1.log 2>&1 <- A few notes here: (1) ~/kg2-build/kg2-ont-go1.json is the output file that you will be comparing (the 1 refers to it being the first one). If you choose to name it something different, make sure that it is easy to differentiate output files 1 and 2. (2) This is going to take a while. Expect it to take upwards of an hour. (3) In ~/kg2-build/build-kg2-ont-go-stderr8-11--1.log the 8-11 refers to the date and the --1 refers to this being the first run. I like to keep the output from each run separate. (4) You might also choose to put kg2-ont-go1.json and build-kg2-ont-go-stderr8-11--1.log in SnakemakeDebugging. It is your choice; it won't break anything.

  5. ~/kg2-venv/bin/python3 -u ~/kg2-code/multi_ont_to_json_kg.py --test ~/kg2-code/curies-to-categories.yaml ~/kg2-code/curies-to-urls-map.yaml ~/kg2-code/ont-load-inventory-test-go.yaml ~/kg2-build/kg2-ont-go2.json >~/kg2-build/build-kg2-ont-go-stderr8-11--2.log 2>&1 <- See notes from step 4 and adjust accordingly for "round" 2

  6. Ask Erica for access to check_missing.py (my custom script to produce the output above) and put it in SnakemakeDebugging

  7. ~/kg2-venv/bin/python3 -u check_missing.py ~/kg2-build/kg2-ont-go1.json ~/kg2-build/kg2-ont-go2.json > g1g2difference.log 2>&1 <- ~/kg2-build/kg2-ont-go1.json refers to the first file that you generated. It will be referred to as "A" in the log file. ~/kg2-build/kg2-ont-go2.json refers to the second file that you generated. It will be referred to as "B" in the log file. g1g2difference.log is the file that the data you are looking for will be in.

  8. Open two terminals side by side and open g1g2difference.log in the editor of your choice in both. In one terminal, search for "A MISSING NODES". In the other, search for "B MISSING NODES". Unlike diff, each output shows you a node that is present in one of the two output files and not the other, such that order doesn't matter. "A MISSING NODES" refers to nodes that are in file "A" but not in file "B". "B MISSING NODES" refers to nodes that are in file "B" but not in file "A".

Please let me know if you have any questions!
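
check_missing.py itself is not shown in this thread. As a rough illustration of the kind of comparison step 8 describes (nodes present in one file but not the other, regardless of position in the file), here is a hypothetical sketch; the names `fingerprint` and `missing_nodes` are assumptions, and it takes already-loaded lists of node dictionaries rather than file paths:

```python
import json


def fingerprint(node):
    """Exact, key-order-independent text form of a node (list order still matters)."""
    return json.dumps(node, sort_keys=True)


def missing_nodes(nodes_a, nodes_b):
    """Return (nodes only in A, nodes only in B), ignoring where they sit in each file."""
    prints_a = {fingerprint(node) for node in nodes_a}
    prints_b = {fingerprint(node) for node in nodes_b}
    only_a = [json.loads(text) for text in sorted(prints_a - prints_b)]
    only_b = [json.loads(text) for text in sorted(prints_b - prints_a)]
    return only_a, only_b


# Made-up data: node EX:1 differs only in synonym order, node EX:2 is identical.
file_a = [{"id": "EX:1", "synonym": ["a", "b"]}, {"id": "EX:2"}]
file_b = [{"id": "EX:2"}, {"id": "EX:1", "synonym": ["b", "a"]}]
only_a, only_b = missing_nodes(file_a, file_b)
print("A MISSING NODES", only_a)
print("B MISSING NODES", only_b)
```

Because list order inside a node still counts here, a node whose synonyms were merely reordered shows up in both lists, which matches the behavior reported earlier in this issue.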

saramsey commented 4 years ago

Thanks for the detailed instructions.

saramsey commented 4 years ago

OK, small progress. I managed to make a simpler test-case. Using this file ont-load-inventory-issue999.yaml:

- # maps to CURIE prefix: biolink
  url: https://raw.githubusercontent.com/biolink/biolink-model/master/biolink-model.owl
  file: biolink-model.owl
  download: true
  title: Biolink meta-model
-
  url:  http://purl.obolibrary.org/obo/chebi.owl
  file: chebi.owl
  download: true
  title: Chemical Entities of Biological Interest
- # maps to CURIE prefix: GO
  url:  http://purl.obolibrary.org/obo/go/extensions/go-plus.owl
  file: go-plus.owl
  title: Gene Ontology
  download: true

running on my MBP,

python -u multi_ont_to_json_kg.py --test curies-to-categories.yaml curies-to-urls-map.yaml ont-load-inventory-issue999.yaml test999-a.json
python -u get_nodes_json_from_kg_json.py --test test999-a.json test999-a-nodes.json
python -u multi_ont_to_json_kg.py --test curies-to-categories.yaml curies-to-urls-map.yaml ont-load-inventory-issue999.yaml test999-b.json
python -u get_nodes_json_from_kg_json.py --test test999-b.json test999-b-nodes.json
diff test999-a-nodes.json test999-b-nodes.json

I'm seeing a huge list of differences:

---
>             "category": "biolink:Metabolite",
>             "category label": "metabolite",
2223900,2223901c2223900,2223901
<             "category": "biolink:BiologicalEntity",
<             "category label": "biological_entity",
---
>             "category": "biolink:FunctionalAssociation",
>             "category label": "functional_association",
2269022,2269023c2269022,2269023
<             "category": "biolink:BiologicalEntity",
<             "category label": "biological_entity",
---
>             "category": "biolink:FunctionalAssociation",
>             "category label": "functional_association",
2269056,2269057c2269056,2269057
<             "category": "biolink:BiologicalEntity",
<             "category label": "biological_entity",
---
>             "category": "biolink:FunctionalAssociation",
>             "category label": "functional_association",
2326161,2326162c2326161,2326162
<             "category": "biolink:FunctionalAssociation",
<             "category label": "functional_association",
---
>             "category": "biolink:Drug",
>             "category label": "drug",
2365338,2365339c2365338,2365339
<             "category": "biolink:Drug",
<             "category label": "drug",
---
>             "category": "biolink:FunctionalAssociation",
>             "category label": "functional_association",
2365459,2365460c2365459,2365460
<             "category": "biolink:Drug",
<             "category label": "drug",
---
>             "category": "biolink:FunctionalAssociation",
>             "category label": "functional_association",
2376926,2376927c2376926,2376927
<             "category": "biolink:FunctionalAssociation",
<             "category label": "functional_association",
---
>             "category": "biolink:Drug",
>             "category label": "drug",
2392376,2392377c2392376,2392377
<             "category": "biolink:FunctionalAssociation",
<             "category label": "functional_association",
---
>             "category": "biolink:Drug",
>             "category label": "drug",
saramsey commented 4 years ago

Investigating CHEBI:75769, well lookee here, it has two parents in the ontology,

Screen Shot 2020-08-11 at 4 40 35 PM

Notes:

saramsey commented 4 years ago

Theory: the order of the items in the list for the outer loop is not consistent from run to run of multi_ont_to_json_kg.py, and it terminates on the first time through the loop due to the break on line 315:

Screen Shot 2020-08-11 at 4 45 23 PM
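
To make the theory concrete: if a node has two parents and the loop stops at the first parent that has a category, the result depends on the iteration order. The sketch below is not the code from multi_ont_to_json_kg.py (which appears only in the screenshot); `first_mapped_category`, the parent IDs, and the mapping are hypothetical. It shows one way to make such a loop deterministic by visiting parents in sorted order:

```python
def first_mapped_category(parent_ids, category_by_parent):
    """Return the category of the first parent that has one.

    Parents are visited in sorted order, so the early exit always
    lands on the same parent regardless of how parent_ids was built.
    """
    for parent_id in sorted(parent_ids):
        category = category_by_parent.get(parent_id)
        if category is not None:
            return category  # early exit, analogous to the break in the theory
    return None


# Made-up example: one node with two parents that map to different categories.
parents = {"EX:parent_drug", "EX:parent_assoc"}
mapping = {
    "EX:parent_drug": "biolink:Drug",
    "EX:parent_assoc": "biolink:FunctionalAssociation",
}
print(first_mapped_category(parents, mapping))
```

Sorting only makes the choice repeatable; which parent should win is a separate modeling decision.
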
saramsey commented 4 years ago

Trying a fix now....

saramsey commented 4 years ago

Hi @ericawood can I close this issue out?

ecwood commented 4 years ago

Hi @saramsey, can't quite close this out. I compared the Snakemake build against the Sequential build to ensure that the two build systems line up. Rather than diff-ing the two JSON files, I decided to diff the two edges.tsv and two nodes.tsv files, as they are separated line by line rather than all on one line. Both were using the same version of the code (I didn't git pull in between builds) and when I ran git status, my local copy was up to date with the exception of a couple of changes I made to version.sh and build-kg2.sh to prevent the code from updating to the S3 bucket -- I'm not totally sure how that works because I pulled on Friday. Here is some of the output:

For nodes_header.tsv

category        :LABEL  creation_date   deprecated      description     full_name       id:ID   iri     name    provided_by     publications:string[]   replaced_by     synonym:string[]        update_date     category_label

nodes.tsv diff (snippet of 596M file):

6794143c6794143
< biolink:Gene  gene            False   A protein coding gene LPPR1 in human. // COMMENTS: Category=external.; Type:protein-coding; Locus:9q31.1; NameStatus:official   phospholipid phosphatase related 1      NCBIGene:54886  https://identifiers.org/ncbigene:54886  LPPR1 (human)   OBO:pr.owl                      plasticity-related gene 3 protein; LPPR1; PLPPR1; PRG-3; lipid phosphate phosphatase-related protein type 1; phospholipid phosphatase-related protein type 1; plasticity related gene 3 2020-08-03 06:18:13 GMT gene^M
---
> biolink:Gene  gene            False   A protein coding gene LPPR1 in human. // COMMENTS: Category=external.; Type:protein-coding; Locus:9q31.1; NameStatus:official   phospholipid phosphatase related 1      NCBIGene:54886  https://identifiers.org/ncbigene:54886  LPPR1 (human)   OBO:pr.owl                      PRG-3; LPPR1; PLPPR1; lipid phosphate phosphatase-related protein type 1; phospholipid phosphatase-related protein type 1; plasticity related gene 3; plasticity-related gene 3 protein 2020-08-03 06:18:13 GMT gene^M
6796047c6796047
< biolink:Gene  gene            False   A protein coding gene LPPR4 in human. // COMMENTS: Category=external.; Type:protein-coding; Locus:1p21.3-p21.2; NameStatus:official     phospholipid phosphatase related 4      NCBIGene:9890   https://identifiers.org/ncbigene:9890   LPPR4 (human)   OBO:pr.owl                      plasticity related gene 1; LPPR4; LPR4; PHP1; PLPPR4; PRG-1; PRG1; brain-specific phosphatidic acid phosphatase-like protein 1; lipid phosphate phosphatase-related protein type 4; phospholipid phosphatase-related protein type 4; plasticity-related gene 1 protein  2020-08-03 06:18:13 GMT gene^M
---
> biolink:Gene  gene            False   A protein coding gene LPPR4 in human. // COMMENTS: Category=external.; Type:protein-coding; Locus:1p21.3-p21.2; NameStatus:official     phospholipid phosphatase related 4      NCBIGene:9890   https://identifiers.org/ncbigene:9890   LPPR4 (human)   OBO:pr.owl                      PHP1; LPPR4; LPR4; PLPPR4; PRG-1; PRG1; brain-specific phosphatidic acid phosphatase-like protein 1; lipid phosphate phosphatase-related protein type 4; phospholipid phosphatase-related protein type 4; plasticity related gene 1; plasticity-related gene 1 protein  2020-08-03 06:18:13 GMT gene^M

For edges_header.tsv

edge_label      negated :END_ID provided_by:string[]    publications:string[]   publications_info       relation        edge_label:TYPE simplified_relation     :START_ID       update_date     simplified_edge_label   subject object

edges.tsv diff (snippet of 32G file):

11143a11143
> xref  False   UMLS:C3703015   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:27390       Sat Aug  1 00:43:02 2020        close_match     CPT:27390       UMLS:C3703015^M
11255d11254
< xref  False   UMLS:C3516672   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:88381       Sat Aug  1 00:43:02 2020        close_match     CPT:88381       UMLS:C3516672^M
11256a11256
> xref  False   UMLS:C3516672   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:88381       Sat Aug  1 00:43:02 2020        close_match     CPT:88381       UMLS:C3516672^M
11498d11497
< xref  False   UMLS:C3702541   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:0339T       Sat Aug  1 00:43:02 2020        close_match     CPT:0339T       UMLS:C3702541^M
11500c11499
< xref  False   UMLS:C3517570   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:0206T       Sat Aug  1 00:43:02 2020        close_match     CPT:0206T       UMLS:C3517570^M
---
> xref  False   UMLS:C3702541   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:0339T       Sat Aug  1 00:43:02 2020        close_match     CPT:0339T       UMLS:C3702541^M
11501a11501
> xref  False   UMLS:C3517570   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:0206T       Sat Aug  1 00:43:02 2020        close_match     CPT:0206T       UMLS:C3517570^M
11520d11519
< xref  False   UMLS:C3865608   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:27370       Sat Aug  1 00:43:02 2020        close_match     CPT:27370       UMLS:C3865608^M
11521a11521
> xref  False   UMLS:C3865608   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:27370       Sat Aug  1 00:43:02 2020        close_match     CPT:27370       UMLS:C3865608^M
11564d11563
< xref  False   UMLS:C0374537   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:99058       Sat Aug  1 00:43:02 2020        close_match     CPT:99058       UMLS:C0374537^M
11565a11565
> xref  False   UMLS:C0374537   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:99058       Sat Aug  1 00:43:02 2020        close_match     CPT:99058       UMLS:C0374537^M
11588d11587
< xref  False   UMLS:C3517573   umls_source:CPT         {}      oboFormat:xref  close_match     skos:closeMatch CPT:0209T       Sat Aug  1 00:43:02 2020        close_match     CPT:0209T       UMLS:C3517573^M

I think that the nodes issue is part of this issue. I am unsure about the edges issue, since I don't remember that being significant before (but I also wasn't running diff, I was running a script that compared each element with the other dictionary to see if it was present anywhere).
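
One way to tell whether the edges differ in content or only in line order is to compare the two files as multisets of lines. A minimal sketch, with the hypothetical name `line_multiset_diff`; it holds all lines in memory, so for a file as large as the 32G edges.tsv it would only be practical on a sample (sorting both files before running diff is another option):

```python
from collections import Counter


def line_multiset_diff(lines_a, lines_b):
    """Return (lines only in A, lines only in B), ignoring line order.

    Duplicate lines are counted, so a line that appears twice in A and
    once in B is reported once as only-in-A.
    """
    count_a = Counter(lines_a)
    count_b = Counter(lines_b)
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


# Made-up edge lines: same content, different order.
edges_a = ["edge 1", "edge 2", "edge 3"]
edges_b = ["edge 2", "edge 1", "edge 3"]
print(line_multiset_diff(edges_a, edges_b))
```

If both returned lists are empty, the files contain the same lines and only the ordering differs.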

ecwood commented 4 years ago

Viewing the bottom of the nodes.tsv diff, I suspect this is also an issue with the NCBIGene ETL:

9135161,9135163c9135161,9135163
< biolink:Gene  gene            False   CD209 promoter region; Type:biological-region; Locus:19p; NameStatus:unofficial CD209 promoter region   NCBIGene:117307477      https://identifiers.org/ncbigene:117307477      CD209 promoter region   identifiers_org_registry:ncbigene                       LOC117307477; DC-SIGN promoter  20200801        gene^M
< biolink:Gene  gene            False   CLEC4M promoter region; Type:biological-region; Locus:19p; NameStatus:unofficial        CLEC4M promoter region  NCBIGene:117307478      https://identifiers.org/ncbigene:117307478      CLEC4M promoter region  identifiers_org_registry:ncbigene                       LOC117307478; DC-SIGNR promoter 20200801        gene^M
< biolink:Gene  gene            False   Type:pseudo; Locus:2q21.2; NameStatus:official  CDRT15 pseudogene 4     NCBIGene:117314529      https://identifiers.org/ncbigene:117314529      CDRT15 pseudogene 4     identifiers_org_registry:ncbigene                       CDRT15P4; CMT1A duplicated region transcript 15 pseudogene 4    20200801        gene^M
---
> biolink:Gene  gene            False   CD209 promoter region; Type:biological-region; Locus:19p; NameStatus:unofficial CD209 promoter region   NCBIGene:117307477      https://identifiers.org/ncbigene:117307477      CD209 promoter region   identifiers_org_registry:ncbigene                       DC-SIGN promoter; LOC117307477  20200801        gene^M
> biolink:Gene  gene            False   CLEC4M promoter region; Type:biological-region; Locus:19p; NameStatus:unofficial        CLEC4M promoter region  NCBIGene:117307478      https://identifiers.org/ncbigene:117307478      CLEC4M promoter region  identifiers_org_registry:ncbigene                       DC-SIGNR promoter; LOC117307478 20200801        gene^M
> biolink:Gene  gene            False   Type:pseudo; Locus:2q21.2; NameStatus:official  CDRT15 pseudogene 4     NCBIGene:117314529      https://identifiers.org/ncbigene:117314529      CDRT15 pseudogene 4     identifiers_org_registry:ncbigene                       CMT1A duplicated region transcript 15 pseudogene 4; CDRT15P4    20200801        gene^M
9135195c9135195
< biolink:Gene  gene            False   Type:pseudo; Locus:Yq11.222; NameStatus:official        elongin C pseudogene 26 NCBIGene:118097967      https://identifiers.org/ncbigene:118097967      elongin C pseudogene 26 identifiers_org_registry:ncbigene                       TCEB1P26; transcription elongation factor B (SIII), polypeptide 1 pseudogene 26; ELOCP26; transcription elongation factor B subunit 1 pseudogene 26; ELOC26     20200627        gene^M
---
> biolink:Gene  gene            False   Type:pseudo; Locus:Yq11.222; NameStatus:official        elongin C pseudogene 26 NCBIGene:118097967      https://identifiers.org/ncbigene:118097967      elongin C pseudogene 26 identifiers_org_registry:ncbigene                       transcription elongation factor B subunit 1 pseudogene 26; ELOC26; ELOCP26; TCEB1P26; transcription elongation factor B (SIII), polypeptide 1 pseudogene 26     20200627        gene^M
9135197c9135197
< biolink:Gene  gene            False   GUCA1ANB-GUCA1A readthrough; Type:protein-coding; NameStatus:unofficial GUCA1ANB-GUCA1A readthrough     NCBIGene:118142757      https://identifiers.org/ncbigene:118142757      GUCA1ANB-GUCA1A readthrough     identifiers_org_registry:ncbigene                       GUCA1A; GUCA1ANB-GUCA1A; guanylyl cyclase-activating protein 1  20200722        gene^M
---
> biolink:Gene  gene            False   GUCA1ANB-GUCA1A readthrough; Type:protein-coding; NameStatus:unofficial GUCA1ANB-GUCA1A readthrough     NCBIGene:118142757      https://identifiers.org/ncbigene:118142757      GUCA1ANB-GUCA1A readthrough     identifiers_org_registry:ncbigene                       guanylyl cyclase-activating protein 1; GUCA1ANB-GUCA1A; GUCA1A  20200722        gene^M

Here are some from ChEMBL:

7885691,7885694c7885691,7885694
< biolink:ChemicalSubstance     chemical_substance              False   SID14735976; FULL_MW:265.40; MAX_FDA_APPROVAL_PHASE: 0  SID14735976     CHEMBL.COMPOUND:CHEMBL1368645   https://identifiers.org/chembl.compound:CHEMBL1368645   SID14735976     identifiers_org_registry:chembl                 InChI=1S/C16H27NO2/c1-13-11-14(2)16(15(3)12-13)19-10-6-4-5-7-17-8-9-18/h11-12,17-18H,4-10H2,1-3H3; SEJLAPHMCNYZHT-UHFFFAOYSA-N; Cc1cc(C)c(OCCCCCNCCO)c(C)c1; PUBCHEM_BIOASSAY:14735976; SID14735976     2018-12-10      chemical_substance^M
< biolink:ChemicalSubstance     chemical_substance              False   SID24389608; FULL_MW:416.52; MAX_FDA_APPROVAL_PHASE: 0  SID24389608     CHEMBL.COMPOUND:CHEMBL1368646   https://identifiers.org/chembl.compound:CHEMBL1368646   SID24389608     identifiers_org_registry:chembl                 InChI=1S/C26H28N2O3/c1-18-7-12-24-21(17-31-26(24)19(18)2)14-25(29)28(16-23-6-5-13-30-23)15-20-8-10-22(11-9-20)27(3)4/h5-13,17H,14-16H2,1-4H3; YGUZVOHOFOCSDH-UHFFFAOYSA-N; CN(C)c1ccc(CN(Cc2occc2)C(=O)Cc3coc4c(C)c(C)ccc34)cc1; PUBCHEM_BIOASSAY:24389608; SID24389608 2018-12-10      chemical_substance^M
< biolink:ChemicalSubstance     chemical_substance              False   SID22414438; FULL_MW:328.41; MAX_FDA_APPROVAL_PHASE: 0  SID22414438     CHEMBL.COMPOUND:CHEMBL1368647   https://identifiers.org/chembl.compound:CHEMBL1368647   SID22414438     identifiers_org_registry:chembl                 InChI=1S/C19H24N2O3/c1-19(2,3)16-12-17(21-24-16)20-18(22)13-8-10-15(11-9-13)23-14-6-4-5-7-14/h8-12,14H,4-7H2,1-3H3,(H,20,21,22); TUGIUFAVHHMDBA-UHFFFAOYSA-N; CC(C)(C)c1onc(NC(=O)c2ccc(OC3CCCC3)cc2)c1; PUBCHEM_BIOASSAY:22414438; SID22414438 2018-12-10      chemical_substance^M
< biolink:ChemicalSubstance     chemical_substance              False   SID24817338; FULL_MW:267.24; MAX_FDA_APPROVAL_PHASE: 0  SID24817338     CHEMBL.COMPOUND:CHEMBL1368649   https://identifiers.org/chembl.compound:CHEMBL1368649   SID24817338     identifiers_org_registry:chembl                 InChI=1S/C14H9N3O3/c18-17(19)10-5-7-11(8-6-10)20-14-9-15-12-3-1-2-4-13(12)16-14/h1-9H; RZNQBDUFQHFIRM-UHFFFAOYSA-N; O-N+(=O)c1ccc(Oc2cnc3ccccc3n2)cc1; PUBCHEM_BIOASSAY:24817338; SID24817338   2018-12-10      chemical_substance^M
---
> biolink:ChemicalSubstance     chemical_substance              False   SID14735976; FULL_MW:265.40; MAX_FDA_APPROVAL_PHASE: 0  SID14735976     CHEMBL.COMPOUND:CHEMBL1368645   https://identifiers.org/chembl.compound:CHEMBL1368645   SID14735976     identifiers_org_registry:chembl                 InChI=1S/C16H27NO2/c1-13-11-14(2)16(15(3)12-13)19-10-6-4-5-7-17-8-9-18/h11-12,17-18H,4-10H2,1-3H3; SEJLAPHMCNYZHT-UHFFFAOYSA-N; Cc1cc(C)c(OCCCCCNCCO)c(C)c1; SID14735976; PUBCHEM_BIOASSAY:14735976     2018-12-10      chemical_substance^M
> biolink:ChemicalSubstance     chemical_substance              False   SID24389608; FULL_MW:416.52; MAX_FDA_APPROVAL_PHASE: 0  SID24389608     CHEMBL.COMPOUND:CHEMBL1368646   https://identifiers.org/chembl.compound:CHEMBL1368646   SID24389608     identifiers_org_registry:chembl                 InChI=1S/C26H28N2O3/c1-18-7-12-24-21(17-31-26(24)19(18)2)14-25(29)28(16-23-6-5-13-30-23)15-20-8-10-22(11-9-20)27(3)4/h5-13,17H,14-16H2,1-4H3; YGUZVOHOFOCSDH-UHFFFAOYSA-N; CN(C)c1ccc(CN(Cc2occc2)C(=O)Cc3coc4c(C)c(C)ccc34)cc1; SID24389608; PUBCHEM_BIOASSAY:24389608 2018-12-10      chemical_substance^M
> biolink:ChemicalSubstance     chemical_substance              False   SID22414438; FULL_MW:328.41; MAX_FDA_APPROVAL_PHASE: 0  SID22414438     CHEMBL.COMPOUND:CHEMBL1368647   https://identifiers.org/chembl.compound:CHEMBL1368647   SID22414438     identifiers_org_registry:chembl                 InChI=1S/C19H24N2O3/c1-19(2,3)16-12-17(21-24-16)20-18(22)13-8-10-15(11-9-13)23-14-6-4-5-7-14/h8-12,14H,4-7H2,1-3H3,(H,20,21,22); TUGIUFAVHHMDBA-UHFFFAOYSA-N; CC(C)(C)c1onc(NC(=O)c2ccc(OC3CCCC3)cc2)c1; SID22414438; PUBCHEM_BIOASSAY:22414438 2018-12-10      chemical_substance^M
> biolink:ChemicalSubstance     chemical_substance              False   SID24817338; FULL_MW:267.24; MAX_FDA_APPROVAL_PHASE: 0  SID24817338     CHEMBL.COMPOUND:CHEMBL1368649   https://identifiers.org/chembl.compound:CHEMBL1368649   SID24817338     identifiers_org_registry:chembl                 InChI=1S/C14H9N3O3/c18-17(19)10-5-7-11(8-6-10)20-14-9-15-12-3-1-2-4-13(12)16-14/h1-9H; RZNQBDUFQHFIRM-UHFFFAOYSA-N; O-N+(=O)c1ccc(Oc2cnc3ccccc3n2)cc1; SID24817338; PUBCHEM_BIOASSAY:24817338   2018-12-10      chemical_substance^M

Here are some from Ensembl:

7112989c7112989
< biolink:Gene  gene            False           chromosome X open reading frame 51A [Source:HGNC Symbol;Acc:HGNC:30533] ENSEMBL:ENSG00000224440 https://identifiers.org/ensembl:ENSG00000224440 chromosome X open reading frame 51A [Source:HGNC Symbol;Acc:HGNC:30533] identifiers_org_registry:ensembl                        CXorf51A; HGNC:30533; 100129239; A0A1B0GTR3     2019-03 gene^M
---
> biolink:Gene  gene            False           chromosome X open reading frame 51A [Source:HGNC Symbol;Acc:HGNC:30533] ENSEMBL:ENSG00000224440 https://identifiers.org/ensembl:ENSG00000224440 chromosome X open reading frame 51A [Source:HGNC Symbol;Acc:HGNC:30533] identifiers_org_registry:ensembl                        CXorf51A; A0A1B0GTR3; 100129239; HGNC:30533     2019-03 gene^M
7112996c7112996
< biolink:Gene  gene            False           Abelson helper integration site 1 [Source:HGNC Symbol;Acc:HGNC:21575]   ENSEMBL:ENSG00000135541 https://identifiers.org/ensembl:ENSG00000135541 Abelson helper integration site 1 [Source:HGNC Symbol;Acc:HGNC:21575]   identifiers_org_registry:ensembl                        AHI1; E9PML3; R-HSA-5620912; Q8N157; 608894; R-HSA-1852241; R-HSA-5617833; 54806; E9PI51; HGNC:21575; 608629; Q9NQN3    2019-03 gene^M
---
> biolink:Gene  gene            False           Abelson helper integration site 1 [Source:HGNC Symbol;Acc:HGNC:21575]   ENSEMBL:ENSG00000135541 https://identifiers.org/ensembl:ENSG00000135541 Abelson helper integration site 1 [Source:HGNC Symbol;Acc:HGNC:21575]   identifiers_org_registry:ensembl                        AHI1; Q9NQN3; R-HSA-5620912; R-HSA-5617833; 54806; Q8N157; 608894; HGNC:21575; E9PML3; R-HSA-1852241; 608629; E9PI51    2019-03 gene^M
7112999c7112999
< biolink:Gene  gene            False           RNA, U1 small nuclear 52, pseudogene [Source:HGNC Symbol;Acc:HGNC:48394]        ENSEMBL:ENSG00000206917 https://identifiers.org/ensembl:ENSG00000206917 RNA, U1 small nuclear 52, pseudogene [Source:HGNC Symbol;Acc:HGNC:48394]        identifiers_org_registry:ensembl                        RNU1-52P; HGNC:48394; RF00003   2019-03 gene^M
---
> biolink:Gene  gene            False           RNA, U1 small nuclear 52, pseudogene [Source:HGNC Symbol;Acc:HGNC:48394]        ENSEMBL:ENSG00000206917 https://identifiers.org/ensembl:ENSG00000206917 RNA, U1 small nuclear 52, pseudogene [Source:HGNC Symbol;Acc:HGNC:48394]        identifiers_org_registry:ensembl                        RNU1-52P; RF00003; HGNC:48394   2019-03 gene^M

Here are some from UniProtKB:

7112155,7112156c7112155,7112156
< biolink:Protein       protein 31-OCT-2006     False   -!- CAUTION: Product of a dubious gene prediction. {ECO:0000305}.       Putative uncharacterized protein LOC642776      UniProtKB:Q9BTK2        https://identifiers.org/uniprot:Q9BTK2  Putative uncharacterized protein LOC642776      identifiers_org_registry:uniprot        PMID:16710414; PMID:15489334; DOI:10.1101/gr.2596504; DOI:10.1038/nature04727           SEQUENCE   45 AA;  4916 MW;  6A3D21D5765D6950 CRC64;\nMEGGMAAYPV ATRESRCRRG RIGVQPSPER RSEVVGPFPL ARSLS\n       11-DEC-2019     protein^M
< biolink:Protein       protein 26-FEB-2008     False           Putative uncharacterized protein FLJ39060       UniProtKB:Q8N8P6        https://identifiers.org/uniprot:Q8N8P6  Putative uncharacterized protein FLJ39060       identifiers_org_registry:uniprot        DOI:10.1038/ng1285; PMID:14702039; PMID:15772651; DOI:10.1038/nature03440               SEQUENCE   123 AA;  13896 MW;  E5DED0143EDCD905 CRC64;\nMVDGRTRTII NDIFFTEPTP EMSSLPVRSH SSLSLNLVSL MVICRGIIKL VIHFRMYCPP\nRLKAKHIEPT LRPVPLKELR ISHWPNECIR HSASVPMATG ANGLETKDET KRNAEKCACS\nVFL\n     12-AUG-2020     protein^M
---
> biolink:Protein       protein 31-OCT-2006     False   -!- CAUTION: Product of a dubious gene prediction. {ECO:0000305}.       Putative uncharacterized protein LOC642776      UniProtKB:Q9BTK2        https://identifiers.org/uniprot:Q9BTK2  Putative uncharacterized protein LOC642776      identifiers_org_registry:uniprot        PMID:16710414; PMID:15489334; DOI:10.1038/nature04727; DOI:10.1101/gr.2596504           SEQUENCE   45 AA;  4916 MW;  6A3D21D5765D6950 CRC64;\nMEGGMAAYPV ATRESRCRRG RIGVQPSPER RSEVVGPFPL ARSLS\n       11-DEC-2019     protein^M
> biolink:Protein       protein 26-FEB-2008     False           Putative uncharacterized protein FLJ39060       UniProtKB:Q8N8P6        https://identifiers.org/uniprot:Q8N8P6  Putative uncharacterized protein FLJ39060       identifiers_org_registry:uniprot        DOI:10.1038/nature03440; PMID:14702039; PMID:15772651; DOI:10.1038/ng1285               SEQUENCE   123 AA;  13896 MW;  E5DED0143EDCD905 CRC64;\nMVDGRTRTII NDIFFTEPTP EMSSLPVRSH SSLSLNLVSL MVICRGIIKL VIHFRMYCPP\nRLKAKHIEPT LRPVPLKELR ISHWPNECIR HSASVPMATG ANGLETKDET KRNAEKCACS\nVFL\n     12-AUG-2020     protein^M
7112364c7112364
< biolink:Protein       protein 21-MAR-2012     False   -!- ALTERNATIVE PRODUCTS:Event=Alternative splicing; Named isoforms=3;Name=2; Synonyms=ZHX1-C8orf76;IsoId=Q96EF9-1; Sequence=Displayed;Name=1;IsoId=Q9UKY1-1; Sequence=External;Name=3;IsoId=Q96K31-1; Sequence=External;-!- MISCELLANEOUS: [Isoform 2]: Based on a readthrough transcript which mayproduce a ZHX1-C8orf76 fusion protein.      Zinc fingers and homeoboxes protein 1, isoform 2        UniProtKB:Q96EF9        https://identifiers.org/uniprot:Q96EF9  ZHX1-C8orf76    identifiers_org_registry:uniprot        PMID:16421571; PMID:15489334; DOI:10.1101/gr.2596504; DOI:10.1038/nature04406           ZHX1-C8orf76; SEQUENCE   292 AA;  33285 MW;  3F5E55E9C6F23D93 CRC64;\nMLRKLWQWFY EETESSDDVE VLTLKKFKGD LAYRRQEYQK ALQEYSSISE KLSSTNFAMK\nRDVQEGQARC LAHLGRHMEA LEIAANLENK ATNTDHLTTV LYLQLAICSS LQNLEKTIFC\nLQKLISLHPF NPWNWGKLAE AYLNLGPALS AALASSQKQH SFTSSDKTIK SFFPHSGKDC\nLLCFPETLPE SSLFSVEANS SNSQKNEKAL TNIQNCMAEK RETVLIETQL KACASFIRTR\nLLLQFTQPQQ TSFALERNLR TQQEIEDKMK GFSFKEDTLL LIAEVSVSLG FM\n; ZHX1-C8orf76 readthrough transcript protein      12-AUG-2020     protein^M
---
> biolink:Protein       protein 21-MAR-2012     False   -!- ALTERNATIVE PRODUCTS:Event=Alternative splicing; Named isoforms=3;Name=2; Synonyms=ZHX1-C8orf76;IsoId=Q96EF9-1; Sequence=Displayed;Name=1;IsoId=Q9UKY1-1; Sequence=External;Name=3;IsoId=Q96K31-1; Sequence=External;-!- MISCELLANEOUS: [Isoform 2]: Based on a readthrough transcript which mayproduce a ZHX1-C8orf76 fusion protein.      Zinc fingers and homeoboxes protein 1, isoform 2        UniProtKB:Q96EF9        https://identifiers.org/uniprot:Q96EF9  ZHX1-C8orf76    identifiers_org_registry:uniprot        PMID:15489334; PMID:16421571; DOI:10.1101/gr.2596504; DOI:10.1038/nature04406           ZHX1-C8orf76; SEQUENCE   292 AA;  33285 MW;  3F5E55E9C6F23D93 CRC64;\nMLRKLWQWFY EETESSDDVE VLTLKKFKGD LAYRRQEYQK ALQEYSSISE KLSSTNFAMK\nRDVQEGQARC LAHLGRHMEA LEIAANLENK ATNTDHLTTV LYLQLAICSS LQNLEKTIFC\nLQKLISLHPF NPWNWGKLAE AYLNLGPALS AALASSQKQH SFTSSDKTIK SFFPHSGKDC\nLLCFPETLPE SSLFSVEANS SNSQKNEKAL TNIQNCMAEK RETVLIETQL KACASFIRTR\nLLLQFTQPQQ TSFALERNLR TQQEIEDKMK GFSFKEDTLL LIAEVSVSLG FM\n; ZHX1-C8orf76 readthrough transcript protein      12-AUG-2020     protein^M
ecwood commented 4 years ago

FYI, the report files for the two builds were identical (with the exception of the report generated time).

saramsey commented 4 years ago

Excellent detective work.

OK, so it looks like the synonym list ordering is variable from run to run, is that correct?

ecwood commented 4 years ago

> Excellent detective work.
>
> OK, so it looks like the synonym list ordering is variable from run to run, is that correct?

Yes, that was my conclusion as well.
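
One common way for list ordering to vary between otherwise identical Python runs is building the list from a `set` of strings: string hashing is randomized per process unless `PYTHONHASHSEED` is fixed, so set iteration order can differ from run to run. Whether that is the cause in the KG2 modules is an assumption here, not something the thread confirms. A minimal sketch that shows the effect (the alias names are hypothetical placeholders):

```python
import os
import subprocess
import sys

# Child program: join the elements of a set of strings in iteration order.
# 'alias_a' etc. are hypothetical placeholder names, not real synonyms.
SNIPPET = "print(';'.join(set(['PLPPR1', 'alias_a', 'alias_b', 'alias_c'])))"


def set_order_under_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed hash seed and
    return the elements in the order the set yielded them."""
    env = {**os.environ, "PYTHONHASHSEED": str(seed)}
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().split(";")


if __name__ == "__main__":
    for seed in (1, 2, 3):
        order = set_order_under_seed(seed)
        # The raw order may differ between seeds; the sorted order never does.
        print(seed, order, sorted(order))
```

Sorting before serializing (or using a list or dict that preserves insertion order) removes the dependence on the hash seed.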

saramsey commented 4 years ago

Good idea to diff the nodes.tsv files. Which direction is the diff? i.e., is < the sequential build or is > the sequential build?

ecwood commented 4 years ago

On kg2steve in kg2-build:

Edges: diff SequentialBuild/edges.tsv SnakemakeBuild/edges.tsv > edges_tsv_diff.log

Nodes: diff SequentialBuild/nodes.tsv SnakemakeBuild/nodes.tsv > nodes_tsv_diff.log
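
With `diff A B`, lines prefixed `<` come from the first file and lines prefixed `>` from the second, so in these logs `<` is the sequential build. Because plain `diff` is sensitive to both row order and the order of items inside a multi-valued field, a comparison that ignores both can help separate real content differences from ordering noise. The following is a minimal sketch, assuming tab-separated rows and multi-valued fields joined with `"; "` (as the publications field appears in the node diff); the function names are hypothetical and not part of the KG2 code:

```python
import csv
from collections import Counter

LIST_DELIMITER = "; "  # assumption: multi-valued fields are joined with "; "


def canonical_row(row):
    """Return the row as a tuple with the items of each field sorted."""
    return tuple(
        LIST_DELIMITER.join(sorted(field.split(LIST_DELIMITER)))
        for field in row
    )


def load_canonical_rows(path):
    """Read a TSV file into a multiset of canonicalized rows."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return Counter(canonical_row(row) for row in reader)


def tsv_files_equivalent(path_a, path_b):
    """True if both files hold the same rows, ignoring row order and the
    order of "; "-separated items within each field."""
    return load_canonical_rows(path_a) == load_canonical_rows(path_b)
```

Note that this also treats free-text fields containing `"; "` as lists, so it would not notice a reordering of clauses inside a description; it is a triage aid, not a replacement for a strict diff.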

saramsey commented 4 years ago

Thanks, this is well organized.

saramsey commented 4 years ago

For the moment, I'm focusing on the nodes issue.

saramsey commented 4 years ago

Did you mean ParallelBuild instead of SnakemakeBuild?

saramsey commented 4 years ago

Can you confirm -- you did a test build, is that correct?

ecwood commented 4 years ago

Hi @saramsey,

1) The data is in SnakemakeBuild on kg2steve: image
2) No, this was a partial build without test flags

saramsey commented 4 years ago

Sorry, I was looking on kg2dev. My bad.

saramsey commented 4 years ago

Switching over to kg2steve now....

saramsey commented 4 years ago

On kg2steve, I am running

python ~/kg2-code/get_nodes_json_from_kg_json.py --test kg2-simplified.json kg2-simplified-nodes.json

so I can have a look at the NCBIGene:54886 node

saramsey commented 4 years ago

I posit that this (and analogous code in other modules) in ncbigene_tsv_to_kg_json.py may be our smoking gun:

Screen Shot 2020-08-25 at 1 49 00 PM

For NCBIGene:54886, the ordering of synonym entries in the file /home/ubuntu/kg2-build/SequentialBuild/kg2-simplified-nodes.json on kg2steve.rtx.ai is clearly incorrect (official gene symbol PLPPR1 is not first):

Screen Shot 2020-08-25 at 1 50 57 PM
saramsey commented 4 years ago

I propose the following remedy to ncbigene_tsv_to_kg_json.py:

Screen Shot 2020-08-25 at 1 54 36 PM
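
The following is a minimal sketch of one way to make the synonym order deterministic while keeping the official gene symbol first. It is not necessarily the change shown in the screenshot; the function name `order_synonyms` and the alias values are hypothetical:

```python
def order_synonyms(official_symbol, other_names):
    """Return a synonym list with the official symbol first, followed by
    the remaining unique names in sorted order.

    Sorting the remainder makes the output independent of the iteration
    order of whatever collection other_names happens to be.
    """
    rest = sorted(set(other_names) - {official_symbol})
    return [official_symbol] + rest


if __name__ == "__main__":
    # "ALIAS_A" and "ALIAS_B" are placeholders, not real PLPPR1 aliases.
    print(order_synonyms("PLPPR1", {"ALIAS_B", "PLPPR1", "ALIAS_A"}))
```

Keeping the official symbol out of the sorted portion preserves the convention that the first synonym is the official gene symbol.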
saramsey commented 4 years ago

I note that /home/ubuntu/kg2-build/kg2-ncbigene.json is also incorrect in the same way:

Screen Shot 2020-08-25 at 1 57 42 PM

This increases my suspicion that the problem is in ncbigene_tsv_to_kg_json.py.

saramsey commented 4 years ago

Looks like my proposed fix is helping:

Screen Shot 2020-08-25 at 1 59 38 PM
ecwood commented 4 years ago

Edges from UniProt:

30825334a30825332
> physically_interacts_with     False   CHEBI:57394     identifiers_org_registry:uniprot                {}      biolink:physically_interacts_with       physically_interacts_with       biolink:physically_interacts_with       UniProtKB:Q9NXF8        12-AUG-2020     physically_interacts_with       UniProtKB:Q9NXF8        CHEBI:57394^M
30825337c30825335,30825336
< physically_interacts_with     False   CHEBI:143199    identifiers_org_registry:uniprot                {}      biolink:physically_interacts_with       physically_interacts_with       biolink:physically_interacts_with       UniProtKB:Q9NXF8        12-AUG-2020     physically_interacts_with       UniProtKB:Q9NXF8        CHEBI:143199^M
---
> physically_interacts_with     False   CHEBI:74151     identifiers_org_registry:uniprot                {}      biolink:physically_interacts_with       physically_interacts_with       biolink:physically_interacts_with       UniProtKB:Q9NXF8        12-AUG-2020     physically_interacts_with       UniProtKB:Q9NXF8        CHEBI:74151^M
> physically_interacts_with     False   CHEBI:143200    identifiers_org_registry:uniprot                {}      biolink:physically_interacts_with       physically_interacts_with       biolink:physically_interacts_with       UniProtKB:Q9NXF8        12-AUG-2020     physically_interacts_with       UniProtKB:Q9NXF8        CHEBI:143200^M
30825338a30825338,30825339
> physically_interacts_with     False   CHEBI:143199    identifiers_org_registry:uniprot                {}      biolink:physically_interacts_with       physically_interacts_with       biolink:physically_interacts_with       UniProtKB:Q9NXF8        12-AUG-2020     physically_interacts_with       UniProtKB:Q9NXF8        CHEBI:143199^M
> physically_interacts_with     False   CHEBI:57287     identifiers_org_registry:uniprot                {}      biolink:physically_interacts_with       physically_interacts_with       biolink:physically_interacts_with       UniProtKB:Q9ULC8        12-AUG-2020     physically_interacts_with       UniProtKB:Q9ULC8        CHEBI:57287^M
saramsey commented 4 years ago

Hi @ericawood just checking back on this... do you think the above commits helped?

ecwood commented 4 years ago

Hi @saramsey, I still haven't tested yet (please see my Slack message). May I use kg2steve for testing this, since it has constant pickle files for the ontologies? My plan is to first do an alltest build using Snakemake, then a test build using build-kg2.sh. Following that, I will compare the two.

ecwood commented 4 years ago

Hi @saramsey, this issue was fixed in the test builds I compared between Snakemake and Sequential. Should I close it out?