ExposuresProvider / cam-pipeline

Data loading pipeline for CAM database
https://exposuresprovider.github.io/cam-pipeline/
MIT License

Investigate why some identifiers aren't being normalized and report to NodeNorm #93

Open gaurav opened 1 year ago

gaurav commented 1 year ago

In the CAM-KP 2023-04-20 release, we've got a large number of nodes that could not be normalized:

      1 BFO
      1 EMAP
      1 EMPA
      1 MA
      1 PRO
      1 RGD
      1 WBPhenotype
      2 CL
      2 RefSeq
      2 WormBase
      3 CHEBI
      3 ENA
      3 RNACENTRAL
      4 ZFIN
      5 ComplexPortal
      7 PR
      8 SO
      9 taxon
     12 WBbt
     16 MGI
     20 XAO
     25 GO
     91 Flybase
     93 SGD
     94 UniProtKB
    131 ZFA
    343 Xenbase
   1147 "ENSEMBL
   1356 EMAPA
  17433 REACTO
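
For anyone who wants to reproduce a tally like the one above, here is a minimal sketch. It assumes the failed identifiers are available as a list of CURIE strings; the function names and the sample identifiers are hypothetical placeholders, not taken from the pipeline.

```python
from collections import Counter


def count_by_prefix(curies):
    """Count identifiers by CURIE prefix (the text before the first colon)."""
    return Counter(curie.split(":", 1)[0] for curie in curies)


def format_report(counts):
    """Render counts as right-aligned lines, smallest count first."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0].lower()))
    return "\n".join(f"{count:>7} {prefix}" for prefix, count in ordered)


if __name__ == "__main__":
    # Placeholder identifiers for illustration only (not real failures).
    failed = ["GO:placeholder1", "GO:placeholder2", "SO:placeholder1"]
    print(format_report(count_by_prefix(failed)))
```

Splitting on the first colon only keeps any colons in the local part of the identifier intact, and an identifier with no colon is counted under its full text.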

The main difference from the node normalization report for the previous CAM-KP 2023-04-14 release is a much larger number of REACTO identifiers: 17433 in this release versus 142 previously. The previous node normalization failures were:

      1 AspGD
      1 EMAP
      1 EMPA
      1 HGNC
      1 MA
      1 PATO
      1 PMID
      1 PRO
      1 RGD
      1 UBERON
      2 CL
      2 PseudoCAP
      2 WBls
      2 WBPhenotype
      3 AGI_LocusCode
      3 ENA
      3 PO
      3 RefSeq
      4 ZFIN
      4 ZFS
      6 CHEBI
      6 DDANAT
      8 PR
      9 RNACENTRAL
      9 taxon
     11 ComplexPortal
     11 CTDI
     16 SO
     18 UniProtKB
     19 MGI
     21 tair.locus
     25 XAO
     33 WormBase
     34 MESH
     47 GO
     52 WBbt
     66 PomBase
     98 SGD
    117 Flybase
    142 REACTO
    177 ZFA
    393 Xenbase
    611 NCBIGene
   1359 EMAPA
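
To see which prefixes changed most between the two reports, a rough sketch like the following could help. It assumes each report is plain text with one `count prefix` pair per line, as in the listings above; the function names are hypothetical, and only a few rows from each report are included to keep the example short.

```python
def parse_report(text):
    """Parse lines of the form '<count> <prefix>' into a dict of prefix -> count."""
    counts = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            counts[parts[1]] = int(parts[0])
    return counts


def compare_reports(previous, current):
    """Return (prefix, delta) pairs, largest absolute change first.

    A prefix missing from one report is treated as a count of zero.
    """
    prefixes = set(previous) | set(current)
    deltas = {p: current.get(p, 0) - previous.get(p, 0) for p in prefixes}
    return sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))


# A few rows copied from the two reports above (not the full listings).
previous_text = """\
    142 REACTO
    393 Xenbase
    611 NCBIGene
   1359 EMAPA
"""
current_text = """\
    343 Xenbase
   1356 EMAPA
  17433 REACTO
"""

if __name__ == "__main__":
    changes = compare_reports(parse_report(previous_text), parse_report(current_text))
    for prefix, delta in changes:
        print(f"{delta:+7d} {prefix}")
```

Sorting by absolute change puts REACTO at the top and also surfaces prefixes that disappeared entirely, such as NCBIGene in this subset.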

gaurav commented 1 year ago

Some identifiers that failed normalization in the CAM-KP/2023-04-20 run and might be worth digging into: