biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

Add a CLI method to determine if a JSON context is "bioregistry conformant" #490

Open matentzn opened 2 years ago

matentzn commented 2 years ago

It would be super cool if we could build our own contexts and validate them against bioregistry as part of our CI. I think this would significantly encourage standardisation. I would use this for all projects.

The method should be able to specify a bioregistry-managed context by name and take as input another context to be validated. Why not use the bioregistry-validated context directly? We will likely want to keep cached subsets of the huge and growing contexts in bioregistry (specifying maybe 10 prefixes).

Maybe it's a dumb idea but I would use it.
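
The method described above (a named reference context plus a smaller cached context to validate) might look roughly like the following. This is only a sketch: `check_conformance` and the three-entry `REFERENCE` dictionary are hypothetical stand-ins, not part of the bioregistry package, and a real implementation would load the reference context from bioregistry by name.

```python
from typing import Dict, List

# Hypothetical stand-in for a context managed by bioregistry (assumption:
# a context is a plain mapping from prefix to URI prefix).
REFERENCE: Dict[str, str] = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}


def check_conformance(candidate: Dict[str, str], reference: Dict[str, str]) -> List[str]:
    """Return one message per problem; an empty list means the candidate conforms."""
    problems = []
    for prefix, uri_prefix in sorted(candidate.items()):
        if prefix not in reference:
            problems.append(f"{prefix} - not in reference context")
        elif reference[prefix] != uri_prefix:
            problems.append(f"{prefix} - URI prefix differs, expected {reference[prefix]}")
    return problems


# A cached subset that reuses one entry correctly and redefines another.
subset = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "CHEBI": "http://example.org/chebi/",
}
for message in check_conformance(subset, REFERENCE):
    print(message)
```

Under this sketch, a subset conforms when every one of its prefixes appears in the reference context with the same URI prefix; prefixes that the subset omits are not an error.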

cthoyt commented 2 years ago

sure, better as a CLI or a python function?

In https://github.com/biopragmatics/bioregistry/pull/416 I started creating utilities for validating/working with data in pandas dataframes as well, so this is a nice future direction for the package.

matentzn commented 2 years ago

Ultimately it should be a CLI function, as I would want to weave this into CI pipelines based on shell commands and make. I really have command-line users in mind who do not necessarily want to dork with Python scripts.
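
For use from shell commands and make, the key property is the exit status: zero when the context passes, non-zero otherwise. The sketch below shows that shape with `argparse`. The names `main`, `find_problems`, and `REFERENCE_PREFIXES` are hypothetical, and the check itself is reduced to prefix membership; this is not the interface of the bioregistry CLI.

```python
import argparse
import json
from typing import Iterable, List, Optional, Set

# Hypothetical reference prefixes; a real tool would load a context
# registered in bioregistry instead.
REFERENCE_PREFIXES: Set[str] = {"GO", "CHEBI", "UBERON"}


def find_problems(prefixes: Iterable[str], reference: Set[str]) -> List[str]:
    """List the prefixes that are missing from the reference set."""
    return [f"{prefix} - not in reference context" for prefix in sorted(prefixes) if prefix not in reference]


def main(argv: Optional[List[str]] = None) -> int:
    """Validate the @context of a JSON-LD file and return a process exit status."""
    parser = argparse.ArgumentParser(description="Check a JSON-LD context against a reference context.")
    parser.add_argument("path", help="path to a JSON-LD file with an @context object")
    args = parser.parse_args(argv)

    with open(args.path) as file:
        context = json.load(file)["@context"]

    problems = find_problems(context, REFERENCE_PREFIXES)
    for problem in problems:
        print(problem)
    # Non-zero status makes a make target or CI step fail.
    return 1 if problems else 0


# In a script, finish with: raise SystemExit(main())
```

A make recipe or CI step could then call the script directly and rely on the exit status to stop the pipeline.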

cthoyt commented 2 years ago

so like who are your target people in mind and how will they feel about most of their stuff being wrong? E.g., I ran bioregistry validate jsonld "https://raw.githubusercontent.com/prefixcommons/prefixcommons-py/master/prefixcommons/registry/go_context.jsonld" --relax and got this output:

BIOMD - nonstandard > Switch to standard prefix: biomodels.db
COG_Function - invalid
WB - nonstandard > Switch to standard prefix: wormbase
FBbt - nonstandard > Switch to standard prefix: fbbt
KEGG_LIGAND - nonstandard > Switch to standard prefix: kegg.ligand
PSO_GIT - invalid
MaizeGDB_stock - invalid
EMAPA - nonstandard > Switch to standard prefix: emapa
GO - nonstandard > Switch to standard prefix: go
NCBI_GP - invalid
NMPDR - invalid
CASSPC - nonstandard > Switch to standard prefix: casspc
TGD_REF - invalid
NCBIGene - nonstandard > Switch to standard prefix: ncbigene
KEGG_REACTION - nonstandard > Switch to standard prefix: kegg.reaction
PseudoCAP - invalid
UniPathway - nonstandard > Switch to standard prefix: upa
MEROPS_fam - invalid
GO_REF - nonstandard > Switch to standard prefix: go.ref
VEGA - nonstandard > Switch to standard prefix: vega
ZFIN - nonstandard > Switch to standard prefix: zfin
AspGD_REF - invalid
RO - nonstandard > Switch to standard prefix: ro
Pfam - nonstandard > Switch to standard prefix: pfam
UBERON - nonstandard > Switch to standard prefix: uberon
GR - invalid
PDB - nonstandard > Switch to standard prefix: pdb
CORIELL - nonstandard > Switch to standard prefix: coriell
JCVI_GenProp - invalid
SGN - nonstandard > Switch to standard prefix: sgn
BFO - nonstandard > Switch to standard prefix: bfo
Genesys-pgr - invalid
UniMod - nonstandard > Switch to standard prefix: unimod
UM-BBD_reactionID - nonstandard > Switch to standard prefix: umbbd.reaction
PubChem_Substance - nonstandard > Switch to standard prefix: pubchem.substance
EcoCyc - nonstandard > Switch to standard prefix: ecocyc
Reactome - nonstandard > Switch to standard prefix: reactome
InterPro - nonstandard > Switch to standard prefix: interpro
UniRule - nonstandard > Switch to standard prefix: unirule
MGCSC_GENETIC_STOCKS - invalid
dictyBase - nonstandard > Switch to standard prefix: dictybase
PO_GIT - invalid
AspGD_LOCUS - nonstandard > Switch to standard prefix: aspgd.locus
SGD - nonstandard > Switch to standard prefix: sgd
COG_Pathway - nonstandard > Switch to standard prefix: cog.pathway
ENZYME - invalid
PAMGO_MGG - invalid
AgBase - invalid
AraCyc - invalid
EcoCyc_REF - invalid
CHEBI - nonstandard > Switch to standard prefix: chebi
HGNC - nonstandard > Switch to standard prefix: hgnc
dictyBase_gene_name - invalid
TAIR - invalid
EnsemblFungi - nonstandard > Switch to standard prefix: ensembl.fungi
Wikipedia - nonstandard > Switch to standard prefix: wikipedia.en
SUPERFAMILY - invalid
SWALL - invalid
PSI-MOD - nonstandard > Switch to standard prefix: mod
FYPO - nonstandard > Switch to standard prefix: fypo
RGD - nonstandard > Switch to standard prefix: rgd
UM-BBD_enzymeID - nonstandard > Switch to standard prefix: umbbd.enzyme
Broad_MGG - invalid
Swiss-Prot - nonstandard > Switch to standard prefix: uniprot
PMID - nonstandard > Switch to standard prefix: pubmed
Xenbase - nonstandard > Switch to standard prefix: xenbase
PR - nonstandard > Switch to standard prefix: pr
MIPS_funcat - invalid
GR_REF - invalid
MaizeGDB - nonstandard > Switch to standard prefix: maizegdb.locus
HAMAP - nonstandard > Switch to standard prefix: hamap
SGN_ref - invalid
TO_GIT - invalid
MeSH - nonstandard > Switch to standard prefix: mesh
GR_PROTEIN - nonstandard > Switch to standard prefix: gramene.protein
MaizeGDB_REF - invalid
GEO - nonstandard > Switch to standard prefix: geo
PO - nonstandard > Switch to standard prefix: po
PomBase - nonstandard > Switch to standard prefix: pombase
ENA - nonstandard > Switch to standard prefix: ena.embl
PIRSF - nonstandard > Switch to standard prefix: pirsf
EMBL - invalid
Prosite - nonstandard > Switch to standard prefix: prosite
H-invDB_cDNA - invalid
EC - nonstandard > Switch to standard prefix: eccode
MACSC_REF - invalid
PAMGO_VMD - invalid
IRGC - invalid
NASC_code - invalid
COG_Cluster - nonstandard > Switch to standard prefix: cog
TreeGenes - invalid
WB_REF - nonstandard > Switch to standard prefix: wormbase
TGD_LOCUS - invalid
MA - nonstandard > Switch to standard prefix: ma
UniProtKB - nonstandard > Switch to standard prefix: uniprot
MGI - nonstandard > Switch to standard prefix: mgi
GRINDesc - invalid
DDANAT - nonstandard > Switch to standard prefix: ddanat
RAP-DB - invalid
gomodel - nonstandard > Switch to standard prefix: go.model
KEGG_PATHWAY - nonstandard > Switch to standard prefix: kegg.pathway
BTO - nonstandard > Switch to standard prefix: bto
JCVI_CMR - invalid
dictyBase_REF - invalid
DOI - nonstandard > Switch to standard prefix: doi
LIFEdb - invalid
PANTHER - invalid
Gene3D - invalid
PATRIC - invalid
FB - nonstandard > Switch to standard prefix: flybase
PAINT_REF - invalid
CASREF - invalid
ENSEMBL - nonstandard > Switch to standard prefix: ensembl
SMART - nonstandard > Switch to standard prefix: smart
RefSeq - nonstandard > Switch to standard prefix: refseq
WBls - nonstandard > Switch to standard prefix: wbls
MaizeGDB_QTL - invalid
SOY_ref - invalid
ECO - nonstandard > Switch to standard prefix: eco
CGD_REF - invalid
ECK - invalid
CGD - nonstandard > Switch to standard prefix: cgd
GR_GENE - nonstandard > Switch to standard prefix: gramene.gene
RNAmods - nonstandard > Switch to standard prefix: rnamods
KEGG_ENZYME - nonstandard > Switch to standard prefix: kegg.enzyme
CACAO - invalid
IUPHAR_GPCR - nonstandard > Switch to standard prefix: iuphar.receptor
JCVI_TIGRFAMS - invalid
SOY_QTL - invalid
DDBJ - invalid
PRINTS - nonstandard > Switch to standard prefix: prints
PO_REF - invalid
IMG - invalid
CL - nonstandard > Switch to standard prefix: cl
UniProtKB-SubCell - nonstandard > Switch to standard prefix: uniprot.location
NIF_Subcellular - nonstandard > Switch to standard prefix: nlx.sub
GeneDB - nonstandard > Switch to standard prefix: genedb
ApiDB_PlasmoDB - nonstandard > Switch to standard prefix: plasmodb
RNAcentral - nonstandard > Switch to standard prefix: rnacentral
CGD_LOCUS - invalid
Rfam - nonstandard > Switch to standard prefix: rfam
Broad_NEUROSPORA - invalid
AGI_LocusCode - invalid
OBO_SF2_PO - invalid
FMA - nonstandard > Switch to standard prefix: fma
CDD - nonstandard > Switch to standard prefix: cdd
PubChem_Compound - nonstandard > Switch to standard prefix: pubchem.compound
HGNC_gene - invalid
PharmGKB - invalid
VMD - invalid
UniParc - nonstandard > Switch to standard prefix: uniparc
MEROPS - invalid
GDB - invalid
SEED - nonstandard > Switch to standard prefix: seed
SO - nonstandard > Switch to standard prefix: so
Soy_gene - invalid
CORUM - nonstandard > Switch to standard prefix: corum
RHEA - nonstandard > Switch to standard prefix: rhea
dbSNP - nonstandard > Switch to standard prefix: dbsnp
MaizeGDB_Locus - nonstandard > Switch to standard prefix: maizegdb.locus
MO - nonstandard > Switch to standard prefix: mo
PLANA_REF - invalid
ISBN - nonstandard > Switch to standard prefix: isbn
BRENDA - nonstandard > Switch to standard prefix: brenda
ASAP - nonstandard > Switch to standard prefix: asap
CAS - nonstandard > Switch to standard prefix: cas
H-invDB_locus - invalid
UM-BBD_ruleID - nonstandard > Switch to standard prefix: umbbd.rule
NCBITaxon - nonstandard > Switch to standard prefix: ncbitaxon
ComplexPortal - nonstandard > Switch to standard prefix: complexportal
JSTOR - nonstandard > Switch to standard prefix: jstor
GRIMS - invalid
PATO - nonstandard > Switch to standard prefix: pato
GR_QTL - nonstandard > Switch to standard prefix: gramene.qtl
ECOGENE - nonstandard > Switch to standard prefix: ecogene
HPA_antibody - invalid
VBRC - nonstandard > Switch to standard prefix: vbrc
EO_GIT - invalid
EchoBASE - nonstandard > Switch to standard prefix: echobase
CASGEN - invalid
IUPHAR_RECEPTOR - nonstandard > Switch to standard prefix: iuphar.receptor
IRIC - invalid
GenBank - nonstandard > Switch to standard prefix: genbank
TGD - nonstandard > Switch to standard prefix: tgd
JCVI_EGAD - invalid
PubChem_BioAssay - nonstandard > Switch to standard prefix: pubchem.bioassay
TC - nonstandard > Switch to standard prefix: tcdb
SABIO-RK - nonstandard > Switch to standard prefix: sabiork.reaction
OBO_SF2_PECO - invalid
MetaCyc - nonstandard > Switch to standard prefix: metacyc.compound
PAMGO_GAT - invalid
ModBase - invalid
OMIM - nonstandard > Switch to standard prefix: omim
GR_MUT - invalid
HPA - nonstandard > Switch to standard prefix: hpa
IntAct - nonstandard > Switch to standard prefix: intact
ProDom - nonstandard > Switch to standard prefix: prodom
GRIN - invalid
WBPhenotype - nonstandard > Switch to standard prefix: wbphenotype
BioCyc - nonstandard > Switch to standard prefix: biocyc
ENSEMBL_GeneID - invalid
PIR - invalid
UniProtKB-KW - nonstandard > Switch to standard prefix: uniprot.keyword
Planteome_gene - invalid
AspGD - invalid
JCVI_Medtr - invalid
EuPathDB - invalid
PMCID - nonstandard > Switch to standard prefix: pmc
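
The three outcomes in this output (standard and therefore silent, nonstandard with a suggested replacement, and invalid) can be reproduced with a small lookup. This is a sketch under assumptions: `REGISTRY`, `build_lookup`, and `describe` are hypothetical names, the toy registry holds only a few mappings copied from the output above, and the matching rules of the real `bioregistry validate` command may differ.

```python
from typing import Dict, Optional, Set

# Toy registry: standard prefix -> alternative spellings. The pairs are
# copied from the output above; treating them as synonyms is an assumption.
REGISTRY: Dict[str, Set[str]] = {
    "go": {"GO"},
    "pubmed": {"PMID"},
    "uniprot": {"UniProtKB", "Swiss-Prot"},
    "wormbase": {"WB", "WB_REF"},
}


def build_lookup(registry: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map every standard prefix and every synonym to its standard prefix."""
    lookup = {}
    for standard, synonyms in registry.items():
        lookup[standard] = standard
        for synonym in synonyms:
            lookup[synonym] = standard
    return lookup


def describe(prefix: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return a report line for a problematic prefix, or None if it is standard."""
    standard = lookup.get(prefix)
    if standard is None:
        return f"{prefix} - invalid"
    if standard != prefix:
        return f"{prefix} - nonstandard > Switch to standard prefix: {standard}"
    return None


lookup = build_lookup(REGISTRY)
for prefix in ["go", "PMID", "COG_Function"]:
    line = describe(prefix, lookup)
    if line is not None:
        print(line)
```

Here "nonstandard" means the prefix is recognised but spelled differently from the standard form, while "invalid" means it could not be matched to any entry at all.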

matentzn commented 2 years ago

The output is amazing :)

Well, you are testing against the wrong context. Chris will insist that the biolink context will become the source of truth for all Monarch-related projects; GO has its own history. So the validation will have to be done against the “preferred context” of that group, which, of course, is registered at bioregistry like the obo context.

(Sent by email in reply to Charles Tapley Hoyt's comment of Tue, 2 Aug 2022 at 20:10, shown above.)

cthoyt commented 2 years ago

okay I will think about how this might work, since it would still be nice to make suggestions (but like you said, this should only support contexts registered in Bioregistry that are loved and cared for)