EBISPOT / ols4

Version 4 of the EMBL-EBI Ontology Lookup Service (OLS)
http://www.ebi.ac.uk/ols4/
Apache License 2.0

obo stats #1

Closed (jamesamcl closed 1 year ago)

jamesamcl commented 2 years ago

With owl2json, out of all of the ontologies in the OBO foundry:

186 loaded successfully

9 weren’t RDF/XML or had imports that weren’t:

doid
cto
cvdo
mfmo
ons
ro
upheno
mamo
vario

3 were invalid RDF:

uo: an issue has been open on their tracker since yesterday
ms: imports uo
genepio: invalid IRIs in the file contain unescaped spaces
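
The genepio failure (unescaped spaces inside IRIs) is the kind of problem that can be spotted before handing a file to an RDF parser. The following is a minimal sketch, not what owl2json does: it assumes the file is RDF/XML, that the `rdf:` prefix is used literally, and that attribute values are double-quoted. The function name `iris_with_spaces` and the sample document are hypothetical.

```python
import re

# Assumption: IRIs appear as double-quoted rdf:about / rdf:resource / rdf:datatype
# attribute values, and the rdf: prefix is spelled exactly like that.
IRI_ATTR = re.compile(r'rdf:(?:about|resource|datatype)\s*=\s*"([^"]*)"')


def iris_with_spaces(rdfxml_text):
    """Return (line_number, iri) pairs for IRI attributes that contain whitespace."""
    hits = []
    for lineno, line in enumerate(rdfxml_text.splitlines(), start=1):
        for iri in IRI_ATTR.findall(line):
            if re.search(r"\s", iri):
                hits.append((lineno, iri))
    return hits


# Hypothetical sample, not taken from genepio.
sample = """<rdf:RDF>
<owl:Class rdf:about="http://example.org/ok"/>
<owl:Class rdf:about="http://example.org/not ok"/>
</rdf:RDF>"""

for lineno, iri in iris_with_spaces(sample):
    print(f"line {lineno}: {iri!r}")
```

A line-based regex check like this is deliberately crude; it only reports candidates and does not replace a real RDF/XML parse.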

3 had protocol errors (e.g. 404)

ogi
ero
rnao

52 were marked as obsolete and had no top-level ontology_purl. Some of them still had an ontology_purl under the “products” field, but the current code doesn’t look at that.
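
A fallback to the “products” field could look roughly like the sketch below. It is not the current indexer code. It assumes each registry entry is a dict with an optional top-level `ontology_purl` and an optional `products` list whose items may carry their own `ontology_purl`. The helper name `resolve_purl` and the example entries are hypothetical.

```python
def resolve_purl(entry):
    """Return the top-level ontology_purl, else the first one found under products."""
    top_level = entry.get("ontology_purl")
    if top_level:
        return top_level
    # Assumption: "products" is a list of dicts, each optionally with "ontology_purl".
    for product in entry.get("products") or []:
        purl = product.get("ontology_purl")
        if purl:
            return purl
    return None


# Hypothetical entries for illustration only.
with_top_level = {"id": "aaa", "ontology_purl": "http://example.org/aaa.owl"}
products_only = {
    "id": "bbb",
    "is_obsolete": True,
    "products": [{"id": "bbb.owl", "ontology_purl": "http://example.org/bbb.owl"}],
}
no_purl = {"id": "ccc", "is_obsolete": True}

for entry in (with_top_level, products_only, no_purl):
    print(entry["id"], resolve_purl(entry))
```

Whether obsolete ontologies should be loaded at all is a separate decision; this only shows where a PURL could still be found.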

It took 15 minutes on codon to load the above in series, no parallelisation at all. It generated a 6.5 GB json file for everything combined.

jamesamcl commented 2 years ago

For that 6.5 GB JSON file, json2csv took 3 minutes and generated 765 MB of CSV.

This seems a suspiciously large difference, but I tried gzipping them to see how much ACTUAL data there was and not just repetition:

Those numbers are firmly in the same ballpark, so I think no data has been lost. All of OBO Foundry is actually pretty tiny, depending on how you represent it.

I also tried gzipping ALL of the OLS “downloads” folder from noah, so that’s all the OWL files from OBO and OLS’s ontologies, which also includes lots of obsolete stuff I didn’t index above. That compressed to 886 MB. So all of the data in OLS is actually only 886 MB when compressed!
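
The gzip comparison above can be reproduced without writing compressed files to disk. This is a minimal sketch that streams a file through zlib in gzip mode and only counts the output bytes. The function name `gzipped_size` and the toy JSON/CSV payloads are hypothetical, and the compression level (6) is an assumption, so the sizes will not match the gzip command line exactly.

```python
import io
import zlib


def gzipped_size(stream, chunk_size=1 << 20):
    """Return the number of bytes the stream would occupy after gzip compression."""
    # wbits=31 (16 + 15) makes zlib emit a gzip header and trailer.
    compressor = zlib.compressobj(level=6, wbits=31)
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(compressor.compress(chunk))
    total += len(compressor.flush())
    return total


# Toy data: the same records in a verbose and a compact representation.
rows = [(f"http://example.org/term_{i}", f"label {i}") for i in range(1000)]
as_json = "\n".join(f'{{"iri": "{iri}", "label": "{label}"}}' for iri, label in rows).encode()
as_csv = "\n".join(f"{iri},{label}" for iri, label in rows).encode()

for name, payload in (("json", as_json), ("csv", as_csv)):
    print(name, len(payload), gzipped_size(io.BytesIO(payload)))
```

For real files, pass `open(path, "rb")` instead of `io.BytesIO`; memory use stays bounded by `chunk_size`.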

matentzn commented 2 years ago

For

doid
cto
cvdo
mfmo
ons
ro
upheno
mamo
vario

can you list the import URLs that are not rdfxml? I may be able to fix these with a bit of a sledge hammer.

ogi: https://github.com/OBOFoundry/OBOFoundry.github.io/issues/1942
ero (inactive on obo, URL of ontology redirects to website): https://github.com/OBOFoundry/OBOFoundry.github.io/issues/1942
rnao: Resolves: http://purl.obolibrary.org/obo/rnao.owl

jamesamcl commented 1 year ago

Hi @matentzn

I checked the latest indexer run and these seem to be the OBO ontologies we still have a problem with:

| ontology id | purl | problem |
| --- | --- | --- |
| mamo | http://purl.obolibrary.org/obo/mamo.owl | OWL XML |
| vario | http://purl.obolibrary.org/obo/vario.owl | OWL ?? |
| gaz | http://purl.obolibrary.org/obo/gaz.obo | OBO |
| dinto | http://purl.obolibrary.org/obo/dinto.owl | Redirects to the github repo |
| eo | http://purl.obolibrary.org/obo/eo.owl | Redirects to https://raw.githubusercontent.com/Planteome/plant-environment-ontology/master/plant-environment-ontology.obo.owl which is 404 |
| epo | http://purl.obolibrary.org/obo/epo.owl | Redirects to https://epidemiology-ontology.googlecode.com/files/epidemiology_ontology.owl which is 404 |
| ero | http://purl.obolibrary.org/obo/ero.owl | Redirects to https://open.catalyst.harvard.edu/products/eagle-i/ which is a HTML page not an ontology |
| flu | http://purl.obolibrary.org/obo/flu.owl | Imports http://purl.obolibrary.org/obo/ido/2010-12-02/ido-main-workaround.owl which is 404 |
| mfo | http://purl.obolibrary.org/obo/mfo.owl | Redirects to https://obofoundry.org/ not an ontology |
| mirnao | http://purl.obolibrary.org/obo/mirnao.owl | Redirects to http://mirna-ontology.googlecode.com/svn/trunk/src/ontology/mirnao.owl which is 404 |
| mo | http://purl.obolibrary.org/obo/mo.owl | Redirects to http://ontologies.berkeleybop.org/ which is not an ontology |
| nmr | http://purl.obolibrary.org/obo/nmr.owl | Redirects to http://ontologies.berkeleybop.org/ which is not an ontology |
| ogi | http://purl.obolibrary.org/obo/ogi.owl | Redirects to https://ontology-for-genetic-interval.googlecode.com/svn/trunk/src/OGI.owl which is 404 |
| sep | http://purl.obolibrary.org/obo/sep.owl | Redirects to http://ontologies.berkeleybop.org/sep.owl NoSuchKey |
| vhog | http://purl.obolibrary.org/obo/vhog.owl | Redirects to http://ontologies.berkeleybop.org/vhog.owl NoSuchKey |

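
Most of the problems in the table can be told apart by looking at the first few kilobytes of whatever the PURL finally resolves to. Below is a minimal sketch of such a check, assuming the response body has already been fetched and decoded to text. The function name `sniff_format`, the category labels, and the heuristics are all hypothetical and deliberately rough; this is not how the OLS4 indexer decides.

```python
def sniff_format(head):
    """Guess what kind of document a response body is from its first characters."""
    text = head.lstrip("\ufeff \t\r\n")
    lowered = text[:2048].lower()
    # Plain-text ontology syntaxes are checked first because they are unambiguous.
    if text.startswith("format-version:"):
        return "obo"  # OBO flat file, e.g. gaz.obo
    if text.startswith("Prefix(") or text.startswith("Ontology("):
        return "ofn"  # OWL functional syntax
    if lowered.startswith("<!doctype html") or "<html" in lowered:
        return "html"  # a landing page, not an ontology
    if "<rdf:rdf" in lowered:
        return "rdfxml"
    if "<ontology" in lowered and "http://www.w3.org/2002/07/owl#" in lowered:
        return "owlxml"  # OWL/XML has <Ontology> as the root element
    if "nosuchkey" in lowered:
        return "error"  # object-store style error document
    return "unknown"


samples = {
    "rdfxml": '<?xml version="1.0"?>\n<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">',
    "owlxml": '<?xml version="1.0"?>\n<Ontology xmlns="http://www.w3.org/2002/07/owl#">',
    "obo": "format-version: 1.2\nontology: example\n",
    "ofn": "Prefix(:=<http://example.org/>)\nOntology(<http://example.org/o>)",
    "html": "<!DOCTYPE html>\n<html><head></head></html>",
    "error": "<Error><Code>NoSuchKey</Code></Error>",
}

for expected, body in samples.items():
    print(expected, "->", sniff_format(body))
```

HTTP-level failures such as a 404 would still need to be handled by whatever does the fetching; this only covers the case where a body comes back.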
matentzn commented 1 year ago

I manually checked most of these. I would personally suggest restricting OLS to only active ontologies in OBO:

https://obofoundry.org/

All of the ones you listed here (or most of them; I didn't check all) are obsolete or inactive. OBO Foundry does not recommend the use of non-active ontologies (i.e. they are hidden on https://obofoundry.org/).

jamesamcl commented 1 year ago

@matentzn I manually checked them too to make the table. I didn't realise they were obsolete/inactive. However, they will be completely absent (= 404) from OLS when we ship OLS4 if we do not load them. Will this be an issue?

henrietteharmse commented 1 year ago

In general I am happy with not loading inactive ontologies. However, even if an ontology is inactive, it can still be in use, and we cannot drop its availability, particularly when there seems to be no alternative. I think MAMO is a good example of this: it is used at EBI by the BioModels team.

A way around this is to not load inactive OBO ontologies. In a case like MAMO we can add it to the EBI OLS config with the URL pointing to the file system.
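
The selection rule described above might be sketched as follows. This is an illustration under assumptions, not the OLS4 loader: the field names `is_obsolete` and `activity_status`, the shape of the entries, the function name `select_ontologies`, and the `file:///path/to/mamo.owl` location are all hypothetical.

```python
def select_ontologies(registry_entries, local_overrides):
    """Keep active OBO entries, then add or replace entries from the local config."""
    selected = {}
    for entry in registry_entries:
        # Assumption: obsolete entries carry is_obsolete=True and inactive ones
        # have an activity_status other than "active".
        if entry.get("is_obsolete"):
            continue
        if entry.get("activity_status", "active") != "active":
            continue
        selected[entry["id"]] = entry
    # Local config wins, so an inactive ontology such as MAMO can still be loaded.
    for entry in local_overrides:
        selected[entry["id"]] = entry
    return list(selected.values())


# Hypothetical registry and local config entries.
registry = [
    {"id": "aaa", "activity_status": "active", "ontology_purl": "http://example.org/aaa.owl"},
    {"id": "mamo", "activity_status": "inactive", "ontology_purl": "http://purl.obolibrary.org/obo/mamo.owl"},
    {"id": "zzz", "is_obsolete": True},
]
local = [{"id": "mamo", "ontology_purl": "file:///path/to/mamo.owl"}]

for entry in select_ontologies(registry, local):
    print(entry["id"], entry["ontology_purl"])
```

Dropping the `activity_status` check would give the more conservative variant that excludes only obsolete ontologies.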

matentzn commented 1 year ago

@henrietteharmse I think your suggestion is the way to go.

Maybe be a bit more conservative for now and only exclude obsolete ontologies from OBO to start with. If you supply me with a list, @udp, of the remaining (non-obsolete, breaking) ones, I can maybe reach out to the groups and use OLS inclusion to get them to up their game a bit and fix their ontologies.

jamesamcl commented 1 year ago

@matentzn We currently have an issue with RO. Though the core file is RDF/XML:

https://raw.githubusercontent.com/oborel/obo-relations/master/ro.owl

it imports this file: https://raw.githubusercontent.com/oborel/obo-relations/master/chemical.owl which is in functional syntax.

Issue opened here: https://github.com/oborel/obo-relations/issues/673
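
To find imports like this one, the `owl:imports` targets of an RDF/XML file can be listed and then checked one by one. The sketch below uses only the standard library and assumes the root file really is RDF/XML with imports written as `owl:imports` elements carrying an `rdf:resource` attribute. The function name `list_imports` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

OWL_IMPORTS = "{http://www.w3.org/2002/07/owl#}imports"
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"


def list_imports(rdfxml):
    """Return the IRIs referenced by owl:imports in an RDF/XML document."""
    root = ET.fromstring(rdfxml)
    return [
        element.attrib[RDF_RESOURCE]
        for element in root.iter(OWL_IMPORTS)
        if RDF_RESOURCE in element.attrib
    ]


# Hypothetical sample document, not the real ro.owl.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Ontology rdf:about="http://example.org/core.owl">
    <owl:imports rdf:resource="http://example.org/imported.owl"/>
  </owl:Ontology>
</rdf:RDF>"""

print(list_imports(sample))
```

When reading real files, pass the raw bytes (for example from `open(path, "rb").read()`), because `ET.fromstring` rejects a `str` that starts with an XML declaration naming an encoding. Each returned IRI would then need to be fetched and its syntax checked separately.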

matentzn commented 1 year ago

This will be solved soon by @anitacaron, the solution is already there, we just need time to review and implement it.

allenbaron commented 1 year ago

For the Human Disease Ontology (doid), would loading the doid-merged.owl file (http://purl.obolibrary.org/obo/doid/doid-merged.owl), which has all imports loaded in, fix this issue?

@lschriml, fyi.

allenbaron commented 1 year ago

The doid file that isn't RDF/XML was our ext.owl file (in OFN). We recently switched it to RDF/XML because other people were experiencing parsing issues (https://github.com/DiseaseOntology/HumanDiseaseOntology/issues/1112).

anitacaron commented 1 year ago

@udp, can you confirm that RO is not having issues anymore, please?