Closed: jamesamcl closed this issue 1 year ago
For that 6.5 GB JSON file, json2csv took 3 minutes and generated 765 MB of CSV
This seems like a suspiciously large difference, so I tried gzipping both to see how much actual data there was rather than just repetition:
Those numbers are firmly in the same ballpark, so I don't think any data has been lost; all of OBO Foundry is actually pretty tiny depending on how you represent it.
I also tried gzipping ALL of the OLS “downloads” folder from noah, so that’s all the OWL files from OBO and OLS’s ontologies, which also includes lots of obsolete stuff I didn’t index above. That compressed to 886 MB. So all of the data in OLS is actually only 886 MB when compressed!
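For anyone wanting to reproduce the comparison, here is a small sketch (assuming Python; not the script actually used) that measures a file's gzip-compressed size without writing the compressed copy to disk:

```python
import zlib

def gzipped_size(path: str, chunk_size: int = 1 << 20) -> int:
    """Return the size in bytes of `path` after gzip compression.

    Streams the file through a zlib compressor (wbits=31 selects the
    gzip container) so even a 6.5 GB JSON file never has to fit in
    memory, and nothing is written to disk.
    """
    comp = zlib.compressobj(wbits=31)
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(comp.compress(chunk))
    return total + len(comp.flush())
```

Comparing `gzipped_size(json_path)` against `gzipped_size(csv_path)` gives the "actual data" comparison described above.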
For
doid
cto
cvdo
mfmo
ons
ro
upheno
mamo
vario
can you list the import URLs that are not RDF/XML? I may be able to fix these with a bit of a sledgehammer.
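That check can be done heuristically; here is a minimal sketch (assuming Python, and not the indexer's actual code) that guesses a file's OWL serialisation from its first few bytes:

```python
import re

def sniff_owl_syntax(head: str) -> str:
    """Guess the OWL serialisation from the first few KB of a file.

    A crude sniff, not a parser: RDF/XML has an <rdf:RDF> root element
    (after an optional <?xml?> declaration), OWL functional syntax
    opens with Prefix(...) or Ontology(...), and Turtle starts with
    @prefix/@base directives.
    """
    head = head.lstrip("\ufeff \t\r\n")
    if re.search(r"<(rdf:)?RDF[\s>]", head):
        return "rdfxml"
    if re.match(r"(Prefix|Ontology)\s*\(", head):
        return "functional"
    if head.startswith(("@prefix", "@base")):
        return "turtle"
    return "unknown"
```

Fetching the first couple of KB of each import URL and passing it through this would flag the non-RDF/XML ones.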
ogi: https://github.com/OBOFoundry/OBOFoundry.github.io/issues/1942
ero (inactive on OBO; the ontology URL redirects to a website): https://github.com/OBOFoundry/OBOFoundry.github.io/issues/1942
rnao: Resolves: http://purl.obolibrary.org/obo/rnao.owl
Hi @matentzn
I checked the latest indexer run and these seem to be the OBO ontologies we still have a problem with:
I manually checked most of these. I would personally suggest restricting OLS to only the active ontologies in OBO:
All of the ones you listed here (or most of them; I didn't check all) are obsolete or inactive. OBO Foundry does not recommend the use of non-active ontologies (i.e. they are hidden on https://obofoundry.org/).
@matentzn they were manually checked by me too, to make the table. I didn't realise they were obsolete/inactive. However, they will be completely absent (= 404) from OLS when we ship OLS4 if we do not load them. Will this be an issue?
In general I am happy with not loading inactive ontologies. However, even if an ontology is inactive, it can still be used, and we cannot drop its availability, particularly when there seems to be no alternative. I think MAMO is a good example of this; it is used at EBI by the BioModels team.
A way around this is to not load inactive OBO ontologies. In a case like MAMO we can add it to the EBI OLS config with the URL pointing to the file system.
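That workaround could look something like the following config fragment. This is a hedged sketch only: the field name mirrors the OBO registry's `ontology_purl`, and the file path is a placeholder, not a real location in the EBI OLS config.

```json
{
  "id": "mamo",
  "ontology_purl": "file:///nfs/ols/ontologies/mamo/mamo.owl"
}
```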
@henrietteharmse I think your suggestion is the way to go.
Maybe be a bit more conservative for now and only exclude obsolete ontologies from OBO to start with. @udp, if you supply me with a list of the remaining (non-obsolete, breaking) ones, I can maybe reach out to the groups and use OLS inclusion as an incentive for them to up their game a bit and fix their ontologies.
@matentzn We currently have an issue with RO. Though the core file is RDF/XML:
https://raw.githubusercontent.com/oborel/obo-relations/master/ro.owl
it imports this file: https://raw.githubusercontent.com/oborel/obo-relations/master/chemical.owl which is in functional syntax.
Issue opened here: https://github.com/oborel/obo-relations/issues/673
This will be solved soon by @anitacaron, the solution is already there, we just need time to review and implement it.
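As a side note, this kind of broken import can be found mechanically. Here is a sketch (assuming Python; not part of OLS) that pulls the owl:imports IRIs out of an RDF/XML document such as ro.owl, so each import can be fetched and syntax-checked in turn:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"

def import_iris(rdfxml_text: str) -> list[str]:
    """List the owl:imports IRIs declared in an RDF/XML ontology."""
    root = ET.fromstring(rdfxml_text)
    return [
        imp.get(f"{{{RDF_NS}}}resource")
        for imp in root.iter(f"{{{OWL_NS}}}imports")
    ]
```

Only RDF/XML inputs are handled here, which is fine for this use: the top-level file parses, and it is the imported files whose syntax is in question.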
For the Human Disease Ontology (doid
), would loading the doid-merged.owl file (http://purl.obolibrary.org/obo/doid/doid-merged.owl), which has all imports loaded in, fix this issue?
@lschriml, fyi.
The doid file that isn't RDF/XML was our ext.owl file (in OFN). We recently switched it to RDF/XML because other people were experiencing parsing issues (https://github.com/DiseaseOntology/HumanDiseaseOntology/issues/1112).
@udp, can you confirm that RO is not having issues anymore, please?
With owl2json, out of all of the ontologies in the OBO foundry:
186 loaded successfully
9 weren’t RDF/XML or had imports that weren’t:
3 were invalid RDF:
3 had protocol errors (e.g. 404)
52 were marked as obsolete and had no top-level ontology_purl. Some of them still had an ontology_purl under the “products” field, but the current code doesn’t look at that.
It took 15 minutes on codon to load the above in series, with no parallelisation at all. It generated a 6.5 GB JSON file for everything combined.
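The missing-purl case could be handled with a fallback. Here is a sketch (assuming Python, and assuming the registry entry shape of the OBO Foundry metadata, where both the top level and each entry under "products" may carry an ontology_purl):

```python
from typing import Optional

def resolve_purl(entry: dict) -> Optional[str]:
    """Find an ontology_purl for a registry entry.

    Falls back to the first product that has one when the
    top-level field is missing, e.g. for obsolete ontologies.
    """
    if entry.get("ontology_purl"):
        return entry["ontology_purl"]
    for product in entry.get("products", []):
        if product.get("ontology_purl"):
            return product["ontology_purl"]
    return None
```

Whether the 52 obsolete entries should be loaded at all is the separate policy question discussed above; this only shows where a usable URL could come from.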