INCATools / ontology-development-kit

Bootstrap an OBO Library ontology
http://incatools.github.io/ontology-development-kit/
BSD 3-Clause "New" or "Revised" License
228 stars 54 forks

Dependencies between imports cause incomplete seeding when using base releases #174

Closed: matentzn closed this issue 2 years ago

matentzn commented 5 years ago

When using base releases, seeding is currently incomplete. Example:

- O imports GO and PR
- GO:1 belongs to GO
- PR uses GO:1

Starting point: O is empty. GO is extracted first, so the GO module does not contain GO:1. Then PR is extracted: O now contains GO:1, but since GO has already been extracted, the axioms GO:1 depends on are missing.
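The order dependence can be shown with a toy simulation (plain Python, made-up ontology contents; the real pipeline uses ROBOT's SLME extraction, which this crudely stands in for):

```python
# Toy model of the seeding problem: each "ontology" maps a term to the
# terms its axioms reference. Contents are invented for illustration.
GO = {"GO:1": set(), "GO:2": {"GO:1"}}
PR = {"PR:1": {"GO:1"}}  # PR uses GO:1

def extract_module(ontology, seed):
    """Keep only the axioms for seed terms (a crude stand-in for SLME)."""
    return {term: deps for term, deps in ontology.items() if term in seed}

def run_pipeline(seed):
    """Extract GO first, then PR, growing the seed as new terms appear."""
    modules = {}
    for name, onto in [("GO", GO), ("PR", PR)]:
        modules[name] = extract_module(onto, seed)
        for deps in modules[name].values():
            seed |= deps  # terms referenced by PR (e.g. GO:1) arrive too late
    return modules, seed

# First pass: the seed only contains PR:1, so GO is extracted before
# we learn that PR needs GO:1 -- the GO module is incomplete.
modules, seed = run_pipeline({"PR:1"})
assert "GO:1" not in modules["GO"]

# Second pass with the grown seed: now the GO module contains GO:1.
modules, seed = run_pipeline(seed)
assert "GO:1" in modules["GO"]
```

The second pass succeeds only because the first pass grew the seed; this is the "run twice" workaround discussed below.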

I believe the problem can only be remedied by running the imports pipeline twice; or does anyone have a better idea? @cmungall @balhoff

balhoff commented 5 years ago

This situation is why I advocate merging all imports before doing the extraction in here: https://github.com/balhoff/ultimate-ontology-makefile

But then we don't have our separate "go_import", "pr_import", etc. I am fine with that, but it's a bigger change to old workflows. This kind of goes along with publishing two versions: a base file and a fully merged file.

matentzn commented 5 years ago

Yeah, I moved away from that solution because of the memory consumption. Once the set of all imports gets too large, Travis jobs start failing, and even desktop computers run out of memory.

matentzn commented 5 years ago

So, Jim is proposing a web-service-based solution for this problem (this would at least take care of the memory and storage limitations of Travis). This leads me to a question I have been meaning to ask for a while: @cmungall do we have resources to deploy a variety of reasoning services for OBO ontologies somewhere in your infrastructure? Something like web services for Owlery deployments of the main ontologies (this is more than OLS: this is about being able to have DL query endpoints, module extraction services, etc.).

Jim's idea for this issue is to deploy a service that allows you to extract a module from an arbitrary union of ontologies. Obviously we would need the usual fallbacks, load balancing, etc. But I think it's worth a thought!

cmungall commented 5 years ago

How will this work with the need to pin releases to versionIRIs? We'd need a triplestore with all versions loaded. If we just want the most recent one, then we may as well use Ontobee (and in fact, why not just use OntoFox?).

matentzn commented 5 years ago

Yeah, you could not pin a release to a version, that's true. Though the web service could of course have a config file that takes care of that.

In general, can you see any other way to solve this problem? Some kind of smart ordering of import goals, such that dependants are extracted right before the ontologies they depend on, with seed.txt regenerated each time? Or any other idea? Or simply run twice :/

balhoff commented 5 years ago

Working off a triplestore would require implementing the SLME algorithm over SPARQL. I think that would be cool, but it would be quicker to load the ontologies into the OWL API with lots of memory.

On the other hand I regularly consider managing a triplestore containing all historical versions of OBO ontologies.

Am I right that OntoFox does MIREOT but not SLME? Is MIREOT sufficient?

cmungall commented 5 years ago

Is twice guaranteed to be enough? We have some reciprocal dependencies, so to guarantee completeness, would you not need to iterate until saturation?
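One way to sidestep the question of how many passes are needed is to iterate until the seed stops growing. A toy sketch (made-up ontology contents; each pass follows one dependency hop, so a longer chain simply takes more passes):

```python
# Sketch: iterate seeding until the seed stops growing (a fixpoint),
# instead of guessing how many passes suffice. Contents are invented.
ONTOLOGIES = {
    "PR": {"PR:1": {"GO:1"}},   # PR uses a GO term...
    "GO": {"GO:1": {"CL:1"}},   # ...which uses a CL term
    "CL": {"CL:1": set()},
}

def grow_seed(seed):
    """One seeding pass: add every term referenced from the current seed."""
    grown = set(seed)
    for onto in ONTOLOGIES.values():
        for term, deps in onto.items():
            if term in seed:  # only follow terms known before this pass
                grown |= deps
    return grown

seed, passes = {"PR:1"}, 0
while True:
    passes += 1
    grown = grow_seed(seed)
    if grown == seed:  # saturation: no new terms appeared
        break
    seed = grown

# Two passes each add one hop (GO:1, then CL:1); a third confirms saturation.
```

With reciprocal dependencies the loop still terminates, since the seed can only grow and the total signature is finite.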

Re: Travis. Are you saying Travis can't handle building the imports, or that a merged import file causes issues even when reasoning? We could simply not have Travis make the imports, and make this the job of an ontology release manager.

matentzn commented 5 years ago

I will think about all of the above tomorrow. Do you think people would be cool with a one-module solution, i.e. all imports in one module?

cmungall commented 5 years ago

I wonder if memory issues are caused by annotations and axiom annotations. What about:

1. Merge all base logical axioms in the ontologies of interest
2. SLME to get all classes and logical axioms
3. Re-SLME on each mirrored ontology, using the signature from step 2 as the seed

Maybe this is what you meant by do it twice?

Also, which ontologies are the memory hogs? I'm guessing:

We already release a taxslim which is good for many purposes. I've asked PR before for a species-neutral subset; I don't think any ontology ever needs the part that duplicates UniProt or below.

matentzn commented 5 years ago

I am just worried that steps 1 and 2 will take too long and cause memory exceptions if huge non-base ontologies are in the mix. I will try it out next week.

matentzn commented 5 years ago

@balhoff What do you think of Chris's suggestion?

  1. First extract the logical axioms.
  2. Merge them together into imp-log-merged.owl.
  3. Extract an SLME module with Sig(edit file without imports) as the seed.
  4. Replace the seed used for module extraction with Sig(imp-log-merged.owl).
  5. Leave everything else as is.

Or do you still favour the merge-all, extract-one approach, and just assume the client has 8 GB of main memory to do this?

matentzn commented 4 years ago

I am currently working on this revised system for dynamic imports. It is complex, but it's a big thorn in my side, as it has been broken for quite a while now (the last ODK thorn; the rest is mostly cosmetics).

[image: diagram of the revised dynamic imports pipeline]

The idea is this:

  1. We make the DOSDP patterns pipeline truly independent of the whole thing: no more cross-dependency weirdness like extracting the seed from the TSV files. You should be able to run the pattern pipeline without regard for anything else.
    • mirror/go_label.owl is the set of all label triples in GO {GO:001 rdfs:label "X process"; ... }
    • dosdp-dict.owl is the merge of all mirror/x_label.owl files. This file is used to provide labels when running dosdp-generate. Note that in order to use it for matching as well, we would need to extend it to subclass-of axioms plus labels, which is considerably more expensive as it requires reasoning.
  2. The import pipeline (blue blob) relies on the @cmungall design suggestion of logical subsets of mirrors.
    1. A pre-seed.txt is extracted from definitions.owl, o-edit.owl, and any component files (remember, components are files that formally belong to the edit file but are managed as separate artefacts, like maxo-obs.owl).
    2. mirror/go_logical.owl is a subset of GO that contains only logical axioms. This ensures that if we do a module extraction, the memory footprint is not too large.
    3. mirror_logical_merged.owl merges all mirror/x_logical.owl into one.
    4. A bottom module mirror_logical_module.owl is extracted from mirror_logical_merged.owl, using the seed pre-seed.txt.
    5. The proper seed.txt is extracted from mirror_logical_module.owl, which now contains absolutely all terms we need to build the modules.
    6. As usual, this seed.txt is used to extract the modules, like imports/go_import.owl.
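The steps above can be sketched as a toy pipeline (plain Python standing in for the ROBOT calls; ontology contents, the dependency model, and file names in the comments are invented):

```python
# Toy walk-through of the proposed import pipeline. Each "ontology" maps a
# term to the terms its logical axioms reference; annotation axioms are
# modelled as a separate set that the logical subsets deliberately drop.
MIRRORS = {
    "go": {"logical": {"GO:1": {"GO:2"}, "GO:2": set()},
           "annotations": {"GO:1", "GO:2"}},
    "pr": {"logical": {"PR:1": {"GO:1"}},
           "annotations": {"PR:1"}},
}

# Step 1: pre-seed from the edit file (plus components/definitions).
pre_seed = {"PR:1"}

# Step 2: mirror/x_logical.owl -- logical axioms only, small memory footprint.
logical_subsets = {name: m["logical"] for name, m in MIRRORS.items()}

# Step 3: mirror_logical_merged.owl -- merge all logical subsets.
merged = {}
for subset in logical_subsets.values():
    merged.update(subset)

# Step 4: bottom module from the merged logical axioms, seeded by pre_seed
# (a crude stand-in for SLME BOT: follow references transitively).
module, frontier = {}, set(pre_seed)
while frontier:
    term = frontier.pop()
    if term in merged and term not in module:
        module[term] = merged[term]
        frontier |= merged[term]

# Step 5: the proper seed is the full signature of that module.
seed = set(module) | set().union(*module.values())

# Step 6: extract the per-ontology modules with the now-complete seed.
imports = {name: {t: d for t, d in m["logical"].items() if t in seed}
           for name, m in MIRRORS.items()}
```

Because the bottom module is computed over the merged logical axioms, GO:2 (needed only transitively via PR:1 → GO:1) ends up in the seed before any per-ontology module is extracted, which is exactly what the one-pass ordering in this issue gets wrong.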

This looks complex, but I think it's sound. Please review this, @cmungall @dosumis @balhoff, as I want to implement it ASAP.

balhoff commented 4 years ago

@matentzn can we discuss this on an ODK call?

matentzn commented 4 years ago

I tried it a few times... we can try to discuss it again :P It really needs to be solved, and without external dependencies, I think :(

matentzn commented 3 years ago

We have been playing with the idea of using a merged ontology to extract modules from, but there are worries about losing fine-grained control, like avoiding pulling from a "current" version that broke something in my own ontology. We have now decided that, after all, we should keep an "ontology by ontology" workflow, and maybe even boost that to something more in the direction of Maven. I will implement the union-extract technique after all, using the union. We will solve the CHEBI, PRO, NCBITAXON issue separately.

matentzn commented 2 years ago

This is now addressed since 1.2.32 with the new BASE pipeline. Yay.