INCATools / ontology-access-kit

Ontology Access Kit: A python library and command line application for working with ontologies
https://incatools.github.io/ontology-access-kit/
Apache License 2.0

Document if and when to write plugins, and what the alternatives are #171

Open cmungall opened 2 years ago

cmungall commented 2 years ago

Currently we have docs on how to write plugins:

https://incatools.github.io/ontology-access-kit/howtos/write-a-plugin.html

From the docs:


A plugin is a software component that adds a specific feature to an existing computer program. When a program supports plug-ins, it enables customization (Wikipedia).

In the context of OAK, plugins allow you to add optional functionality to OAK.

An example of a plugin would be code that uses a machine learning library such as spaCy or NLTK to provide an implementation of a text_annotator. We wouldn’t want to include this in the main library due to the additional dependencies required. But we might want to allow people to drop in this functionality, e.g.

runoak -i my_awesome_plugin: annotate "my text here"

We have plans to do exactly that: to use scispacy for annotation via OntoRunner. I think introducing a dependency on scispacy would be a bad idea. It is a heavyweight dependency and could reduce portability. More generally, the more dependencies we have, the larger the surface area for things going wrong.
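As an illustration of how a selector such as `my_awesome_plugin:` could be mapped to an implementation, here is a minimal sketch based on Python's standard entry-point mechanism. It is not OAK's actual code: the group name `oaklib.plugins` is assumed here, and `installed_entry_points` and `resolve` are hypothetical helper names.

```python
from collections import OrderedDict
from importlib.metadata import EntryPoint, entry_points

# Assumed entry-point group name, used for illustration only.
PLUGIN_GROUP = "oaklib.plugins"


def installed_entry_points(group=PLUGIN_GROUP):
    """Return the entry points that installed packages advertise for `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        return list(eps.select(group=group))
    return list(eps.get(group, []))  # Python 3.8 / 3.9 return a plain dict


def resolve(selector, candidates):
    """Split 'scheme:resource' and load the plugin registered under 'scheme'."""
    scheme, _, resource = selector.partition(":")
    for ep in candidates:
        if ep.name == scheme:
            # The plugin's module (and its heavy dependencies) is imported only here.
            return ep.load(), resource
    raise ValueError(f"no plugin registered for scheme {scheme!r}")


# Stand-in for a real plugin package: an entry point that points at a stdlib class.
fake = EntryPoint(
    name="my_awesome_plugin", value="collections:OrderedDict", group=PLUGIN_GROUP
)
implementation, resource = resolve("my_awesome_plugin:", [fake])
```

The point of the design is that the core package never imports the plugin's dependencies; they are loaded only when a user explicitly selects the plugin.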

We also have a plugin for robot: https://github.com/INCATools/oakx-robot

Perhaps this one could be brought into the core - it doesn't actually bring robot in as a dependency per se, only py4j.

An alternative to plugins is a dynamic import statement, as in this PR:

https://github.com/INCATools/ontology-access-kit/pull/159

I am not a big fan of this approach.

In that PR @cthoyt states:

I really want to stress that plug-ins should only be used by external people who want to extend the package but not in an open source way - eg the million different repo approach isn’t sustainable (I learned the hard way in bio2bel)

I am not sure I follow this. I would certainly prefer that people contribute back open code, but everyone is free to extend OAK, adapt it, or make plugins in whichever way they please, open or not. And I think we can actively work to encourage the right people to develop key plugins.

My proposed guidelines would be:

  1. If it doesn't introduce overly heavyweight dependencies, include it directly in this repo; no plugin is needed.
  2. If it is a library or application that reuses OAK, then just depend on OAK in the standard way.
  3. If it introduces complicated dependencies, and the functionality is optional or provides only a modest gain (e.g. improved NER/CR via ML, enhanced semantic similarity via an R bridge), then it is a good candidate for a plugin.
  4. Avoid dynamic imports, except as part of the plugin system itself.

This still leaves things open. Would pandas be considered heavyweight? Probably not. What about scikit-learn? On the fence.
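For readers unfamiliar with the pattern that guideline 4 discourages, a dynamic import for an optional dependency typically looks something like the sketch below. This is a generic illustration, not code taken from the PR mentioned above; `require` and the package name `mypackage` are hypothetical.

```python
from importlib import import_module


def require(module_name, extra):
    """Import an optional dependency at call time, or fail with an install hint."""
    try:
        return import_module(module_name)
    except ImportError as err:
        # 'mypackage' is a placeholder for whichever package declares the extra.
        raise ImportError(
            f"{module_name!r} is needed for this feature; "
            f"install it with: pip install 'mypackage[{extra}]'"
        ) from err


# Works for anything importable; 'json' stands in for a heavy optional library.
json_module = require("json", extra="serialization")
```

The trade-off is that every code path touching the optional library needs such a guard, and a missing dependency only surfaces at runtime, which is part of why confining this to the plugin system keeps the core simpler.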

deepakunni3 commented 2 years ago

@cmungall I like your proposed guidelines above.

Couple of thoughts about plugins:

I agree that any external heavyweight dependency that is not critical for oak's functionality should be optional, and plugins can be a way to provide that optionality.

Plugins can be implemented in a couple of different ways and the approach we take will matter in the long run.

If we want to have a community that will create, maintain, and share plugins then we need a plugin system which one can use to define a plugin and discover other plugins (à la plugin registry).

While prefixing a repository name with oak signals that it is an oak plugin, it is an imprecise way to define plugins. Searching GitHub returns thousands of repos with oak in their name, which makes plugin discovery a challenge. The convention constrains how a repo is named and nothing else.

I think it would be much more helpful if we provided guidelines on the repo structure (code, tests, examples) and each repo had a plugin config (used to define the plugin state). This could also be packaged as a repo template for anyone wanting to make a plugin. Such a structure would also help us ensure plugin discovery.

I looked into a couple of ways to do plugins for linkml-validator and I settled on:

While the approach I took doesn't allow for discovery, it does allow custom validation scenarios to be hooked in as plugins at runtime.
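One common way to hook plugins in at runtime without a discovery mechanism is to name each plugin class by its dotted path in a config and import it on demand. The sketch below shows that general technique under stated assumptions; it is not linkml-validator's actual code, and `load_class`, `instantiate_plugins`, and the config keys `class` and `args` are hypothetical.

```python
from collections import OrderedDict
from importlib import import_module


def load_class(dotted_path, expected_base=object):
    """Import 'package.module.ClassName' and check that it has the expected base."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    cls = getattr(import_module(module_name), class_name)
    if not (isinstance(cls, type) and issubclass(cls, expected_base)):
        raise TypeError(f"{dotted_path} is not a subclass of {expected_base.__name__}")
    return cls


def instantiate_plugins(config, expected_base=object):
    """Build one plugin instance per config entry, passing 'args' as keyword arguments."""
    return [
        load_class(entry["class"], expected_base)(**entry.get("args", {}))
        for entry in config
    ]


# A stdlib class stands in for a real plugin so that the sketch is self-contained.
config = [{"class": "collections.OrderedDict", "args": {"a": 1}}]
plugins = instantiate_plugins(config, expected_base=dict)
```

Because the user has to know and spell out the class path, this supports runtime hooking but not discovery, which matches the limitation described above.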

I agree with @cthoyt's concern that plugins can be difficult, but that is only the case if we require them to be in the same repo as oak. If they live outside of oak, then we should provide a lightweight framework for creating and discovering plugins.

oak is a library that is trying to solve many problems by abstracting away many details and providing a common interface. That means it is extensible by design, and multiple implementations of such a common interface are highly likely and encouraged.

This is definitely something we should brainstorm and explore more :)

On a separate note: I think https://github.com/INCATools/oakx-robot should be subsumed into oak.