NeurodataWithoutBorders / nwb-schema

Data format specification schema for the NWB neurophysiology data format
http://nwb-schema.readthedocs.io

Add ontology support #1

Open ajtritt opened 7 years ago

ajtritt commented 7 years ago

Originally reported by: Andrew Tritt (Bitbucket: ajtritt, GitHub: ajtritt)



neuromusic commented 6 years ago

FYI the link above to the bitbucket issue is offline

bendichter commented 6 years ago

@neuromusic @ajtritt @oruebel

I would like to reopen this discussion, because I think supporting ontologies is going to be critical for the scalability of NWB. I understand that raw strings were used as placeholders, but if we don't change that before the 2.0 release we could end up with a big mess.

First things first: should we enforce specific ontologies? I think we should, but I can also see the trade-offs. Maybe someone comes along and says that e.g. the Allen Atlas is insufficient to describe their anatomical labeling, and they require some other ontology. Should we give people the ability to use their own ontologies, like the current framework for extensions? On the other hand, then we open the floodgates for people to use any old garbage. I'm on the side of enforcing ontologies and providing text fields in case our choice is insufficient. That's an important design decision, though, that ought to be discussed, and I'd be interested in your thoughts.

I would like to establish an ontology for:

My criteria for ontologies would be:

For mouse brain regions I think the Allen Mouse Atlas makes the most sense. Are there any other candidates? Human brain labeling is an enormous can of worms. Allen's human atlas seems good to me but there are other contenders; let's worry about that later.

For species ontology I think the scientific community has pretty much reached consensus, but maybe there are ontological debates I don't know about.

For cell type I don't really know about the available ontologies. I hear Allen is working on this too?

bendichter commented 6 years ago

@neuromusic my web scraping skills aren't what they used to be. Would it be easy to get us the atlas label tree from the mouse brain atlas link above as a json?

neuromusic commented 6 years ago

best approach here IMO would be to defer to the experts at NIF like @tgbugs https://bioportal.bioontology.org/ontologies/NIFSTD

tgbugs commented 6 years ago

Short version.

Let me know how I can help. Ontology integration is not entirely straightforward, so I'm more than happy to help get the relevant parties pointed in the right direction. You've caught me right as I'm starting to write up what we've done in a higher level way, so all I have at the moment is the flood of information that follows.

The import closure of NIFSTD is quite large, so we tend to provide the whole ontology via SciGraph webservices. I have a python client for accessing them. If it would be helpful I can create a mini file that can be used to import the subset of the ontology that is relevant for NWB.

Our main repo is SciCrunch/NIF-Ontology, with supporting utilities at tgbugs/pyontutils.

Thoughts on enforcing ontologies

I suggest that a reasonable approach would be to include a set of good defaults and also allow organizations or individuals to provide and/or enforce ontologies if it fits their use case. If you provide defaults, very few folks are likely to go out of their way to find a bad ontology. The format probably should not enforce (in the strict sense), since the number of use cases is quite large and terminology changes over time in ways that the format should not be responsible for maintaining. That said, someone validating NWB files should be able to specify which set of ontology identifiers they want used for tagging certain things. Checking can then be automated so the user does not have to wait until they submit a file to get feedback, but gets it immediately via whatever interface they are interacting with when they create the file.
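The last point, that a validator specifies which identifier sets to accept, could look something like this minimal sketch (the field names, prefixes, and `ALLOWED` mapping are illustrative assumptions, not an existing NWB API):

```python
# Hypothetical per-field prefix whitelist a validator might supply.
ALLOWED = {'species': {'NCBITaxon'}, 'brain_region': {'UBERON', 'MBA'}}

def check_field(field, curie, allowed=ALLOWED):
    """Reject a curie whose prefix is not in the allowed set for this field."""
    prefix = curie.split(':', 1)[0]
    if field in allowed and prefix not in allowed[field]:
        raise ValueError(f'{curie} is not from an allowed ontology for {field}')
    return curie
```

Run at file-creation time, this gives the immediate feedback described above without the format itself enforcing anything.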

As an additional note, the number of terms that could potentially need to be used in an NWB file is quite a bit larger than is practical to embed in an application. The most frequently used could be, but the tail is quite long, to the point where I think it is not unreasonable to imagine the format hitting web services for terms, or having a process by which the terms can be imported and updated alongside an installation.

Specific domains.

  1. Species. Yes, use NCBITaxon[0]. It is a huge ontology which you don't want to have to carry around all the time. We use the taxonomy subset from obo, which is still large relative to NWB's needs. I am in the process of reconciling NCBITaxon with our old species ontology, which had more neuroscience specific nomenclature as well as specific strains. Here are some relevant github issues: https://github.com/SciCrunch/NIF-Ontology/issues/70
    https://github.com/SciCrunch/NIF-Ontology/issues/132

  2. Brain regions. For general pan-species brain region concepts use Uberon, or possibly just the nervous system subset (though if you have a user doing experiments on interaction with muscles or the gut you might want the whole thing). It is far and away the most complete, and pretty much all brain region work planned for the future will eventually be mapped to Uberon. For species specific and experimentally relevant brain parcellations I suggest you use the collection we are compiling (happy to answer any questions here). Allen mouse and human are in, as are Paxinos rat and Waxholm rat, with Paxinos mouse and more on the way. Main github issue: https://github.com/SciCrunch/NIF-Ontology/issues/49. The main import for all anatomy related files is https://github.com/SciCrunch/NIF-Ontology/blob/staging/ttl/bridge/anatomy-bridge.ttl

  3. Cell types. As I am sure you are aware there is very little consensus on neuron types, and the state of the ontologies reflects this. I am currently right in the middle of creating and promulgating a python mini language for describing experimentally defined cell types based on their measured phenotypes. All of the terms for the neurons are backed by the NIF ontology. The ontologies that back this scheme are much larger and probably cannot be easily embedded (all of the protein ontology and our chebi subset). At the moment NIFSTD has better coverage of traditional neuron types than most other sources, but I do not recommend using those for tagging data sets since they are much higher level and conflate many different experimentally relevant measurements. There is also the cell ontology, but it does not have good coverage of neuron types.

  4. Strains? Stock numbers/mouse lines? I mention these as things that we have found are relevant and that others have approached us about and that we are in the process of incorporating. See https://github.com/SciCrunch/NIF-Ontology/issues/70#issuecomment-358521887 again.

  5. Methods? Tagging with the techniques, tools, and protocols used to collect data and make later interpretation much easier. I am in the middle of creating these for neuroscience specifically, see this README for an overview of what is out there.

All five of these areas are currently under active development in collaboration with groups at HBP, BBP, and now with Allen as well, so happy to help in any way.

Footnotes

  0. An amusing note is that some of the taxonomists I have met think that NCBITaxon is horrible, but they have completely different use cases than most of the experimental scientific community.

bendichter commented 6 years ago

@tgbugs wow, this is some great info, thanks!

Regarding your thoughts on enforcing ontologies, I think your suggestion to offer good defaults and allow extensions for other ontologies makes a lot of sense.

What do you imagine as the workflow for someone writing an NWB file with ontology integration? Referencing an ontology would presumably require users to enter which ontology they are referencing as well as the exact name or id for the category. Maybe something like:

allen_mouse_ontology = AllenMouseOntology()
uberon = UberonOntology()
...
electrode_region = [allen_mouse_ontology.lookup(abbr='CA1'), uberon.lookup(id='0003881')]

We could create a validation step where the ontology and specifier are checked first against a small local database and then if it's not there, against a remote database via an API. A small set of properties can then be written to the file. Something like id, abbr, and full_name. I think we should also allow the remote API check to be turned off, because I don't want pynwb to require internet access to write files.
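As a sketch of that local-then-remote validation step (all names here, including `LOCAL_CACHE` and `remote_lookup`, are hypothetical, not pynwb API):

```python
# Tiny stand-in for a bundled local database of common terms.
LOCAL_CACHE = {
    ('UBERON', '0003881'): {'id': '0003881', 'abbr': 'CA1',
                            'full_name': 'CA1 field of hippocampus'},
}

def resolve_term(prefix, id, remote_lookup=None, allow_remote=True):
    """Check the small local database first; optionally fall back to a remote API."""
    key = (prefix, id)
    if key in LOCAL_CACHE:
        return LOCAL_CACHE[key]
    if allow_remote and remote_lookup is not None:
        return remote_lookup(prefix, id)  # e.g. a thin HTTP client, injected
    raise KeyError(f'{prefix}:{id} not found locally and remote lookup disabled')
```

Passing `allow_remote=False` is the "no internet required" mode: writing still works, you just lose the remote fallback.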

What would be the workflow for adding a custom ontology? Maybe:

class NewCellOntology(CellOntology):
    def lookup(self, name=None, id=None, abbr=None):
        ...
        return CellOntologicalReference(name=name, id=id, abbr=abbr)

    def lookup_remote(self, name=None, id=None, abbr=None):
        ...
        return CellOntologicalReference(name=name, id=id, abbr=abbr)

I'm hoping we can build this in such a way that users are not bogged down in ontological details, so I'm glad to see there is already an issue looking at incorporating colloquial names.

tgbugs commented 6 years ago

This ended up being quite a bit more elaborate than I expected. Have a read and let me know what you think. As a side note I do not have time at the moment nor in the near future to implement this, but probably will at some point since I have a number of use cases for a system like this as well.

PS I have not run any of the code so there are loads of bugs.

Questions

  1. To what extent do you want/not want to depend on external libraries?
  2. How is electrode_region as defined above formatted during serialization (i.e. is it via a call to __str__ or via some other function)?

Context

I imagine that there are three ways that ontology identifiers could be stored in an NWB file. In my experience the third has a number of advantages.

  1. Store the whole raw URL. Example http://purl.obolibrary.org/obo/UBERON_0000955. This is not fun for users who then have to interact with a bunch of urls.
  2. Store a shortened curied representation of a url in the NWB file and implement the expansion rules in pynwb. Example UBERON:0000955. This is OK but it means that the user cannot customize their prefixes and if they do not have access to pynwb for some reason then they have to guess the expansion rules.
  3. Store a shortened curied representation of a url and the expansion rule in the NWB file. Example: UBERON:0000955 with 'UBERON':'http://purl.obolibrary.org/obo/UBERON_'. This is by far the most portable, extensible, and user-friendly version since all the information is contained in the NWB file itself. As with the ontologies, I imagine that NWB would provide a sane set of defaults that match what is used in the wider community. See our curies file for commonly used community curies (note that only the curies needed for urls used in the NWB file would need to be included, which means the list would be much smaller for most files).
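A minimal sketch of option 3, assuming the prefix map from the example above is stored alongside the data:

```python
# Expansion rules as they might be stored in the file (illustrative).
CURIES = {'UBERON': 'http://purl.obolibrary.org/obo/UBERON_'}

def expand(curie, curies=CURIES):
    """Curie -> full iri using the stored expansion rules."""
    prefix, local_id = curie.split(':', 1)
    return curies[prefix] + local_id

def contract(iri, curies=CURIES):
    """Full iri -> curie, the inverse mapping."""
    for prefix, namespace in curies.items():
        if iri.startswith(namespace):
            return f'{prefix}:{iri[len(namespace):]}'
    raise ValueError(f'no curie prefix registered for {iri}')
```

Because the rules travel with the file, a reader with no access to pynwb can still recover the full urls.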

API

With that context and the caveat that I am not familiar with the API conventions for pynwb here are my thoughts.

There are four major parts that need to be provided by the API, plus a late fifth addition. I will explicate them in order, after a brief justification for the layout. I will end with some concerns about this approach.

  1. Ontology Services
  2. Ontology Query
  3. Ontology Term
  4. Ontology ID
  5. Ontology Curies (late addition so not discussed here, but can elaborate).

Separation of concerns

I think that terms and query need to be distinct because ontologies can change and it is important to have the exact identifier that the investigator used recorded in the python source and reproducible regardless of any changes to what a query would return. There is also the practical fact that queries sometimes return more than one result. In certain use cases this is not a concern and it is entirely reasonable to write code that always queries to get the latest identifier information to embed in the NWB file. In other cases users need/want control over their identifiers. For these reasons I think that the query api and the term api are both needed, and that the query API should transparently return ontology terms in the format of the term API.

Services API

After having gone through this a few times the right approach seems to be to create ontology services that wrap different backends and provide a normalized query interface. Local ontology services can provide loading. For example the rdflib implementation below allows more advanced users to add ontologies via localonts.add('file:///home/user/myontology.ttl', format='turtle'). It shouldn't be too hard to implement ways to ingest other formats for a default local OntService [0]. This outline implementation is missing handling prefixes via OntCuries, but there is a TODO that mentions how to get started implementing it.

Note that the implementation examples use the local and remote backends that I am most familiar with, but it is possible to use other backends.

class OntService:
    """ Base class for ontology wrappers that define setup, dispatch, query,
        add ontology, and list ontologies methods for a given type of endpoint. """
    def __init__(self):
        self._onts = []
        self.setup()

    def add(self, iri):  # TODO implement with setter/appender?
        self._onts.append(iri)

    @property
    def onts(self):
        yield from self._onts

    def setup(self):
        raise NotImplementedError()

    def dispatch(self, prefix=None, category=None):  # return True if the filters pass
        raise NotImplementedError()

    def query(self, *args, **kwargs):  # needs to conform to the OntQuery __call__ signature
        raise NotImplementedError()

class SciGraphRemote(OntService):  # incomplete and not configurable yet
    def add(self, iri):
        raise TypeError('Cannot add ontology to Remote')

    def setup(self):
        self.sgv = scigraph_client.Vocabulary()
        self.sgg = scigraph_client.Graph()
        self.sgc = scigraph_client.Cypher()
        self.curies = self.sgc.getCuries()  # TODO can be used to provide curies...
        self.categories = self.sgv.getCategories()
        self._onts = self.sgg.getEdges(relationType='owl:Ontology')  # TODO incomplete and not sure if this works...

    def dispatch(self, prefix=None, category=None):  # return True if the filters pass
        # FIXME? alternately all must be true instead of any being true?
        if prefix is not None and prefix in self.curies:
            return True
        if category is not None and category in self.categories:
            return True
        return False

    def query(self, *args, **kwargs):  # needs to conform to the OntQuery __call__ signature
        # TODO
        pass

class InterLexRemote(OntService):  # note to self
    pass

class rdflibLocal(OntService):  # recommended for local default implementation
    graph = rdflib.Graph()
    # if loading the default set of ontologies is too slow, it is possible to
    # dump loaded graphs to a pickle gzip and distribute that with a release...

    def add(self, iri, format):
        self.graph.parse(iri, format=format)

    def setup(self):
        pass  # graph added at class level

    def dispatch(self, prefix=None, category=None):  # return True if the filters pass
        # TODO
        raise NotImplementedError()

    def query(self, *args, **kwargs):  # needs to conform to the OntQuery __call__ signature
        # TODO
        pass

Users using the defaults would never have to deal with this. Looking at this it does seem like it might make sense to decouple the curies implementation into OntCuries or something like that, can discuss in more detail if the overall approach seems tractable.

Query API

query(term='brain') -> OntTerm('UBERON:0000955', label='brain')

I prefer a unified query interface which can manage ontology discovery for the user rather than requiring explicit import and instantiation of individual ontology query classes. This is aided by the fact that OntServices can provide OntQuery with the information that it needs to do this. OntQuery can also provide an underlying interface on which to build the individual ontology query functionality if so desired.

I think the keywords can be used to enable a wide range of query options (for simplicity sake I'm basically matching the functionality that SciGraph already provides).

class OntQuery:
    def __init__(self, *services, prefix=None, category=None):  # services from OntServices
        # check to make sure that prefix is valid for the ontologies
        # more config
        self.services = services

    def __iter__(self):  # make it easier to init filtered queries
        yield from self.services

    def __call__(self,
                 term=None,      # put this first so that the happy path query('brain') can be used, matches synonyms
                 prefix=None,    # limit search within this prefix
                 category=None,  # like prefix but works on predefined categories of things like 'anatomical entity' or 'species'
                 label=None,     # exact matches only
                 abbrev=None,    # alternately `abbr` as you have
                 search=None,    # hits a lucene index, not very high quality
                 id=None,        # alternately `local_id` to clarify that
                 curie=None,     # if you are querying you can probably just use OntTerm directly and it will error when it tries to look up
                 limit=10):
        kwargs = dict(term=term,
                      prefix=prefix,
                      category=category,
                      label=label,
                      abbrev=abbrev,
                      search=search,
                      id=id,
                      curie=curie)
        # TODO? this is one place we could normalize queries as well instead of having
        # to do it for every single OntService
        out = []
        for service in self.services:
            if service.dispatch(prefix=prefix, category=category):
                # TODO query keyword precedence if there is more than one
                for result in service.query(**kwargs):
                    out.append(OntTerm(query=service.query, **result))
        if len(out) > 1:
            for term in out:
                print(term)
            raise ValueError('More than one result')
        else:
            return out[0]

Examples

query = OntQuery(localonts, remoteonts1, remoteonts2)  # provide by default maybe as ontquery?
query('brain')
query(prefix='UBERON', id='0000955')  # it is easy to build an uberon(id='0000955') query class out of this
query(search='thalamus')  # will probably fail with many results to choose from
query(prefix='MBA', abbr='TH')

uberon = OntQuery(*query, prefix='UBERON')
uberon('brain')  # -> OntTerm('UBERON:0000955', label='brain')

species = OntQuery(*query, category='species')
species('mouse')  # -> OntTerm('NCBITaxon:10090', label='mouse')

If OntQuery is implemented in this way one thing that it must do is fail loudly when it gets more than one result so that the user can select which term they want. That failure will need to provide them with the options to choose from, probably formatted as OntTerm('UBERON:0000955', label='brain') etc so that they can just paste in the result and not worry about it.

It might be possible to use a ranking of preferred prefixes based on some additional criteria for users that didn't want to specify prefix='MBA'. For example, if a user had already specified that they were working in mouse via query(term='mouse') -> OntTerm(curie='NCBITaxon:10090'), brain region queries would rank MBA more highly than uberon, so that a query of query('thalamus') would fail but return OntTerm('MBA:549', label='Thalamus (mba)') as the first term.
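That prefix ranking could be as simple as a sort keyed on a context-dependent preference list (the function name and default preferences here are illustrative, not part of any proposed API):

```python
def rank_results(results, preferred=('MBA', 'UBERON')):
    """Order candidate curies by a context-dependent prefix preference.

    `preferred` would be derived from earlier context, e.g. querying 'mouse'
    promotes MBA ahead of Uberon; unknown prefixes sort last.
    """
    order = {prefix: i for i, prefix in enumerate(preferred)}
    return sorted(results, key=lambda curie: order.get(curie.split(':', 1)[0], len(order)))
```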

Notes on SciGraph

Term API

This is basically just a lightweight wrapper around an OntID that provides pretty printing of additional properties by using the query interface.

OntTerm(curie='UBERON:0000955', label='brain', definition='pile of meat that thinks')  # not an actual definition

Tunable verbosity on __repr__ might be useful here; it could include any keywords that could be mapped to ontology predicates. For example one output level could repr as OntTerm('UBERON:0000955', subClassOf='UBERON:0000062').

class OntTerm(OntID):
    # TODO need a nice way to pass in the ontology query interface to the class at
    # run time to enable dynamic repr if all information did not come back at the same time
    def __init__(self, query=None, **kwargs):  # curie=None, prefix=None, id=None
        self.kwargs = kwargs
        if query is not None:
            self.query = query
        super().__init__(**kwargs)

    # use properties to query for various things to repr

    @property
    def subClassOf(self):
        return self.query(self.curie, 'subClassOf')  # TODO

    def __repr__(self):  # TODO fun times here
        pass

Example

brain = OntTerm(curie='UBERON:0000955')
brain.subClassOf  # -> OntTerm('UBERON:0000062', label='organ')

Ontology ID API

OntID('UBERON:0000955') or OntID('http://purl.obolibrary.org/obo/UBERON_0000955') or OntID(prefix='UBERON', id='0000955'). If you want to live dangerously the simplest API is the bare string representation (I have done this and it leads to pain). This provides a light wrapper around the bare string representation and a way to interconvert and combine the various representations. This could interact with the OntCuries class mentioned above to check the shortened names. OntCuries would also be needed for OntID and OntTerm to str() to the desired format if NWB is serialized the way I think it is, but I could very well be wrong because I don't have the full details (see question 2).

class OntID(rdflib.URIRef):  # superclass is a suggestion
    def __init__(self, curie_or_iri=None, prefix=None, id=None, curie=None, iri=None, **kwargs):
        # logic to construct the iri, or expand a curie to an iri, or just be an iri
        super().__init__(iri)

This allows construction via curie or iri without the user having to fight the API if they need to interact directly.

Alternatives. Using an rdflib namespace it is possible to enter an uberon identifier as follows: UBERON['0000955']. For smaller ontologies it is possible to use closed namespaces to automatically verify that the identifier is valid. Unfortunately python identifiers cannot start with a digit, so the bare-word form UBERON.0000955 won't work; the other option is to allow users to enter the curied form as a string, 'UBERON:0000955'. String representations of curies that are expanded to validate basic correctness are what I have done with neuron lang, and it works quite well because many identifier sources provide a curied representation which can be pasted in. See [1] for elaboration on this approach.
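To illustrate the closed-namespace idea without depending on rdflib, here is a toy version (this is not rdflib's ClosedNamespace, just the concept, and the registered terms are illustrative):

```python
class ClosedNamespace:
    """Toy closed namespace: only identifiers registered up front are allowed,
    so typos are caught at entry time rather than at query time."""
    def __init__(self, prefix, namespace, terms):
        self.prefix = prefix
        self.namespace = namespace
        self.terms = set(terms)

    def __getitem__(self, local_id):
        if local_id not in self.terms:
            raise KeyError(f'{self.prefix}:{local_id} is not a known identifier')
        return self.namespace + local_id

UBERON = ClosedNamespace('UBERON', 'http://purl.obolibrary.org/obo/UBERON_',
                         terms=['0000955', '0000062'])
```

With this, `UBERON['0000955']` expands to the full iri while `UBERON['9999999']` fails immediately.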

Concerns

My primary concern with this approach is how to communicate to the user what ontologies are available by default. Discoverability is not easy in this context. One potential solution would be to have a method OntQuery.ontologies that returned the list of prefixes, or even have OntQuery auto generate subclasses of itself that could be used like OntQuery.UBERON(''). I think this needs a bit more thought if you want to go this route.

rdflib can be quite slow to load large graphs on a stock cpython interpreter, but if you can use pypy3 it is quite a bit faster. Not sure if this is relevant but thought I would mention it.

Footnotes

  0. There is a much longer discussion that could be had about what to do with user supplied ontologies that do not have uris (the InterLexRemote points to one potential solution).
  1. Some thoughts on what OntCuries could provide.

Curies can also be created against a full url, so for example a user that is tired of typing 'UBERON:0000955' over and over could define a curie 'brain':'http://purl.obolibrary.org/obo/UBERON_0000955' and then just use 'brain:' in the file where they need it e.g. as

electrode_location = ['brain:']

A more compact way to define user specific curies would be

defLocalName('brain', 'UBERON:0000955')

or even

defLocalName('brain', OntQuery(prefix='UBERON', label='brain'))

To take this approach to API building to, shall we say, non-pythonic ends: the full extension of this is to take full url curies and turn them into python identifiers like I do in phenotype namespaces for neuron lang (implemented here). This would let a user write

electrode_location = [ontNames.brain]

or

with myOntologyNames:
    electrode_location = [brain]

or

setLocalNames(myOntologyNames)
electrode_location = [brain]

This introduces more complexity than it is worth, because it is basically just a check on whether the identifier has been defined, and it only saves 3 chars (' ' and ':') of typing on average. The cognitive overhead of working this way may also not be worth it, since the implementation needed to pull it off does nasty things like inspecting the stack to set and unset/restore globals, which can cause confusing bugs.

bendichter commented 6 years ago

First, to answer your questions:

  1. Right now the external dependencies are pretty light, only including libraries that come by default with Anaconda; however, this discussion is quickly getting complicated enough that an external library might be more appropriate.
  2. Our primary backend right now uses hdf5, which gives us a lot of flexibility for how we write these terms to disk. I think I'd like to start by figuring out what attributes we need to store and what a good workflow would look like, then choose the best way to store the data.

It seems to me we can break this conversation into two interrelated but separable questions:

  1. How does a user find the appropriate term from across several ontology databases?
  2. How do we store a term reference in NWB?

Your system for 1 seems really well thought out. I like the idea of having a service framework capable of searching across ontology databases with a single query, and the keyword argument parameters would allow a user to easily hone in on the relevant terms. This API system would probably be best as a stand-alone project: it would require a decent amount of maintenance, with API calls to multiple databases, and other projects would clearly benefit from it without it making sense to force them to import pynwb as well. If this is a stand-alone project that others can use, it will also make NWB more interoperable with those projects.

So for 2, if I understand, we need to store the following attributes for OntTerms:

  1. the database used
  2. the curie rule of the database used
  3. the id (curied)
  4. label

Is that it?

I think the ideal implementation for NWB would be to be interoperable with the querying package but not strictly depend on it, since we want to keep strict dependencies pretty light. To accomplish that, we could make an OntTerm object in NWB that can be initialized by converting your OntTerm:

this_term = pynwb.OntTerm(NeuroOnt.OntTerm('UBERON:0000955', label='brain'))

This would require NeuroOnt.OntTerm (the package name is a placeholder) to have the 4 attributes above. I think this would be the easiest way to add an entry, given that the search API you outline works as intended. Maybe NWB could store curie rules for a set of databases?

Another option for the user would be to input the info manually:

this_term = pynwb.OntTerm(database='Uberon', id='0000955', label='brain', curie_func=curie_func)

which they could technically do without NeuroOnt. One tricky thing here is I don't think hdf5 provides an easy way to store functions. Maybe we could pickle it and store it that way?
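Rather than pickling a function, one option (a sketch with illustrative names, not pynwb API) is to store the curie rule itself as a plain string attribute; all four attributes above then become hdf5-friendly strings, and the expansion can be rebuilt at read time:

```python
def write_attrs(term):
    """Flatten an ontology term to plain string attributes (hdf5-friendly)."""
    return {'database': term['database'],
            'curie_rule': f"{term['prefix']}={term['namespace']}",
            'id': f"{term['prefix']}:{term['local_id']}",
            'label': term['label']}

def read_iri(attrs):
    """Rebuild the full iri from the stored curie rule and curied id."""
    prefix, namespace = attrs['curie_rule'].split('=', 1)
    _, local_id = attrs['id'].split(':', 1)
    return namespace + local_id
```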

Maybe it would make sense to store a few of the most common terms like "mouse" and "CA1" which would simply require:

this_term = pynwb.OntTerm('mouse')

oruebel commented 6 years ago

@tgbugs @bendichter thanks for the interesting pointers and discussion. Let me add @lydiang to this thread. We have been discussing the topic of supporting ontologies on several occasions and it is certainly part of our future roadmap for NWB. I agree that this is an important topic, but due to the complexity and effort required, I believe we will not be able to resolve this issue in time for NWB 2.0; it will need to be a new feature for 2.x. I would suggest that we discuss this topic with the TAB and others at the upcoming Allen Hackathon.

tgbugs commented 6 years ago

Please let me know if my thinking and explanations here make sense.

Apologies if you already know about some of the things I discuss below. I have spent long enough being confused about ontology ids that I try to spare others the pain if at all possible.

Summary.

  1. The need to validate ontology identifiers means that completely decoupling ontology query from the pynwb core is unlikely to be possible because calls to pynwb.OntTerm need to check their arguments after they have been passed in.
  2. If pynwb can't depend on external packages it will probably have to implement much of the functionality itself and then expend maintenance effort keeping things in sync.

@oruebel This definitely introduces complexity, but maybe not as much as it seems (having been thoroughly nerd sniped by @neuromusic (notch one up) I may just implement this this weekend). However, I suspect that in light of the issues in my summary it might force the question of whether/how the pynwb core will depend on non-conda-core libraries. Not knowing your internal processes or timelines, I imagine that could delay things quite a bit even if all the code were already written.

@bendichter

Questions

  1. Do users ever look at the raw hdf5 loaded via h5py?
  2. Are non-python implementations of nwb possible?
  3. A clarifying question about python -> hdf5 conversions: how would you serialize something like myid = OntId('id:123') to hdf5 if we think of it as a string + a type?

Thoughts

I think that there is a third issue of how to validate ontology identifiers entered into an NWB file which does not fit nicely into either querying or stored representation.

Validation

Validation could be part of querying but if querying remotes is completely removed from the NWB core then validation has to be considered as a separate issue.

Basically if you want validation for your ontology terms pynwb.OntTerm has to have a way to hook into a validation service after arguments have been passed to it.

Unfortunately this means that a call like

this_term = pynwb.OntTerm(NeuroOnt.OntTerm('UBERON:0000955', label='brain'))

is not a viable solution. If pynwb does not incorporate the machinery to validate terms itself then it has to trust that the user is giving it good data :( I do not see a way around this.

Validation for ontology terms usually amounts to picking a set of ontologies that an iri must exist in and then asking 'Does this iri exist in the graph defined by those ontologies?' and, if yes, 'Does it match what I was expecting?' The second part has to be left to the user, but we can warn them if something does not match or error if the term was unvalidated.

I propose a solution below in implementation details which is to flag terms that have not been validated so that they can be checked next time the NWB file is loaded with access to the defined set of ontologies. With a caveat that a locally defined ontology probably should not count (things can be done to make it easier to map local nomenclature to more formal terms).
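The flag-and-revalidate idea could be sketched like this, where `KNOWN_IRIS` stands in for the graph defined by the chosen set of ontologies (all names here are illustrative):

```python
# Stand-in for the set of iris present in the configured ontologies.
KNOWN_IRIS = {'http://purl.obolibrary.org/obo/UBERON_0000955'}

def validate(term, known_iris=KNOWN_IRIS):
    """Return a copy of the term with its 'validated' flag set.

    Terms entered without access to the ontology set stay flagged False and
    can be rechecked the next time the file is loaded with access to them.
    """
    term = dict(term)
    term['validated'] = term['iri'] in known_iris
    return term
```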

Querying

Querying is not as complex as I made it out to be. It only needs two levels: local and remote. In an abstract sense the remote is what defines the set of valid ontology identifiers for a given NWB file, so the idea that there is more than one remote is an implementation detail and almost never happens in practice. An implementation needs one local and one remote. If the user wants to use any and all possible terms, including the garbage ones, then they fall through to bioportal.

Determining which ontologies will be supported and providing infrastructure for hosting the remote is a separate concern.

Choosing the set of sane defaults can then be dealt with as two questions.

  1. What ontologies or subsets of ontologies should be distributed with a release for the local?
  2. What is the minimal set of ontologies that should be loaded on the remote for it to be considered to be NWB compliant?

This does not mean that some facet of NWB would have to actually run one of those remotes (unless it wanted to). More that NWB could provide a strong set of recommendations, perhaps bordering on "If you don't include these ontologies we cannot list you in our communal list of remote providers because there are NWB files that use terms from those ontologies and you must provide data for them."

To this point, I run SciGraph as a 'remote' service for the NIF ontology on all my development machines (especially my laptop) because it makes using the ontology easier.

Decoupling?

I agree that a standalone query plugin would be useful to many projects. In addition I think it would be equally useful to have a terminology creation/registration plugin for cases where users cannot find a match, so that they can get identifiers for their terms at data collection time or even at schema specification time (full disclosure: this is one potential use case for InterLex). See footnote [1] for more on this.

However, despite my initial thought that complete decoupling was possible, having thought more about it and confronted the issues with validation, I'm not sure that it is, assuming pynwb wants to have ontology support beyond entering raw OntIDs in the decoupled state. The functionality mentioned for storing and accessing the most common terms a la OntTerm('mouse:') will require replicating most of the API, albeit in a stripped down form. The default functionality would allow users to enter terms, but they would all be flagged as unvalidated; it would also allow other users to view terms in NWB files that they open, but only the minimal metadata associated with the term.

If no local functionality is desired, I think the solution is equivalent to what is needed to support the desired representation in hdf5. This could be as small as having class OntID(str): pass, which only accepts full iris (since str is immutable, any construction-time checks belong in __new__, not __init__). In the middle, with curies: OntID and OntCuries, but no human readable terms. Or maximally: OntID, OntCuries, and OntTerm. OntTerm would accept everything but flag it as unvalidated because it had not been passed through a query service. In order to validate terms, at some point OntTerm will have to implement a way (via __new__?) to pass in something that supports queries, preferably in a transparent way that requires no intervention by users.
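
For reference, a minimal OntID sketch; because str is immutable the construction hook is `__new__`, not `__init__` (the iri check shown is an assumption about what "only accepts full iris" would mean):

```python
# Minimal sketch of OntID as a str subclass that only accepts full iris.
class OntID(str):
    """An ontology identifier: a str that must look like a full iri."""
    def __new__(cls, iri):
        # str is immutable, so validation has to happen in __new__
        if not iri.startswith(('http://', 'https://')):
            raise ValueError(f'{iri!r} is not a full iri')
        return super().__new__(cls, iri)
```

Anywhere a plain string is expected an OntID still works, which keeps the hdf5 serialization trivial.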

As mentioned above, external queries cannot return values into pynwb.OntTerm; pynwb.OntTerm must make use of external query services to validate the identifiers it receives, and fail loudly if a user provided label does not match on an unvalidated term (very important for avoiding off by 1 errors). It also needs to warn if a validated term has changed.

The coordination issue here, if pynwb.OntTerm is maintained without depending on an external package, will be to make sure that pynwb.OntTerm.query stays in sync with the implementation of ontquery.OntTerm.query so that it can issue validation queries without users having to intervene and change code.

If ontology functionality is desired in the decoupled state, then pynwb will need to replicate OntID, OntTerm, OntCuries, OntQuery, and OntService. It might be possible to stuff OntService into OntQuery and rename it to something else, but all the functionality will still need to be there.

I see two options in light of this.

  1. Depend on something outside conda core that can be used as a point of synchronization
  2. Implement the stripped down version of all of these and keep their APIs in sync manually.

Other solutions either require the user to rewrite their code when they want additional functionality, or they require one side to manually keep everything in sync. Other options, like just keeping a dictionary around with local terms will lead to nasty technical debt (as I have discovered having made that mistake myself).

Therefore I think a standalone repo that has no dependencies might make the most sense, so that pynwb only has to depend on one thing outside conda core and others don't have to depend on pynwb. If that is still too much, the other way to implement this for pynwb would be as a git submodule or something similar. In this minimal ontquery implementation the only subclass of OntService would be BasicService, implemented below.

class Graph():
    """ I can be pickled! And I can be loaded from a pickle dumped from a graph loaded via rdflib. """
    def __init__(self, triples=tuple()):
        self.store = triples

    def add(self, triple):
        self.store += (triple,)

    def subjects(self, predicate, object):  # this method by itself is sufficient to build a keyword based query interface via query(predicate='object')
        for s, p, o in self.store:
            if (predicate is None or predicate == p) and (object is None or object == o):
                yield s

    def predicate_objects(self, subject):  # this is sufficient to let OntTerm work as desired
        for s, p, o in self.store:
            if subject is None or subject == s:
                yield p, o

class BasicService(OntService):
    """ A very simple service for local use only """
    graph = Graph()
    predicate_mapping = {'label': 'http://www.w3.org/2000/01/rdf-schema#label'}  # more... from OntQuery.__call__ and can have more than one...
    def add(self, triples):
        for triple in triples:
            self.graph.add(triple)

    def setup(self):  # inherit this as `class BasicLocalOntService(ontquery.BasicOntService): pass` and load the default graph during setup
        pass

    def query(self, iri=None, **kwargs):  # right now we only support exact matches to labels
        if iri is not None:
            yield from self.graph.predicate_objects(iri)
        else:
            for keyword, object in kwargs.items():
                predicate = self.predicate_mapping[keyword]
                yield from self.graph.subjects(predicate, object)

        # Dispatching as described previously is dispatch on type, where the type is the set of query
        # features supported by a given OntService. The dispatch method can be dropped from OntQuery
        # and managed with python TypeErrors on kwarg mismatches to the service `query` method
        # like the one implemented here.

Stored representation

My summary is to store OntTerm('id:123', label='readable', unvalidated=True): namely the iri and the label, with unvalidated terms flagged and the curies stored independently. This is sufficient for users with installations that do not make use of a query plugin, and can be validated by those that do.

  1. You do not need and do not want the database used. Or rather, there are no databases, only urls/iris (and literals), which are nodes in a big graph of indeterminate size. First, curie prefixes like UBERON are namespaces, not databases, and are often generated automatically as ns1:0000955. See [2] for more on ontology identifiers. Second, if you think of places that let you search the subgraph for your ontology ids as databases, then we also don't want to store that information, because it will inevitably change and should be handled by the NWB implementation. See [3] for more on the datastore aspect.
  2. Yes. In theory you can store as many sets of curie rules as you want so long as the identifiers using those curie rules can be mapped back to that set of rules. In practice I imagine that only a single set of rules would ever be used, but because ontology identifiers are not the default data type in NWB (they are in rdf) any implementation that works for a single set of rules should work for multiple sets of rules.
  3. Not quite. As mentioned in [3], all ontology identifiers that NWB will ever see are fully expanded iris. When they have a string type in python they look like 'http://purl.obolibrary.org/obo/UBERON_0000955'. They do not have to be serialized this way, but if they are not serialized as full iris they must be able to be expanded back into the full iri representation. If a curie rule exists then the id can be shortened, but there are cases where there may not be a curie rule, in which case the iri would be stored. This is not really an issue because ontology identifiers need to be typed as such anyway. A corollary of this is that NWB would not need to store the curie rules for the databases, because all the databases should provide a way to query using the iri alone (since it is the real identifier). The only reason to store curie rules is that they are conventionally used to make identifiers a bit more manageable for humans to work with.
  4. I think the label is the minimal amount of information needed for human readability, so it is probably worth storing it as 'cached' information to aid human readers. Ontology labels don't usually change, but they can. There are very few cases within a single domain where a term needs more than its label to be unambiguous.
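
Putting points 1-4 together, a hypothetical flat representation might look like this (the `to_stored` helper and attribute names are illustrative, not nwb-schema fields):

```python
# Sketch of the proposed stored representation: iri, cached label, and an
# `unvalidated` flag, with curie rules kept separately from the terms.
CURIES = {'UBERON': 'http://purl.obolibrary.org/obo/UBERON_'}  # stored once per file

def to_stored(curie_or_iri, label, validated=False):
    """Expand a curie if a rule exists and build the flat attribute dict."""
    prefix, _, suffix = curie_or_iri.partition(':')
    iri = CURIES[prefix] + suffix if prefix in CURIES else curie_or_iri
    attrs = {'iri': iri, 'label': label}
    if not validated:
        attrs['unvalidated'] = True  # flag for the next load with query support
    return attrs
```

The iri remains the real identifier; the label is only a human-readable cache, and anything without a curie rule is stored as the full iri.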
Implementation details

First, a clarification about how I imagine OntTerm being used. The only meaningful value associated with an OntTerm is its iri (uri in my previous comment). In my examples I was showing how OntTerm should repr. If you call the repr'd form, it will discard everything except the kwargs that are passed to OntID for init, and then use the iri to query the local or remote for additional values mapped to kwargs.

If the iri cannot be found via available services then the values will be preserved, and we can add a flag like this_term.unvalidated = True or something, so that versions of pynwb with query support will know to check whether it is valid and raise warnings if they cannot find the term in the defaults etc.

In general the interface to pynwb.OntTerm() should be the same regardless of whether the query plugin is loaded or not so that users don't have to refactor when they add an enhancement and due to the need to validate terms.

Given the issues discussed above with validation,

this_term = pynwb.OntTerm(NeuroOnt.OntTerm('UBERON:0000955', label='brain'))

can be reduced to

pynwb.OntCurie('UBERON', 'http://purl.obolibrary.org/obo/UBERON_')  # could be loaded by default

# ...

this_term = pynwb.OntTerm('UBERON:0000955')

or just

this_term = pynwb.OntTerm('http://purl.obolibrary.org/obo/UBERON_0000955')
# or in the minimalist case
this_term = pynwb.OntID('http://purl.obolibrary.org/obo/UBERON_0000955')

if the query plugin is loaded, or if uberon's brain is included in the minimal local set then the following would be possible

In  [1]: repr(this_term) 
Out [1]: OntTerm('UBERON:0000955', label='brain')

if a query plugin is NOT loaded and there is no local store

In  [1]: pynwb.OntTerm('UBERON:0000955', label='not actually the brain')
Out [1]: OntTerm('UBERON:0000955', label='not actually the brain', unvalidated=True)

is what would happen. This would also happen if there were a small local store and the user sets the OntTerm acceptance level to 'open' (open world) or similar, so that pynwb doesn't error on no match. If it is set to 'closed' and uberon's brain is in the local store, it will behave the way this example does when validation is run, which is that it should (probably) raise a TypeError, or a ValueError stating "Unvalidated label 'not actually the brain' does not match 'brain' on 'UBERON:0000955'".
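
A sketch of that 'closed' behaviour, with `LOCAL_LABELS` standing in for the minimal local store (the names and return conventions are assumptions):

```python
# Sketch: 'closed' world label validation against a minimal local store.
LOCAL_LABELS = {'UBERON:0000955': 'brain'}

def validate_label(curie, label):
    """Return True if validated, False if unknown locally, raise on mismatch."""
    expected = LOCAL_LABELS.get(curie)
    if expected is None:
        return False  # unknown locally; term stays flagged as unvalidated
    if label != expected:
        raise ValueError(f"Unvalidated label {label!r} does not match "
                         f"{expected!r} on {curie!r}")
    return True
```

In 'open' mode the mismatch branch would warn instead of raising.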

Footnotes

  1. The complement to querying for existing terms is creating (or registering) new terms when users cannot find the term they are looking for. pynwb sits in an ideal spot to empower users to do this. At the moment this is out of scope, but once ontology support is in, it is the logical next step. I bring this up to mention that supporting remote infrastructure for adding new terms is much more complicated than hosting existing terms, and in our experience many organizations that initially want formal terminology support also end up wanting the ability to add their own terms, as a way to regularize the terminology used within their organization and with collaborators from outside.
  2. Iris do not have meaning to the information system beyond their use as a unique identifier, and curies do not have meaning beyond their expansion rule. For example, Uberon is the name of the project for the ontology; all the identifiers happen to have UBERON_ included due to obo conventions. You only need the ontology identifier, which acts as a globally unique primary key for that term (though there may be many different copies of the records attached to it). To reiterate, in a sense the whole point of web ontologies is for there not to be a single database, just a bunch of iris that can have data 'attached' to them by virtue of being only a single edge away from them in the graph. If the iri resolves to something then that is considered the canonical source, but I can take that iri and change my /etc/hosts file and have a completely different view of that iri. So for example you don't have to know that uberon exists at all in order to make use of its terms through the NIF ontology, and you don't have to go to uberon in order to get data about uberon identifiers. For example, if I scrambled all the urls to look like 'http://a8f3.v1qf/d507fd416cfa6c0520b7b1b59b8ee5c83f5ec475' they would still be valid, and I could shorten them to 'ONT:d507fd416cfa6c0520b7b1b59b8ee5c83f5ec475'. Any database can use those strings of gobbledygook as primary keys and return information associated with them (it just so happens that humans are weak and cannot remember 128 digit hexadecimal numbers, so we use urls that are more readable, but to the information system they are just as opaque as a sha1sum).
  3. Said another way, the NWB file should not keep track of the database it got something from, because there is a good chance that that database will cease to exist, but there will be many others that can resolve that identifier. The NWB implementation is where the information about the database should be stored. I think I may have caused this confusion by implying that there could be multiple databases, but that is just an implementation detail. Since ontology identifiers are globally unique, the database or set of databases you query for a term just affects the portion of all the data associated with it from across the entire internet that you see. For essentially all ontologies that are relevant for neuroscience that identifier is an iri (url). That iri should be actionable/resolvable/searchable independent of any database hosting the content attached to it. How those identifiers are resolved/expanded can then be NWB implementation dependent. Therefore, even if a url itself would 404, the NWB implementation can intercept it and redirect to a local copy of the record attached to it. One of the other things that I work on is how to deal with those 404ing identifiers so that users like NWB don't have to worry about it. As a result, the implementation dependent resolving behavior is actually more important for use cases where an organization has attached additional information to that identifier, for example which NWB files created by that organization have used that ontology identifier, or where to find the stock of that reagent in the lab.
oruebel commented 6 years ago

@tgbugs I have not had a chance yet to carefully read all of your last comment, but let me quickly answer your questions.

Do users ever look at the raw hdf5 loaded via h5py?

Currently I think the answer is yes, although not necessarily using h5py. E.g., some folks are using HDFView to explore the low-level structure of the format in HDF5. As folks get more familiar with the format and are using it for analysis (rather than trying to develop converters) I think this will change, but my guess is that for debugging purposes folks will sometimes use low-level tools (e.g., h5py).

Are non-python implementations of nwb possible?

Yes. The core concepts of NWB like the specification language are programming-language agnostic, and HDF5 is available for most popular programming languages as well (C, C++, Fortran, Matlab, R, Python, ...). In terms of APIs, e.g., there is also the MatNWB Matlab API that is in active development.

How would you serialize something like myid = OntId('id:123') to hdf5?

There are a couple of ways you could do this: a) as a plain string, b) as a YAML or other document string, c) as a series of attributes/datasets that would describe the id, type etc. Generally, options b) or c) are best because they are more self-describing. The advantage of c) is that you can describe it easily in NWB-schema without introducing a custom "sub-format", but depending on use I think b) can be a good option too.
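
As a sketch of option b), a small self-describing document string could be built with the standard library (json is used here in place of YAML to stay dependency-free; the `type` and field names are assumptions, not nwb-schema):

```python
# Sketch of option (b): serialize the id as a self-describing document
# string that could live in a single HDF5 attribute or dataset.
import json

def ontid_to_doc(iri, label=None):
    doc = {'type': 'OntId', 'iri': iri}
    if label is not None:
        doc['label'] = label  # cached human-readable label
    return json.dumps(doc, sort_keys=True)

def ontid_from_doc(doc):
    data = json.loads(doc)
    assert data['type'] == 'OntId'  # guard against reading the wrong field
    return data['iri'], data.get('label')
```

Option c) would flatten the same fields into separate attributes/datasets described directly in the schema.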

I'll have a closer look at this thread later this week. @krisbouchard @ajtritt @lydiang could you please also have a look at this issue.

tgbugs commented 6 years ago

I have a basic implementation up and working at https://github.com/tgbugs/ontquery though at the moment to get any real functionality it still depends on https://github.com/tgbugs/pyontutils.

I have been dogfooding different ways of using it, some of which can be seen here https://github.com/tgbugs/pyontutils/blob/master/pyontutils/methods.py#L61. The actual workflow of looking up terms using OntTerm and replacing the code with a version that has an identifier can't be seen in the file as it is now because the errors would prevent it from running.

The way I incorporate the functionality into pyontutils can be seen here https://github.com/tgbugs/pyontutils/blob/master/pyontutils/core.py#L744-L750.

bendichter commented 6 years ago

@tgbugs This is great, Tom! I'll test this out and let's work together to make this integrate smoothly with NWB in the future. I see you have used

def __setitem__(self, key, value):
        raise ValueError('Cannot set results of a query.')

to enforce the validation against a database. That could work, but we have to be extra careful that the package works as intended, because we've taken away some of the ability of a user to go in and fix it. I suppose a truly stubborn user might get around this by cloning ontquery and removing those lines before installing, but that's a sufficient barrier that it would probably be easier to just use the package as intended.

Let's move discussion over to the ontquery issues page.

petersenpeter commented 6 years ago

Here is the json file describing the brain structure graph used for the mouse Allen atlas: http://api.brain-map.org/api/v2/structure_graph_download/1.json

There is more information on the Allen brain atlas ontologies here: http://help.brain-map.org/display/api/Atlas+Drawings+and+Ontologies#AtlasDrawingsandOntologies-StructuresAndOntologies

tgbugs commented 6 years ago

I pull the Allen into the NIF Ontology here. Working on a new version of the ingest that automatically does them all in one shot.

bendichter commented 4 years ago

@satra do you think we might be able to use the stuff @tgbugs was developing here for our current project?

satra commented 4 years ago

@bendichter - i'm sure we can use anything that has been developed. no reason to reinvent wheels ;)

i do think there needs to be an interface that bridges between the experiment and the nwb files, since in many cases the metadata will be consistent across nwb files from a given experiment. so any effort should focus on file specific metadata vs across files metadata.

yarikoptic commented 1 year ago

Documenting for myself