Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Support sub-databases within each database #46

Closed giovannipizzi closed 5 years ago

giovannipizzi commented 6 years ago

This issue suggests a possible implementation of the concept of sub-databases within a database provider registered in the OPTiMaDe specification.

This is useful for database providers (like e.g. Materials Cloud) where there isn't necessarily a "main" database, but many databases, all provided by the same database provider (Materials Cloud, in this case) and exposing the same API, but managed independently by different researchers (as an analogue, imagine GitHub repositories, where each user manages their own repository, but the API for all of them is provided by GitHub itself).

While this could be achieved by implementing the current API for each sub-DB, this would require registering each of them in the current specification (which, at the bottom, lists DB providers and their URLs), and this is not a viable approach for cases where the sub-DBs within one registered DB provider can be added and removed by users at any time.

This proposal has the following aims:

As side-effect:

Proposed implementation

  1. Implement the same API (the one already defined by OPTiMaDe) for each sub-DB, where each sub-DB has a different base_url, for instance:

    (versions might be different for each DB - no requirement here). It is up to the implementation to decide the best approach to reuse the API code.

  2. In the OPTiMaDe API specification Appendix, where now we list only a prefix with a generic name of the DB, e.g.:

    _exmpl_: used for examples, not to be assigned to a real database

    replace this with a common, well-defined API endpoint, e.g. http://www.example.com/optimade-discovery/, like

    _exmpl_: http://www.example.com/optimade-discovery/ (used for examples, not to be assigned to a real database)

    whose format we describe below. Ideally, I think we should also add a machine-readable list for automated discovery.

  3. The http://www.example.com/optimade-discovery/ endpoint returns generic information on the given DB provider and its sub-DBs, and in the future it can also be extended (this is e.g. a good place to add siblings as discussed in #22). Moreover, it can also contain DB-specific extensions using the prefix, as already possible in the OPTiMaDe API.

    1. The format of the output of the discovery endpoint could be something like this (this is not meant to be an actual format specification, but only an example to discuss the API content - once we make this issue an accepted feature, we will discuss the format in more detail):
    {
       "db_provider_name": "Example",
       "db_provider_description": "DB used for examples, not to be assigned to a real database",
       "db_provider_homepage": "http://example.com",
       "db_provider_optimade_prefix": "exmpl",
       "sub_databases": {
            "main": {
                "base_url": "http://example.com/mainDB/optimade/"
            },
            "test": {
                "base_url": "http://example.com/testDB/optimade/"
            },
            "third": {
                "base_url": "http://example.com/AThirdDB/optimade/"
            }
       },
       "default_sub_database": "main",
       "optimade_test_sub_database": "test"
    }
    • the default_sub_database is optional and should be provided only by DB providers that want clients to point to a given sub-DB by default.
    • the optimade_test_sub_database is optional and should be exposed only by those DB providers that expose a test DB as discussed in #44.
    • we can include additional metadata, e.g. the version of this information etc., if we think it's useful.
    • I made each sub-DB a dictionary in case we want to add more metadata here in the future (e.g. multiple test DBs of different formats, whether the sub-DB is open or requires authentication, a suggested rate limit for querying, etc.)
    • Probably we will need to implement a small extension to my example above to support pagination of the sub-DBs.
    • as noted above, this response can be hardcoded in a static JSON file if the list of sub-DBs does not change (quickly) over time.
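
To make the flow above concrete, here is a minimal client-side sketch in Python of consuming such a discovery endpoint; the URL and member names are taken from the illustrative example above and are assumptions, not an agreed format:

# Minimal client-side sketch (Python) of consuming the proposed discovery
# endpoint. The URL and member names follow the illustrative example above
# and are assumptions, not an agreed format.
import json
from urllib.request import urlopen

DISCOVERY_URL = "http://www.example.com/optimade-discovery/"  # hypothetical

with urlopen(DISCOVERY_URL) as response:
    discovery = json.load(response)

sub_dbs = discovery["sub_databases"]
default_name = discovery.get("default_sub_database")
if default_name is not None:
    # the provider suggests a sub-DB that clients should point to by default
    base_url = sub_dbs[default_name]["base_url"]
else:
    # otherwise let the user (or the client) pick among all advertised sub-DBs
    base_url = next(iter(sub_dbs.values()))["base_url"]

print("Querying OPTiMaDe API at:", base_url)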

Finally, additional side-effects:

giovannipizzi commented 6 years ago

If you agree on the main idea, I would add this to the 1.0 release (as this is essential for us) and we iterate via comments on the actual response format.

ltalirz commented 6 years ago

Sounds good to me!

Two minor comments:

rartino commented 6 years ago

This is nicely thought-out, and I like the careful thought on useful members in point 4.

However, is there a good reason not to do this inside the OPTIMaDe API itself, i.e., more in line with the 'children' mechanism in #22?

More specifically, one can follow the spirit of #22 but mimic this proposal as closely as possible by:

This would lead to an /info format like this:

{
  "data": [
    {
      "type": "info",
      "id": "/",
      "attributes": {
        "db_provider_name": "Example",
        "db_provider_description": "DB used for examples, not to be assigned to a real database",
        "db_provider_homepage": "http://example.com",
        "db_provider_optimade_prefix": "exmpl",
        "api_version": "v0.9",
        "available_api_versions": {
          "0.9.5": "http://db.example.com/optimade/v0.9/",
          "0.9.2": "http://db.example.com/optimade/v0.9.2/",
          "1.0.2": "http://db.example.com/optimade/v1.0/"
        }
      },
      "sub_databases": {
         "main": {
             "base_url": "http://example.com/mainDB/optimade/"
         },
         "test": {
             "base_url": "http://example.com/testDB/optimade/"
         },
         "third": {
             "base_url": "http://example.com/AThirdDB/optimade/"
         }
      },
      "formats": [
        "json",
        "xml"
      ]
      // ...
    }
  ]
}

And we can then wait with, e.g., "siblings" etc., if that seems a bigger step at this time.

Isn't this more straightforward than, technically, creating another new (and non-JSON-API-conforming) indirection API on top of the OPTIMaDe API?

(I'll add that I intentionally left out optimade_test_sub_database, but it can of course be included under attributes if it must. But, if this attribute only points at a single database being the test database, cannot that just be forced to always be the "test" key of sub_databases? Is there really a need for people to label a database "test" without it taking this meaning?)

giovannipizzi commented 6 years ago

Thanks @rartino @ltalirz for the comments.

To answer these points

I believe all OPTiMaDe APIs should implement the discover endpoint [...] i.e. your example url could read http://example.com/mainDB/optimade/v0.9/discover

Add all the info suggested in point 4 to the regular OPTIMaDe /info endpoint output.

I'm ok with both comments (and I agree that avoiding defining an additional API is better if possible), with the following two caveats (OK for us, but just for discussion):

Moreover, this technically allows any level of nesting for sub-DBs (as a sub-DB can now expose further sub-DBs). Is this something we really want to allow? (Summarising, the differences between the two proposals are mainly:

giovannipizzi commented 6 years ago

(I'll add that I intentionally left out optimade_test_sub_database, but it can of course be included under attributes if it must. But, if this attribute only points at a single database being the test database, cannot that just be forced to always be the "test" key of sub_databases? Is there really a need for people to label a database "test" without it taking this meaning?)

I agree to leave this out until we agree on #44. However, I don't think we should make restrictions on how the sub-DB namespace is used by each DB. Some motivations:

ltalirz commented 6 years ago

this requires all DBs to implement the concept of multi-sub-DBs in the API,

The basic concept, yes. In order to reduce implementation effort, one could allow for a simplified form of the info/discover endpoint, if there is just one database to serve.

and each sub-db should know about all the siblings

Why? mainDB could know about the others, while all of the others could know just about themselves.

this might suggest that the 'main' DB is a 'meta' DB gathering content from all sub-DBs. As we discussed, this is potentially bad because a client looping over all databases might retrieve the same data multiple times.

Well, I see the meta DB as a service that you can choose to use or not. I.e. as a client

In logical terms, this problem is solved by adding a meta label to meta DBs that clients can use to make this decision (the 'leaves of the DB tree' would not have this label).

Still, it would be nice to talk to someone who knows a bit about distributed databases. I imagine this problem has already been solved, perhaps in a different way.

rartino commented 6 years ago

@ltalirz are there major benefits with a /discover endpoint over having this info as optional members under the /info endpoint? Advantages I see with /info:

@giovannipizzi

this requires all DBs to implement the concept of multi-sub-DBs in the API, and each sub-db should know about all the siblings (for us, anyway, this wouldn't be a problem so I'm ok with it)

Somewhat echoing @ltalirz, I don't see why. A DB that doesn't care will just leave out the optional members in /info?

On the other hand, with a separate indirection API, every client now needs to implement handling of being given both indirection urls and direct OPTIMaDe urls.

Let me also suggest as an alternative that we could say that the sub_databases member does NOT represent a parent/child relationship, but rather that all sub_databases are supposed to indicate all the other ones as sub_databases, i.e., providing horizontal discovery rather than a tree. This might lead to a nicer end-user UI, where it doesn't matter which OPTIMaDe base_url a user has found: all alternative databases can still be discovered and exposed to them. This also removes the need to regard one DB as the 'top' one (though one still has to choose which base_url to put in the spec appendix).
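
A purely illustrative sketch of this 'horizontal' reading, assuming every sub-database's /info reply carries the same complete sub_databases member (member names as in the example /info output above):

# Sketch of "horizontal" discovery: every sub-database lists all of its peers,
# so a client reaches the full set no matter which base_url it starts from.
# Member names follow the illustrative /info example above; this is an
# assumption, not an agreed format.
def peer_base_urls(info_response: dict) -> list:
    first = info_response["data"][0]
    sub_dbs = first.get("sub_databases", {})
    return [entry["base_url"] for entry in sub_dbs.values()]

# The same full set is discovered regardless of the starting sub-database:
info_from_test_db = {
    "data": [{
        "type": "info",
        "id": "/",
        "sub_databases": {
            "main": {"base_url": "http://example.com/mainDB/optimade/"},
            "test": {"base_url": "http://example.com/testDB/optimade/"},
            "third": {"base_url": "http://example.com/AThirdDB/optimade/"},
        },
    }]
}
print(peer_base_urls(info_from_test_db))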

this might suggest that the 'main' DB is a 'meta' DB gathering content from all sub-DBs.

I don't see a difference in 'suggestion' between the two alternatives? (i.e., indirection API vs. indirection within the OPTIMaDe API). But I think we should write in the specification that "API implementations MUST NOT use the sub_databases relationship to indicate that one base_url aggregates data found at other base_urls. A facility to indicate aggregation-type relationships may be added to the specification at a later point."

Moreover, this technically allows any level of nesting for sub-DBs (as a sub-DB can now expose further sub-DB). Is this something we really want to allow?

I would suggest working around this in the specification: "While the facility to specify subdatabase relationships technically allows such relationships to arbitrary depth, client implementations do not have to expose such relationships beyond the first level."

my proposal does not require one to "define" a main DB (but optionally allows it)

Everything else aside, it seems any facility for sub DBs will need to do a trade-off between:

  1. Add a new layer of indirection that all clients need to deal with - or -
  2. Have clients pointed to a 'default' (though, possibly empty) database from which other databases can (optionally) be discovered.

I see drawbacks to both approaches, but in my mind the comparison comes out somewhat in favor of '2' because it promotes client simplicity.

Finally, I would also like to add to the proposal the suggestion that each entry in sub_databases has a mandatory database_id member:

"sub_databases": {
         "main": {
             "base_url": "http://example.com/mainDB/optimade/",
             "database_id": "_exmpl_xuowd3883281"
         },

Here 'main' is supposed to be a user-friendly name (e.g., exposed to users in a UI), and database_id MUST start with a database-specific prefix and be a unique identifier of this database. This requirement seems a good idea to prepare for future sibling discovery, de-duplication, and aggregation features.
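
For illustration only, a client- or validator-side check of that convention might look like the following; the "_exmpl_" prefix is the example prefix used elsewhere in this thread, and uniqueness itself can only be guaranteed by the provider:

# Illustrative check of the suggested database_id convention: the id must
# start with the provider's OPTiMaDe prefix and carry some identifier after
# it. Uniqueness within the provider cannot be verified locally.
def looks_like_valid_database_id(database_id: str, provider_prefix: str = "_exmpl_") -> bool:
    return database_id.startswith(provider_prefix) and len(database_id) > len(provider_prefix)

assert looks_like_valid_database_id("_exmpl_xuowd3883281")
assert not looks_like_valid_database_id("xuowd3883281")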

giovannipizzi commented 6 years ago

@rartino thanks. I agree with having a database_id.

Regarding your very insightful description of the trade-off between:

  1. Add a new layer of indirection that all clients need to deal with - or -
  2. Have clients pointed to a 'default' (though, possibly empty) database from which other databases can (optionally) be discovered.

I agree, even if I would be tempted to lean towards 1 instead, because it makes sure clients know about the concept; otherwise clients might not implement it and just think the DB is empty. In any case, a "dumb" client (which will probably implement only a minimal, possibly incomplete version of the specs anyway, and therefore might miss the concept of sub-DBs) can, once it discovers that there is a default sub-DB, start using that one directly and forget about the discover endpoint altogether (or, even more simply, the discover endpoint could be read by a human once, and a simple client could then just point to the interesting endpoint).

A note, also, in case my proposal wasn't fully clear: my suggested discover API consists only of that single endpoint and does not require defining an additional, empty DB that, depending on the implementation, might need to be set up and managed etc. (which I see as a disadvantage).

ltalirz commented 6 years ago

@rartino Concerning discover vs info: I'd be fine with either solution

@giovannipizzi

my suggested discover API only consists of that single endpoint and does not require to define an additional, empty DB that depending on the implementation might need to be defined, managed etc (that I see as a disadvantage).

I don't believe there would be the need for an "empty database".

I think we should separate the concept of the API from the database(s) it queries. I agree that there is a distinction between:

  1. forwarding requests to other OPTiMaDe APIs
  2. translating requests to an internal language understood by the DB and querying a DB

We should think about how the API decides what to do, and whether 2. should be a "one-to-one" relationship between API and DB (which I think is what you are currently assuming) or whether we allow "one-to-many" (including "one-to-zero") for 1. and 2.

rartino commented 6 years ago

It seems we have at least two concrete suggestions for v1.0 with different drawbacks? Maybe it is time to bring this discussion to the email list? I'm imagining sending out something like this:


At the workshop it was discussed that for some database providers (e.g., Materials Cloud) there isn't necessarily a "main" database, but many databases covering, e.g., very distinct data domains, or which are managed independently. There is some urgency to get this in place for OPTIMaDe v1.0 to avoid ad-hoc solutions. A more detailed discussion can be found in github issue #46 (\<url>).

At this point two main ways to handle a discovery mechanism for subdatabases are discussed:

  1. To say that when a client retrieves an OPTIMaDe base url, it may instead receive a very simple json-formatted "discovery" reply that points out the sub-databases. (Which often will be possible to implement as a static file.) In a sense this adds an extra separate "mini-API" for discovery.

- or -

  2. That the already existing /info endpoint in the OPTIMaDe API is amended to provide the same type of discovery info.

Notes about option 1:

Notes about option 2:

At this point it would be good to get high-level feedback from this email list about which option is the better to pursue. For those who want to engage in a deeper detailed technical discussion about precisely how either option is implemented, it may be better to let that continue in github issue #46 (\<url>).


(I'm assuming the above summary shows some of my bias for option 2, so, since @giovannipizzi is the top poster, I'm fine with him sending out an edited or rewritten version of this with his own bias instead.)

giovannipizzi commented 6 years ago

Thanks! Sent, with a small change (as this probably wasn't clear from my explanation): in option 1, the DB provider must always provide the discover endpoint (which will be the one specified in the OPTiMaDe specs), even if there is only one DB. This way, clients do not have to guess what type of endpoint it is.

If we don't want to enforce this level of indirection on the clients (which anyway, for DB providers with a single default DB, requires only one request: the default base_url endpoint can then be cached and used directly forever, and the discover endpoint never used again, or it can even be read once by a human and then used as the base_url for the client), then we can also specify in the specs an (optional) default endpoint (even if I wouldn't do so).

fawzi commented 6 years ago

Somehow I had something different in mind for sub-databases. Discovery is a separate issue, but for database selection I was thinking of simply having an additional optional database=XX query parameter on basically all requests; leaving it out would select the default database...
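
A sketch of what that could look like on the wire; the parameter name, filter and base URL are purely illustrative:

# Sketch of the optional database=XX query parameter: the same base_url serves
# every sub-database, and leaving the parameter out selects the default one.
# Parameter name and URL are hypothetical.
from urllib.parse import urlencode

base_url = "http://example.com/optimade/v0.9/structures"
params = {"filter": 'elements="Si,C" AND nelements=2', "database": "test"}
print(base_url + "?" + urlencode(params))
# Omitting "database" from params would query the provider's default database.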

ltalirz commented 6 years ago

After discussing this issue with Giovanni, it turned out that we had quite different ideas of what "sub-databases" meant and what the role of the API in addressing those would be. In order to avoid misunderstandings, I'd like to try and summarize the result of our discussion as briefly and clearly as I can.

Proposal 1 - Keeping a list of APIs

In its essence, Giovanni's suggestion simply is: keep a list of existing APIs. A minimal implementation of this suggestion would just be a file hosted somewhere that contains a list of all known base URIs - no change of the spec needed, no additional effort whatsoever. Each of the APIs in the list has one database it queries, and it knows just about that one. All work involved in querying multiple databases is left to the client. Personally, I would not even speak of "sub-databases" in this context because in this scenario there is no hierarchy - all APIs and all databases are created equal.

I understand why Giovanni did not see the necessity to include it in the OPTiMaDe API spec - in the end, one file is enough (could be hosted in the optimade git repo).

Proposal 2 - APIs that can query multiple data sources (skyscanner)

What I was associating with the term sub-database (and what others might have associated with the "skyscanner" analogy) is something more elaborate: one API being able to query multiple databases, either a. directly querying multiple postgres/mongodb/... databases known to it, or b. delegating queries to multiple other OPTiMaDe APIs known to it.

This is the skyscanner model and it shifts work from the client to the API server. In this scenario, clearly the API needs to know about these databases and it makes sense to extend the specification to have the API tell which databases it is querying (e.g. to help clients avoid querying the same data source multiple times).
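
A minimal sketch of case b. under these assumptions (hypothetical sub-database URLs; no de-duplication, pagination or error handling):

# Sketch of the "skyscanner" model (case b.): one service forwards a filter to
# several other OPTiMaDe APIs and merges the replies. The sub-database URLs
# are hypothetical and no de-duplication or error handling is attempted.
import json
from urllib.parse import quote
from urllib.request import urlopen

SUB_DB_BASE_URLS = [
    "http://example.com/mainDB/optimade/",
    "http://example.com/AThirdDB/optimade/",
]

def fan_out(filter_string: str) -> list:
    results = []
    for base_url in SUB_DB_BASE_URLS:
        url = base_url + "structures?filter=" + quote(filter_string)
        with urlopen(url) as response:
            results.extend(json.load(response).get("data", []))
    return results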

Personal conclusion

Proposal 1: I believe that keeping a list of existing APIs (possibly with some additional metadata, to be decided) can't be wrong - it does not need a modification of the OPTiMaDe API spec, all it needs is a consensus on the information stored in the list. In principle, a list of OPTiMaDe base URLs would be enough since their /info endpoints can then be queried for more information - but one can decide to include a little meta information as well.

Proposal 2: This does call for a modification of the spec (e.g. providing info about queried sub-databases in the /info endpoint as suggested by @rartino, or extending the query language in order to be able to restrict to certain subdatabases as suggested by @fawzi ). It also requires some thought on how cases a. and b. should be handled (whether one should impose restrictions etc.) and how they would be implemented. Perhaps this can be deferred until we have an actual implementation of this case (I believe Materials Cloud might want to provide such a service).

giovannipizzi commented 6 years ago

Thanks! I think this summarises very well what I had in mind, and I would vote for going for 1 now, and moving 2 to a different issue to be discussed and agreed upon later.

rartino commented 6 years ago

@ltalirz Indeed, what you describe as proposal 2 is what I've used the term "aggregation" for above, and in the discussion so far I've viewed it mostly as a separate issue (though, I did propose to add to the spec that the subdatabase feature MUST NOT be used to indicate aggregation...). Let's agree on moving all discussion of this feature into a separate issue.

Nevertheless, this confusion perhaps indicates that 'subdatabase' is not the best term for what we are discussing.

@giovannipizzi

If we don't want to enforce this level of indirection on the clients (which anyway, for DB providers with a single default DB, requires only one request: the default base_url endpoint can then be cached and used directly forever, and the discover endpoint never used again, or it can even be read once by a human and then used as the base_url for the client), then we can also specify in the specs an (optional) default endpoint (even if I wouldn't do so).

In my mind we need to have client implementations aware of the indirection, and re-querying it to discover changes. That way I see uses for this feature in the interaction between clients and end users.

If we don't do that, what is the practical use for this feature, compared to just a list of your various base_urls at your website, in a format written for humans? (We can give the link to that page in the spec.)

@fawzi a "database=XX" query parameter is indeed a straightforward approach. (But we still need a discovery feature.) However, for someone whose sub-databases use different backends, it may not be ideal to be forced to serve them all from the same base_url. The other way around, with separate base_urls there is nothing preventing you from serving them from one implementation (even from one URL, using e.g. URL rewriting).

giovannipizzi commented 6 years ago

In my mind we need to have client implementations aware of the indirection, and re-querying it to discover changes. That way I see uses for this feature in the interaction between clients and end users.

Indeed, you are right and I agree. I also see clients being able to parse this indirection level. My point was that, if one wants to start with writing a simple client, this level can be implemented at a later stage, so this suggestion does not make writing a simple client more complex.

rartino commented 5 years ago

We had some continued discussion about this on an OPTIMaDe call after Giovanni left.

In my impression, @sauliusg, @fawzi and I were all skeptical towards what is essentially a separate API for database discovery. Rather, we went in the direction of wanting to extend OPTIMaDe with a /databases endpoint for this. It would then be clearly specified that this endpoint supports no filtering or similar; its only function is to return a full list of OPTIMaDe 'base urls', which is then the same 'static json' that @giovannipizzi is asking for. This design makes it possible to re-frame the issue of having to provide an "empty database": one can now rather say that the initial URL simply points to a database that only contains entries under /databases, but no structures or computations.

However, after leaving the call I thought some more about this. Specifically, what will I do in my setup the day I decide to provide multiple databases. With the "empty database" or "database of databases" solutions, I may need to suddenly introduce a new level of indirection, which breaks e.g. scripts of my users who previously have just connected to my base url and queried structures. Now that will suddenly return an error, and they have to discover that they need to go down one level in the hierarchy.

With @giovannipizzi's solution everyone is forced into a one-level URL indirection that makes it trivial to branch out into multiple databases at a later point. Maybe this forced indirection is actually a very good thing to avoid lots of breakage across database providers, who otherwise are likely to go through an evolution of "oops, now we need another level of indirection here".

sauliusg commented 5 years ago

Dear Rickard,

thank you for your comments!

My short notes to some points below:

On 2018-12-20 02:56, Rickard Armiento wrote:

However, after leaving the call I thought some more about this. Specifically, what will I do in my setup the day I decide to provide multiple databases. With the "empty database" or "database of databases" solutions, I may need to suddenly introduce a new level of indirection, which breaks e.g. scripts of my users who previously have just connected to my base url and queried structures. Now that will suddenly return an error, and they have to discover that they need to go down one level in the hierarchy.

That need not be so:

If the OPTiMaDe API permits a "/databases" endpoint, then any database can provide a list of databases and/or endpoints. So, your original database keeps serving its contents as it used to (so all users' scripts keep working), and starts providing a list of new databases under the "/databases" endpoint (a list that was originally empty), so that users can dynamically discover new databases and endpoints.

Backwards compatibility is very important; nobody wants to change their software each time upstream breaks something, for sure :)

With @giovannipizzi's solution everyone is forced into a one-level URL indirection that makes it trivial to branch out into multiple databases at a later point. Maybe this forced indirection is actually a very good thing to avoid lots of breakage across database providers, who otherwise are likely to go through an evolution of "oops, now we need another level of indirection here".

Forcing people to do something (or, rather, restricting choice without necessity) is a Bad Thing (TM). One day we will run into contradictory requirements and have no good way to address them. Too bad.

Since OPTiMaDe REST describes several "REST endpoints", there is no contradiction in also having a "/databases" endpoint to list all available databases, and an "/endpoints" endpoint to provide the list of REST endpoints in a database; including itself :), the endpoint "/endpoints" (see how nicely recursive and self-describing it is!).

In this case, MaterialsCloud is free to provide a ".../materialscloud/optimade/v1.0/databases" endpoint (EP) to list all databases, in JSON or XML or whatever is in the specification, under a specified schema. If ".../materialscloud/db1/" is one such database, then ".../materialscloud/db1/optimade/info" would be its mandatory info endpoint, and ".../materialscloud/db1/optimade/endpoints" would list all EPs that can be queried; e.g. if ".../materialscloud/db1/optimade/structures" is one such endpoint, I can submit the query ".../materialscloud/db1/optimade/structures?filter=elements="Si,C"+AND+nelements=2" to get all silicon carbide structures, for instance.

COD, however, would probably not have a dedicated "https://crystallography.net/optimade/databases" EP, but instead provide identical OPTiMaDe interfaces for all our databases: "https://crystallography.net/optimade/cod/databases", "https://crystallography.net/optimade/tcod/databases", "https://crystallography.net/optimade/pcod/databases", etc.

All this is fully automatable. It solves interface problems, if not semantic ones, IMHO.


sauliusg commented 5 years ago

On 2018-12-20 02:56, Rickard Armiento wrote:

Maybe this forced indirection is actually a very good thing to avoid lots of breakage across database providers, who otherwise are likely to go through an evolution of "oops, now we need another level of indirection here".

That's why I would advocate for an "/endpoints" endpoint, rather than for "/databases" (although the latter is also useful). You would query "/endpoints" and get a list of all queryable OPTiMaDe endpoints on this site. If you want, you can recursively descend deeper if "/endpoints" is among the endpoints that were returned :)
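
A purely illustrative sketch of that recursive descent, assuming (this is not specified anywhere) that the /endpoints reply is simply a JSON list of absolute URLs:

# Sketch of recursive discovery via a hypothetical /endpoints endpoint: query
# it, and descend into every listed URL that itself answers to /endpoints.
# The reply format (a JSON list of absolute URLs) is an assumption.
import json
from urllib.error import URLError
from urllib.request import urlopen

def discover(base_url: str, depth: int = 2) -> set:
    found = set()
    try:
        with urlopen(base_url.rstrip("/") + "/endpoints") as response:
            listed = json.load(response)
    except URLError:
        return found  # a plain query endpoint, not a database with sub-endpoints
    for url in listed:
        found.add(url)
        if depth > 0:
            found |= discover(url, depth - 1)
    return found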


giovannipizzi commented 5 years ago

@sauliusg the point here is that we also want to do what you do:

provide identical OPTiMaDe interfaces for all our databases:

i.e. we want to provide identical OPTiMaDe interfaces for all our databases:

The point is that you registered each of them already in the specifications:

_cod_: Crystallography Open Database
_pcod_: Predicted Crystallography Open Database
_tcod_: Theoretical Crystallography Open Database

because in your case you have a finite list and you don't expect to create new databases very often, so it's OK to go back and make a change/PR to the specs.

In our case a new DB (think e.g. of a GitHub repo) can be created by a user at any time, and we cannot register each of them in the OPTiMaDe specs every time.

To put this in the context of COD/TCOD/...: would you be OK with registering JUST the following?

_cryst_: databases provided by crystallography.net

and then provide the list of the existing ones within the API you suggest?

When users go directly to the _cryst_ endpoint and find, for instance, a default empty database, I think this would be unexpected. Or you could pick the COD as the 'default' database (and maybe in your case that's OK), but then the PCOD and TCOD databases are very hard to reach if you don't know that they exist and that you have to look for them.

rartino commented 5 years ago

Is there a reason not to close this issue now that a version of what was discussed has been merged?