Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Support sub-databases within each database #46

Closed giovannipizzi closed 5 years ago

giovannipizzi commented 6 years ago

This issue suggests a possible implementation of the concept of sub-databases within a database provider registered in the OPTiMaDe specification.

This is useful for database providers (like e.g. Materials Cloud) where there isn't necessarily a "main" database, but many databases, all provided by the same database provider (Materials Cloud, in this case) and exposing the same API, but managed independently by different researchers (as an analogue, imagine GitHub repositories, where each user manages their own repository, but the API for all of them is provided by GitHub itself).

While this could be achieved by implementing the current API for each sub-DB, this would require registering each of them in the current specification (which, at the bottom, lists DB providers and their URLs), and this is not a viable approach for cases where the sub-DBs within one registered DB provider can be added and removed by users at any time.

This proposal has the following aims:

As side-effect:

Proposed implementation

  1. Implement the same API (the one already defined by OPTiMaDe) for each sub-DB, where each sub-DB has a different base_url, for instance:

    (versions might be different for each DB - no requirement here). It is up to the implementation to decide the best approach to reuse the API code.

  2. In the OPTiMaDe API specification Appendix, where now we list only a prefix with a generic name of the DB, e.g.:

    _exmpl_: used for examples, not to be assigned to a real database

    replace this with a common, well-defined API endpoint, e.g. http://www.example.com/optimade-discovery/, like

    _exmpl_: http://www.example.com/optimade-discovery/ (used for examples, not to be assigned to a real database)

    whose format we describe below. Ideally, I think we should also add a machine-readable list for automated discovery.

  3. The http://www.example.com/optimade-discovery/ endpoint returns generic information on the given DB provider and its sub-DBs, and in the future it can also be extended (this is e.g. a good place to add siblings as discussed in #22). Moreover, it can also contain DB-specific extensions using the prefix, as already possible in the OPTiMaDe API.

    1. The format of the output of the discovery endpoint could be something like this (this is not meant to be an actual format specification, but only an example to discuss the API content - once we make this issue an accepted feature, we will discuss the format in more detail):
    {
       "db_provider_name": "Example",
       "db_provider_description": "DB used for examples, not to be assigned to a real database",
       "db_provider_homepage": "http://example.com",
       "db_provider_optimade_prefix": "exmpl",
       "sub_databases": {
            "main": {
                "base_url": "http://example.com/mainDB/optimade/"
            },
            "test": {
                "base_url": "http://example.com/testDB/optimade/"
            },
            "third": {
                "base_url": "http://example.com/AThirdDB/optimade/"
            }
       },
       "default_sub_database": "main",
       "optimade_test_sub_database": "test"
    }
    • the default_sub_database is optional and should be provided only by DB providers that want clients to point to a given sub-DB by default.
    • the optimade_test_sub_database is optional and should be exposed only by those DB providers that expose a test DB as discussed in #44.
    • we can include additional metadata, e.g. the version of this information etc., if we think it's useful.
    • I made each sub-DB a dictionary in case we want to add more metadata here in the future (e.g. multiple test DBs of different formats, whether the sub-DB is open or requires authentication, a suggested rate limit for querying, etc.)
    • Probably we will need to implement a small extension to my example above to support pagination of the sub-DBs.
    • as noted above, this response can be hardcoded in a static JSON file if the list of sub-DBs does not change (quickly) over time.
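
To make the flow above concrete, here is a minimal client-side sketch in Python of consuming such a discovery endpoint; the URL and member names are taken from the illustrative example above and are assumptions, not an agreed format:

# Minimal client-side sketch (Python) of consuming the proposed discovery
# endpoint. The URL and member names follow the illustrative example above
# and are assumptions, not an agreed format.
import json
from urllib.request import urlopen

DISCOVERY_URL = "http://www.example.com/optimade-discovery/"  # hypothetical

with urlopen(DISCOVERY_URL) as response:
    discovery = json.load(response)

sub_dbs = discovery["sub_databases"]
default_name = discovery.get("default_sub_database")
if default_name is not None:
    # the provider suggests a sub-DB that clients should point to by default
    base_url = sub_dbs[default_name]["base_url"]
else:
    # otherwise let the user (or the client) pick among all advertised sub-DBs
    base_url = next(iter(sub_dbs.values()))["base_url"]

print("Querying OPTiMaDe API at:", base_url)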

Finally, additional side-effects:

giovannipizzi commented 6 years ago

If you agree on the main idea, I would add this to the 1.0 release (as this is essential for us) and we iterate via comments on the actual response format.

ltalirz commented 6 years ago

Sounds good to me!

Two minor comments:

rartino commented 6 years ago

This is nicely thought-out, and I like the careful thought on useful members in point 4.

However, is there a good reason not to do this inside the OPTIMaDe API itself, i.e., more in line with the 'children' mechanism in #22?

More specifically, one can follow the spirit of #22 but mimic this proposal as closely as possible by:

This would lead to an /info format like this:

{
  "data": [
    {
      "type": "info",
      "id": "/",
      "attributes": {
        "db_provider_name": "Example",
        "db_provider_description": "DB used for examples, not to be assigned to a real database",
        "db_provider_homepage": "http://example.com",
        "db_provider_optimade_prefix": "exmpl",
        "api_version": "v0.9",
        "available_api_versions": {
          "0.9.5": "http://db.example.com/optimade/v0.9/",
          "0.9.2": "http://db.example.com/optimade/v0.9.2/",
          "1.0.2": "http://db.example.com/optimade/v1.0/"
        }
      },
      "sub_databases": {
         "main": {
             "base_url": "http://example.com/mainDB/optimade/"
         },
         "test": {
             "base_url": "http://example.com/testDB/optimade/"
         },
         "third": {
             "base_url": "http://example.com/AThirdDB/optimade/"
         }
      },
      "formats": [
        "json",
        "xml"
      ]
      // ...
    }
  ]
}

And we can then wait with, e.g., "siblings" etc., if that seems a bigger step at this time.

Isn't this more straightforward than, technically, creating another new (and non-JSON-API-conforming) indirection API on top of the OPTIMaDe API?

(I'll add that I intentionally left out optimade_test_sub_database, but it can of course be included under attributes if it must. But, if this attribute only points at a single database being the test database, cannot that just be forced to always be the "test" key of sub_databases? Is there really a need for people to label a database "test" without it taking this meaning?)

giovannipizzi commented 6 years ago

Thanks @rartino @ltalirz for the comments.

To answer these points

I believe all OPTiMaDe APIs should implement the discover endpoint [...] i.e. your example url could read http://example.com/mainDB/optimade/v0.9/discover

Add all the info suggested in point 4 to the regular OPTIMaDe /info endpoint output.

I'm ok with both comments (and I agree that avoiding defining an additional API is better if possible), with the following two caveats (OK for us, but just for discussion):

Moreover, this technically allows any level of nesting for sub-DBs (as a sub-DB can now expose further sub-DBs). Is this something we really want to allow? (Summarising, the differences between the two proposals are mainly:

giovannipizzi commented 6 years ago

(I'll add that I intentionally left out optimade_test_sub_database, but it can of course be included under attributes if it must. But, if this attribute only points at a single database being the test database, cannot that just be forced to always be the "test" key of sub_databases? Is there really a need for people to label a database "test" without it taking this meaning?)

I agree to leave this out until we agree on #44. However, I don't think we should make restrictions on how the sub-DB namespace is used by each DB. Some motivations:

ltalirz commented 6 years ago

this requires all DBs to implement the concept of multi-sub-DBs in the API,

The basic concept, yes. In order to reduce implementation effort, one could allow for a simplified form of the info/discover endpoint, if there is just one database to serve.

and each sub-db should know about all the siblings

Why? mainDB could know about the others, while all of the others could know just about themselves.

this might suggest that the 'main' DB is a 'meta' DB gathering content from all sub-DBs. As we discussed, this is potentially bad because a client looping over all databases might retrieve the same data multiple times.

Well, I see the meta DB as a service that you can choose to use or not. I.e. as a client

In logical terms, this problem is solved by adding a meta label to meta DBs that clients can use to make this decision (the 'leaves of the DB tree' would not have this label).

Still, it would be nice to talk to someone who knows a bit about distributed databases. I imagine this problem has already been solved, perhaps in a different way.

rartino commented 6 years ago

@ltalirz are there major benefits with a /discover endpoint over having this info as optional members under the /info endpoint? Advantages I see with /info:

@giovannipizzi

this requires all DBs to implement the concept of multi-sub-DBs in the API, and each sub-db should know about all the siblings (for us, anyway, this wouldn't be a problem so I'm ok with it)

Somewhat echoing @ltalirz, I don't see why. A DB that doesn't care will just leave out the optional members in /info?

On the other hand, with a separate indirection API, every client now needs to implement handling of being given both indirection urls and direct OPTIMaDe urls.

Let me also suggest as an alternative that we could say that the sub_databases member does NOT represent a parent/child relationship, but rather that all sub_databases are supposed to indicate all the other ones as sub_databases, i.e., providing horizontal discovery rather than a tree. This might lead to a nicer end-user UI, where it doesn't matter which OPTIMaDe base_url a user has found: all alternative databases can still be discovered and exposed to them. This also removes the need to regard one DB as the 'top' one (though one still has to choose which base_url to put in the spec appendix).
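
A purely illustrative sketch of this 'horizontal' reading, assuming every sub-database's /info reply carries the same complete sub_databases member (member names as in the example /info output above):

# Sketch of "horizontal" discovery: every sub-database lists all of its peers,
# so a client reaches the full set no matter which base_url it starts from.
# Member names follow the illustrative /info example above; this is an
# assumption, not an agreed format.
def peer_base_urls(info_response: dict) -> list:
    first = info_response["data"][0]
    sub_dbs = first.get("sub_databases", {})
    return [entry["base_url"] for entry in sub_dbs.values()]

# The same full set is discovered regardless of the starting sub-database:
info_from_test_db = {
    "data": [{
        "type": "info",
        "id": "/",
        "sub_databases": {
            "main": {"base_url": "http://example.com/mainDB/optimade/"},
            "test": {"base_url": "http://example.com/testDB/optimade/"},
            "third": {"base_url": "http://example.com/AThirdDB/optimade/"},
        },
    }]
}
print(peer_base_urls(info_from_test_db))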

this might suggest that the 'main' DB is a 'meta' DB gathering content from all sub-DBs.

I don't see a difference in 'suggestion' between the two alternatives? (i.e., indirection API vs. indirection within the OPTIMaDe API). But I think we should write in the specification that "API implementations MUST NOT use the sub_databases relationship to indicate that one base_url aggregates data found at other base_urls. A facility to indicate aggregation-type relationships may be added to the specification at a later point."

Moreover, this technically allows any level of nesting for sub-DBs (as a sub-DB can now expose further sub-DB). Is this something we really want to allow?

I would suggest working around this in the specification: "While the facility to specify subdatabase relationships technically allows such relationships to arbitrary depth, client implementations do not have to expose such relationships beyond the first level."

my proposal does not require one to "define" a main DB (but optionally allows it)

Everything else aside, it seems any facility for sub DBs will need to do a trade-off between:

  1. Add a new layer of indirection that all clients need to deal with - or -
  2. Have clients pointed to a 'default' (though, possibly empty) database from which other databases can (optionally) be discovered.

I see drawbacks to both approaches, but in my mind the comparison comes out somewhat in favor of '2' because it promotes client simplicity.

Finally, I would also like to add to the proposal the suggestion that each entry in sub_databases has a mandatory database_id member:

"sub_databases": {
         "main": {
             "base_url": "http://example.com/mainDB/optimade/",
             "database_id": "_exmpl_xuowd3883281"
         },

Here 'main' is supposed to be a user-friendly name (e.g., exposed to users in a UI), and database_id MUST start with a database-specific prefix and be a unique identifier of this database. This requirement seems a good idea to prepare for future sibling discovery, de-duplication, and aggregation features.
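
For illustration only, a client- or validator-side check of that convention might look like the following; the "_exmpl_" prefix is the example prefix used elsewhere in this thread, and uniqueness itself can only be guaranteed by the provider:

# Illustrative check of the suggested database_id convention: the id must
# start with the provider's OPTiMaDe prefix and carry some identifier after
# it. Uniqueness within the provider cannot be verified locally.
def looks_like_valid_database_id(database_id: str, provider_prefix: str = "_exmpl_") -> bool:
    return database_id.startswith(provider_prefix) and len(database_id) > len(provider_prefix)

assert looks_like_valid_database_id("_exmpl_xuowd3883281")
assert not looks_like_valid_database_id("xuowd3883281")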

giovannipizzi commented 6 years ago

@rartino thanks. I agree with having a database_id.

Regarding your very insightful description of the trade-off between:

  1. Add a new layer of indirection that all clients need to deal with - or -
  2. Have clients pointed to a 'default' (though, possibly empty) database from which other databases can (optionally) be discovered.

I agree, even if I would be tempted to lean towards 1 instead, because it makes sure clients know about the concept; otherwise clients might not implement it and just think the DB is empty. In any case, a "dumb" client (which will probably implement only a minimal, possibly incomplete version of the specs anyway, and therefore might miss the concept of sub-DBs) can, once it discovers that there is a default sub-DB, start using that one directly and forget about the discover endpoint altogether (or, even more simply, the discover endpoint could be read by a human once, and a simple client could then just point to the interesting endpoint).

A note, also, in case my proposal wasn't fully clear: my suggested discover API consists only of that single endpoint and does not require defining an additional, empty DB that, depending on the implementation, might need to be set up and managed etc. (which I see as a disadvantage).

ltalirz commented 6 years ago

@rartino Concerning discover vs info: I'd be fine with either solution

@giovannipizzi

my suggested discover API only consists of that single endpoint and does not require to define an additional, empty DB that depending on the implementation might need to be defined, managed etc (that I see as a disadvantage).

I don't believe there would be the need for an "empty database".

I think we should separate the concept of the API from the database(s) it queries. I agree that there is a distinction between:

  1. forwarding requests to other OPTiMaDe APIs
  2. translating requests to an internal language understood by the DB and querying a DB

We should think about how the API decides what to do, and whether 2. should be a "one-to-one" relationship between API and DB (which I think is what you are currently assuming) or whether we allow "one-to-many" (including "one-to-zero") for 1. and 2.

rartino commented 6 years ago

It seems we have at least two concrete suggestions for v1.0 with different drawbacks? Maybe it is time to bring this discussion to the email list? I'm imagining sending out something like this:


At the workshop it was discussed that for some database providers (e.g., Materials Cloud) there isn't necessarily a "main" database, but many databases covering, e.g., very distinct data domains, or which are managed independently. There is some urgency to get this in place for OPTIMaDe v1.0 to avoid ad-hoc solutions. A more detailed discussion can be found in github issue #46 (\<url>).

At this point two main ways to handle a discovery mechanism for subdatabases are discussed:

  1. To say that when a client retrieves an OPTIMaDe base url, it may instead receive a very simple json-formatted "discovery" reply that points out the sub-databases. (Which often will be possible to implement as a static file.) In a sense this adds an extra separate "mini-API" for discovery.

- or -

  2. That the already existing /info endpoint in the OPTIMaDe API is amended to provide the same type of discovery info.

Notes about option 1:

Notes about option 2:

At this point it would be good to get high-level feedback from this email list about which option is the better to pursue. For those who want to engage in a deeper detailed technical discussion about precisely how either option is implemented, it may be better to let that continue in github issue #46 (\<url>).


(I'm assuming the above summary shows some of my bias for option 2, so, since @giovannipizzi is the top poster, I'm fine with him sending out an edited or rewritten version of this with his own bias instead.)

giovannipizzi commented 6 years ago

Thanks! Sent, with a small change (as this probably wasn't clear from my explanation): in option 1, the DB provider must always provide the discover endpoint (which will be the one specified in the OPTiMaDe specs), even if there is only one DB. This way, clients do not have to guess what type of endpoint it is.

If we don't want to enforce this level of indirection on the clients (which anyway, for DB providers with a single default DB, requires only one request: the default base_url endpoint can then be cached and used directly forever, and the discover endpoint never used again, or it can even be read once by a human and then used as the base_url for the client), then we can also specify in the specs an (optional) default endpoint (even if I wouldn't do so).

fawzi commented 6 years ago

Somehow I had something different in mind for sub-databases. Discovery is a separate issue, but for database selection I was thinking of simply having an additional optional database=XX query parameter on basically all requests; leaving it out would select the default database...
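
A sketch of what that could look like on the wire; the parameter name, filter and base URL are purely illustrative:

# Sketch of the optional database=XX query parameter: the same base_url serves
# every sub-database, and leaving the parameter out selects the default one.
# Parameter name and URL are hypothetical.
from urllib.parse import urlencode

base_url = "http://example.com/optimade/v0.9/structures"
params = {"filter": 'elements="Si,C" AND nelements=2', "database": "test"}
print(base_url + "?" + urlencode(params))
# Omitting "database" from params would query the provider's default database.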

ltalirz commented 6 years ago

After discussing this issue with Giovanni, it turned out that we had quite different ideas of what "sub-databases" meant and what the role of the API in addressing those would be. In order to avoid misunderstandings, I'd like to try and summarize the result of our discussion as briefly and clearly as I can.

Proposal 1 - Keeping a list of APIs

In its essence, Giovanni's suggestion simply is: keep a list of existing APIs. A minimal implementation of this suggestion would just be a file hosted somewhere that contains a list of all known base URIs - no change of the spec needed, no additional effort whatsoever. Each of the APIs in the list has one database it queries, and it knows just about that one. All work involved in querying multiple databases is left to the client. Personally, I would not even speak of "sub-databases" in this context because in this scenario there is no hierarchy - all APIs and all databases are created equal.

I understand why Giovanni did not see the necessity to include it in the OPTiMaDe API spec - in the end, one file is enough (could be hosted in the optimade git repo).

Proposal 2 - APIs that can query multiple data sources (skyscanner)

What I was associating with the term sub-database (and what others might have associated with the "skyscanner" analogy) is something more elaborate: one API being able to query multiple databases, either a. directly querying multiple postgres/mongodb/... databases known to it, or b. delegating queries to multiple other OPTiMaDe APIs known to it.

This is the skyscanner model and it shifts work from the client to the API server. In this scenario, clearly the API needs to know about these databases and it makes sense to extend the specification to have the API tell which databases it is querying (e.g. to help clients avoid querying the same data source multiple times).
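
A minimal sketch of case b. under these assumptions (hypothetical sub-database URLs; no de-duplication, pagination or error handling):

# Sketch of the "skyscanner" model (case b.): one service forwards a filter to
# several other OPTiMaDe APIs and merges the replies. The sub-database URLs
# are hypothetical and no de-duplication or error handling is attempted.
import json
from urllib.parse import quote
from urllib.request import urlopen

SUB_DB_BASE_URLS = [
    "http://example.com/mainDB/optimade/",
    "http://example.com/AThirdDB/optimade/",
]

def fan_out(filter_string: str) -> list:
    results = []
    for base_url in SUB_DB_BASE_URLS:
        url = base_url + "structures?filter=" + quote(filter_string)
        with urlopen(url) as response:
            results.extend(json.load(response).get("data", []))
    return results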

Personal conclusion

Proposal 1: I believe that keeping a list of existing APIs (possibly with some additional metadata, to be decided) can't be wrong - it does not need a modification of the OPTiMaDe API spec, all it needs is a consensus on the information stored in the list. In principle, a list of OPTiMaDe base URLs would be enough since their /info endpoints can then be queried for more information - but one can decide to include a little meta information as well.

Proposal 2: This does call for a modification of the spec (e.g. providing info about queried sub-databases in the /info endpoint as suggested by @rartino, or extending the query language in order to be able to restrict to certain subdatabases as suggested by @fawzi ). It also requires some thought on how cases a. and b. should be handled (whether one should impose restrictions etc.) and how they would be implemented. Perhaps this can be deferred until we have an actual implementation of this case (I believe Materials Cloud might want to provide such a service).

giovannipizzi commented 6 years ago

Thanks! I think this summarises very well what I had in mind, and I would vote for going for 1 now, and moving 2 to a different issue to be discussed and agreed upon later.

rartino commented 6 years ago

@ltalirz Indeed, what you describe as proposal 2 is what I've used the term "aggregation" for above, and in the discussion so far I've viewed it mostly as a separate issue (though, I did propose to add to the spec that the subdatabase feature MUST NOT be used to indicate aggregation...). Let's agree on moving all discussion of this feature into a separate issue.

Nevertheless, this confusion perhaps indicates that 'subdatabase' is not the best term for what we are discussing.

@giovannipizzi

If we don't want to enforce this level of indirection on the clients (which anyway, for DB providers with a single default DB, requires only one request: the default base_url endpoint can then be cached and used directly forever, and the discover endpoint never used again, or it can even be read once by a human and then used as the base_url for the client), then we can also specify in the specs an (optional) default endpoint (even if I wouldn't do so).

In my mind we need to have client implementations aware of the indirection, and re-querying it to discover changes. That way I see uses for this feature in the interaction between clients and end users.

If we don't do that, what is the practical use for this feature, compared to just a list of your various base_urls at your website, in a format written for humans? (We can give the link to that page in the spec.)

@fawzi a "database=XX" query parameter is indeed a straightforward approach. (But we still need a discovery feature.) However, for someone whose sub-databases use different backends, it may not be ideal to be forced to serve them all from the same base_url. The other way around, with separate base_urls there is nothing preventing you from serving them from one implementation (even from one URL, using e.g. URL rewriting).

giovannipizzi commented 6 years ago

In my mind we need to have client implementations aware of the indirection, and re-querying it to discover changes. That way I see uses for this feature in the interaction between clients and end users.

Indeed, you are right and I agree. I also see clients being able to parse this indirection level. My point was that, if one wants to start with writing a simple client, this level can be implemented at a later stage, so this suggestion does not make writing a simple client more complex.

rartino commented 5 years ago

We had some continued discussion about this on an OPTIMaDe call after Giovanni left.

In my impression, @sauliusg, @fawzi and I were all skeptical towards what is essentially a separate API for database discovery. Rather, we went in the direction of wanting to extend OPTIMaDe with a /databases endpoint for this. It would then be clearly specified that this endpoint supports no filtering or similar; its only function is to return a full list of OPTIMaDe 'base urls', which is then the same 'static json' that @giovannipizzi is asking for. This design makes it possible to re-frame the issue of having to provide an "empty database": one can now rather say that the initial URL simply points to a database that only contains entries under /databases, but no structures or computations.

However, after leaving the call I thought some more about this. Specifically, what will I do in my setup the day I decide to provide multiple databases. With the "empty database" or "database of databases" solutions, I may need to suddenly introduce a new level of indirection, which breaks e.g. scripts of my users who previously have just connected to my base url and queried structures. Now that will suddenly return an error, and they have to discover that they need to go down one level in the hierarchy.

With @giovannipizzi's solution everyone is forced into a one-level URL indirection that makes it trivial to branch out into multiple databases at a later point. Maybe this forced indirection is actually a very good thing to avoid lots of breakage across database providers, who otherwise are likely to go through an evolution of "oops, now we need another level of indirection here".

sauliusg commented 5 years ago

Dear Rickard,

thank you for your comments!

My short notes to some points below:

On 2018-12-20 02:56, Rickard Armiento wrote:

However, after leaving the call I thought some more about this. Specifically, what will I do in my setup the day I decide to provide multiple databases. With the "empty database" or "database of databases" solutions, I may need to suddenly introduce a new level of indirection, which breaks e.g. scripts of my users who previously have just connected to my base url and queried structures. Now that will suddenly return an error, and they have to discover that they need to go down one level in the hierarchy.

That need not be so:

If the OPTiMaDe API permits a "/databases" endpoint, then any database can provide a list of databases and/or endpoints. So, your original database keeps serving its contents as it used to (so all users' scripts keep working), and starts providing a list of new databases under the "/databases" endpoint (a list that was originally empty), so that users can dynamically discover new databases and endpoints.

Backwards compatibility is very important; nobody wants to change their software each time upstream breaks something, for sure :)

With @giovannipizzi's solution everyone is forced into a one-level URL indirection that makes it trivial to branch out into multiple databases at a later point. Maybe this forced indirection is actually a very good thing to avoid lots of breakage across database providers, who otherwise are likely to go through an evolution of "oops, now we need another level of indirection here".

Forcing people to do something (or, rather, restricting choice without necessity) is a Bad Thing (TM). One day we will run into contradictory requirements and have no good way to address them. Too bad.

Since OPTiMaDe REST describes several "REST endpoints", there is no contradiction in also having a "/databases" endpoint to list all available databases, and an "/endpoints" endpoint to provide the list of REST endpoints in a database; including itself :), the endpoint "/endpoints" (see how nicely recursive and self-describing it is!).

In this case, MaterialsCloud is free to provide a ".../materialscloud/optimade/v1.0/databases" endpoint (EP) to list all databases, in JSON or XML or whatever is in the specification, under a specified schema. If ".../materialscloud/db1/" is one such database, then ".../materialscloud/db1/optimade/info" would be its mandatory info endpoint, and ".../materialscloud/db1/optimade/endpoints" would list all EPs that can be queried; e.g. if ".../materialscloud/db1/optimade/structures" is one such endpoint, I can submit the query ".../materialscloud/db1/optimade/structures?filter=elements="Si,C"+AND+nelements=2" to get all silicon carbide structures, for instance.

COD, however, would probably not have a dedicated "https://crystallography.net/optimade/databases" EP, but instead provide identical OPTiMaDe interfaces for all our databases: "https://crystallography.net/optimade/cod/databases", "https://crystallography.net/optimade/tcod/databases", "https://crystallography.net/optimade/pcod/databases", etc.

All this is fully automatable. It solves interface problems, if not semantic ones, IMHO.


sauliusg commented 5 years ago

On 2018-12-20 02:56, Rickard Armiento wrote:

Maybe this forced indirection is actually a very good thing to avoid lots of breakage across database providers, who otherwise are likely to go through an evolution of "oops, now we need another level of indirection here".

That's why I would advocate for an "/endpoints" endpoint, rather than for "/databases" (although the latter is also useful). You would query "/endpoints" and get a list of all queryable OPTiMaDe endpoints on this site. If you want, you can recursively descend deeper if "/endpoints" is among the endpoints that were returned :)
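
A purely illustrative sketch of that recursive descent, assuming (this is not specified anywhere) that the /endpoints reply is simply a JSON list of absolute URLs:

# Sketch of recursive discovery via a hypothetical /endpoints endpoint: query
# it, and descend into every listed URL that itself answers to /endpoints.
# The reply format (a JSON list of absolute URLs) is an assumption.
import json
from urllib.error import URLError
from urllib.request import urlopen

def discover(base_url: str, depth: int = 2) -> set:
    found = set()
    try:
        with urlopen(base_url.rstrip("/") + "/endpoints") as response:
            listed = json.load(response)
    except URLError:
        return found  # a plain query endpoint, not a database with sub-endpoints
    for url in listed:
        found.add(url)
        if depth > 0:
            found |= discover(url, depth - 1)
    return found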


giovannipizzi commented 5 years ago

@sauliusg the point here is that we also want to do what you do:

provide identical OPTiMaDe interfaces for all our databases:

i.e. we want to provide identical OPTiMaDe interfaces for all our databases:

The point is that you registered each of them already in the specifications:

_cod_: Crystallography Open Database
_pcod_: Predicted Crystallography Open Database
_tcod_: Theoretical Crystallography Open Database

because in your case you have a finite list and you don't expect to create new databases very often, so it's OK to go back and make a change/PR to the specs.

In our case a new DB (think e.g. of a GitHub repo) can be created by a user at any time, and we cannot register each of them in the OPTiMaDe specs every time.

To put this in the context of COD/TCOD/...: would you be OK with registering JUST the following?

_cryst_: databases provided by crystallography.net

and then provide the list of the existing ones within the API you suggest?

When users go directly to the _cryst_ endpoint and find, for instance, a default empty database, I think this would be unexpected. Or you could pick the COD as the 'default' database (and maybe in your case that's OK), but then the PCOD and TCOD databases are very hard to reach if you don't know that they exist and that you have to look for them.

rartino commented 5 years ago

Is there a reason not to close this issue now that a version of what was discussed has been merged?