CatalogueOfLife / general

The Catalogue of Life

Require user accounts for API users #74

Closed mdoering closed 4 years ago

mdoering commented 4 years ago

Currently the CoL API is planned to be mostly public and open to anonymous users. As with GBIF, only bulk downloads of data are to require user accounts.

But should the CoL maybe require user accounts for all API use? This would allow us to track API usage better and provide a simple way of blocking misbehaving users. Many if not most APIs require users to register before they can be used, so it is not exotic.

It also provides us with a list of active users that we can keep in touch with, e.g. about API changes and planned outages.

dremsen commented 4 years ago

I think it’s a good idea. It helps identify what specific user needs are and to refine services to support them. It provides a channel of communication to users who might have business dependencies on COL services in the event upgrades or changes might impact their usage. This line of communication might also be used to solicit agreements to share some of their usage information with GSD providers for their own reporting needs. It would provide the COL Secretariat with more refined usage demographics as well. Lastly, it might support wider use of the API which, in turn, gives us a much clearer overall picture of how the COL is used, as we can add our own specific logging functions into the service methods themselves.

mdoering commented 4 years ago

I would like to raise this question again and address it before we go public with any API. The documentation of the old API suggests that this was already desired to happen in the old codebase: http://www.catalogueoflife.org/webservices/

> When calling our web services, your application must provide a key as part of the service URL. You can obtain a key by filling out the form below. Anyone can make use of the web services provided by the Catalogue of Life.
>
> We need at least the domain from which you will be calling our services. The key will work for the domain you enter and all its subdomains. You can optionally provide us with an email address, so we can inform you about upcoming changes in the API, server malfunctions, etc.

I would support the idea of requiring registered users to access the API. Anyone who thinks differently please raise your voice now!

yroskov commented 4 years ago

I strongly support registration of API users. Registration also means CoL takes on an obligation to provide support to clients.

mdoering commented 4 years ago

@sckott do you see any problem for your client(s) with requiring authentication for all API requests?

ThierryBourgoin commented 4 years ago

I agree, Markus. Yes, we should keep some trace of who is using CoL. Besides giving us control in case of problems, CoL sits in the background, often hidden, of many other global or non-global initiatives and even individual projects, so collecting this information in some way is absolutely necessary. I'm even in favor of adding an optional field (a tick menu and/or free-text field) to document the user's main domain of interest.

dremsen commented 4 years ago

I still support the idea too! API access implies a degree of integration and dependency that should be seen as a business relationship. We should establish this to better support and communicate with users as well as improve our understanding of how the services are used and by whom.


sckott commented 4 years ago

R users are not the most technical, so some users do struggle with API keys etc., but with enough documentation users do figure it out. The main use case with this data will be in the taxize package, where there's a handful of other data sources that require keys (EOL used to, IUCN Red List, Tropicos), so most users should be familiar with the concept at least. Might there be a way to programmatically get authentication tokens/keys? Or will users need to use a browser flow?

I imagine this will make rate limiting easier if you plan to do it - will there be response headers that will give rate limit information?

mjy commented 4 years ago

I love the idea of programmatically grabbing a key via a simple mechanism and building this into wrapping software like taxize.

The barrier to getting a key should be SUPER low. I just did this for Dropbox, for example, and it was pretty painless.

mdoering commented 4 years ago

The current implementation requires a GBIF account which is shared to also log into CoL. And authentication then is either basic auth over SSL or via a JSON Web Token you can obtain programmatically using the GBIF creds. Either of them requires header based authentication. I don't think you can create a GBIF account purely programmatically at this stage.

Rate limiting is not on the agenda for now, but this obviously helps in case we would need to.
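
For client authors, the two header-based options described above could be sketched roughly as follows. This is only an illustration: the use of the `Bearer` scheme for the JWT is an assumption not confirmed in this thread, and the credentials are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header from account credentials.

    Basic auth only encodes the credentials, it does not encrypt them,
    which is why it should only ever be sent over SSL/TLS.
    """
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def bearer_auth_header(jwt: str) -> dict:
    """Build an Authorization header carrying a previously obtained JWT.

    Assumption: the server accepts the common "Bearer <token>" scheme.
    """
    return {"Authorization": f"Bearer {jwt}"}


# Placeholder credentials, for illustration only.
headers = basic_auth_header("user", "pass")
```

A client would attach one of these dictionaries to each request; the JWT variant avoids sending the account password on every call.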

mdoering commented 4 years ago

It would be easy to alternatively allow the JWT to be given as a query parameter if that helps. In general these tokens should be transient; we currently expire them after a week.

sckott commented 4 years ago

From a security perspective that's good news about the JWT expiring after 1 week. I'll make it as easy as possible for my users to refresh their JWT.
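
One way a client library might decide when to refresh is to read the expiry time from the token itself. The sketch below assumes the token carries the standard `exp` claim (seconds since the epoch) defined for JWTs; whether the CoL tokens actually do so is not stated in this thread. It decodes the payload without verifying the signature, which is enough for a client-side "should I refresh?" check but never for server-side trust.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def needs_refresh(token: str, margin_seconds: int = 3600,
                  now: Optional[float] = None) -> bool:
    """Return True if the token expires within margin_seconds.

    A token without an "exp" claim is treated as needing a refresh,
    which is a conservative choice made for this sketch.
    """
    exp = jwt_claims(token).get("exp")
    if exp is None:
        return True
    current = time.time() if now is None else now
    return exp - current <= margin_seconds
```

A wrapper such as taxize could call `needs_refresh` before each request and fetch a new token only when it returns `True`.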

mdoering commented 4 years ago

I just added the option to also pass the JWT via a ?token=XYZ query parameter.
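
For tools that cannot set request headers, appending the token to a URL might look like the sketch below. The host and path are placeholders rather than real CoL endpoints; only the `token` parameter name comes from the comment above. Keep in mind that tokens placed in URLs tend to end up in server logs and browser history.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_token(url: str, token: str) -> str:
    """Return url with a token=<JWT> query parameter added.

    Existing query parameters are kept; any previous token value is replaced.
    """
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "token"]
    query.append(("token", token))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL, not a real CoL endpoint.
url = with_token("https://api.example.org/name/search?q=Abies", "XYZ")
```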

timrobertson100 commented 4 years ago

I think we might need to consider the types of users of the API and the implications of both Auth and time-bound token renewal. Some scenarios to consider:

  1. A website/tool (e.g. google refine, sheets etc) using e.g. a COL+ autocomplete service for entering scientific names
  2. A researcher scripting a one-time analysis using R
  3. An organisation running batch jobs to update keys in their database

For the case of 1. it means the user of the tool either needs to authenticate (that may not be possible depending on the tool), or users all use the same token (so can't expire), or there needs to be something like a proxy service. Running a proxy service would circumvent the goal, by exposing the API without Auth.

For the case of 2., it's reasonable to ask users to provide a key, which may even expire, as long as renewing it is easy.

For the case of 3. time expiring tickets will require some kind of ticket renewing service which is an unnecessary technical hurdle and less secure (the ticket renewing service needs the account credentials rather than just a token which can be revoked). I'd advise non-expiring keys for this scenario.

I don't envisage GBIF restricting access to its API for species browse / search / autocomplete etc.

I need to give this more thought, but how would Auth work in a linked open data world?

How will Ajax calls work for users on the COL+ website? Will they need to log in, or would you provide time-bound tokens (which has caching implications)?

MortenHofft commented 4 years ago

I don't know anything about the need or desire to require authentication. But if you decide to do so, then how about separating:

App key: GBIF would have one or more app keys when using COL. Anyone else building a public project should also use an app key. The keys will in many cases be public (e.g. in JS code) and likely used by many individuals. They could be tied to a domain. The app key could still be a JWT without an expiration date, but could encode domain, project name etc. Who knows, there might even be more options on some keys (say GSDs are allowed to do expensive metric queries).

Authentication token (e.g. personal JWT, basic auth): Used to authenticate me as a user and should not be made public, as it allows others to edit and access my account info. This is fine for running scripts.
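
To make the app key idea concrete, here is a minimal sketch of a non-expiring, HMAC-signed JWT that encodes a domain and project name. The claim names `domain` and `project`, the HS256 choice, and the subdomain matching rule are all assumptions made for illustration, not something the CoL API is known to implement.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used for JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def issue_app_key(secret: bytes, domain: str, project: str) -> str:
    """Create an HS256-signed JWT with no "exp" claim (it never expires)."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"domain": domain, "project": project}  # hypothetical claim names
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_app_key(secret: bytes, key: str, request_domain: str) -> bool:
    """Check the signature, then check that the caller's domain is covered."""
    try:
        signing_input, signature = key.rsplit(".", 1)
        payload = signing_input.split(".")[1]
    except (ValueError, IndexError):
        return False  # not in header.payload.signature form
    expected = _b64url(
        hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    )
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    allowed = claims.get("domain")
    if not allowed:
        return False
    # The key covers the registered domain and all of its subdomains.
    return request_domain == allowed or request_domain.endswith("." + allowed)
```

Because such a key is public and never expires, revoking it would need a server-side deny list, which this sketch leaves out.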

MortenHofft commented 4 years ago

GitHub strikes a balance by allowing anonymous API calls, but throttles them and has time window limits on usage. If you log in, you are allowed more nested queries, more queries in a row etc.
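
A tiered scheme of this kind could be sketched as a fixed-window counter with a lower quota for anonymous callers. The quotas, window length, and `X-RateLimit-*` header names below are illustrative assumptions; they follow a widely used convention and are not GitHub's or CoL's actual values.

```python
import time
from typing import Dict, Optional, Tuple

# Hypothetical quotas per window, chosen only for illustration.
LIMITS = {"anonymous": 100, "registered": 1000}
WINDOW_SECONDS = 3600


class FixedWindowLimiter:
    """Count requests per client within fixed time windows.

    Counters for old windows are never purged here; a real service
    would expire them or use a shared store.
    """

    def __init__(self) -> None:
        self._counts: Dict[Tuple[str, int], int] = {}

    def check(self, client_id: str, registered: bool,
              now: Optional[float] = None) -> Tuple[bool, Dict[str, str]]:
        """Return (allowed, response headers) for one incoming request."""
        now = time.time() if now is None else now
        window = int(now // WINDOW_SECONDS)
        limit = LIMITS["registered" if registered else "anonymous"]
        used = self._counts.get((client_id, window), 0)
        allowed = used < limit
        if allowed:
            used += 1
            self._counts[(client_id, window)] = used
        headers = {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": str(limit - used),
            "X-RateLimit-Reset": str((window + 1) * WINDOW_SECONDS),
        }
        return allowed, headers
```

Anonymous callers would typically be keyed by IP address and registered callers by account or token, so registering raises the ceiling without closing the API to anyone.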

MattBlissett commented 4 years ago

The only time authentication is required by GBIF is for requesting a download.

Otherwise, it's very convenient to be able to query GBIF without any authentication: http://api.gbif.org/v1/species/104070564 — or iNaturalist: http://api.inaturalist.org/v1/observations/9448606 — or ZooBank: http://zoobank.org/NomenclaturalActs.json/6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2

People use these APIs for R/Python/etc notebooks, with OpenRefine, on many biodiversity websites, and in training courses. The barrier to entry is extremely low.

We don't measure our success by the usage of the API (last 24 hours: 40 R users, 15 Python, 17 Ruby, plus some Java and obvious website integrations like Drupal, PHP). I'm sure registration would allow for more detailed statistics — but are they really much more valuable than what can be done based on IP address and user agent alone? And are they relevant, compared with measuring scientific results, like citations?

> Many if not most APIs require users to register before they can be used, so it is not exotic.

Most APIs are run as commercial services. Biodiversity APIs generally seem to be available without registration (GBIF, iNaturalist, IPNI [and most of Kew], ZooBank). Public sector data for public transport or weather is also available without registration. I don't think registration sends a good message — it's like "free" software that requires a login to download, followed by unsubscribing from email.

> Github strikes a balance by allowing anonymous API calls, but throttles them and has time window limits on usage. If you login, then you are allowed more nested queries, more queries in a row etc.

That seems reasonable, although it's more work to implement.

> It helps identify what specific user needs are and to refine services to support them. It provides a channel of communication to users who might have business dependencies on COL services in the event upgrades or changes might impact their usage.

There are plenty of other channels that work here: mailing lists, Twitter, a status API, email.

> Lastly, it might support wider use of the API which

I really don't see how requiring registration would encourage use. Quite the opposite, I think.

I think it's essential to get feedback on this from typical users, and not to decide based on the few developers etc. who see this issue.

mjy commented 4 years ago

On Thu, Apr 16, 2020 at 6:22 AM Matt Blissett wrote:

> People use these APIs for R/Python/etc notebooks, with OpenRefine, on many biodiversity websites, and in training courses. The barrier to entry is extremely low.

IMO this should be the central "dogma". I feel it is extremely important to err on the side of using the data as opposed to, say, citing the data, or worrying about abusers. With regard to the former: proper citation is something that needs to be taught to students by teachers, not something to put up barriers to (they won't work, btw). You cannot enforce citation, you can only teach it. With regard to the latter, the key problem is DDOS (maybe?), and if that can be mitigated on the back end, when it is rare, then perhaps you have no reason for other types of authentication at all. Furthermore, I suspect some people need to be able to examine data anonymously. How comfortable would you be if you could be traced while doing a sensitive analysis prior to publication (and is this even legal on a global scale)?

In other words, I'd propose to keep it simple, make tech work for you (DDOS detection, simple APIs), and teach your students.

mdoering commented 4 years ago

Contrary to the GBIF API many parts in the CoL API are actually writing data and these definitely need proper authorization. So for some purposes people will definitely have to register.

But for searching/matching/resolving species names or other read operations, registration is obviously not strictly needed. Providing a simple app token as a query parameter should be a really low barrier, though, for anyone in any language or tool.

Knowing your users better and being able to contact, throttle or block them in case of mis- or too heavy use is useful. Malicious users would probably just register new accounts, so blocking by IP ranges or other means is likely also needed in such cases, so registration will not solve this alone.

The issue mostly came up because:

  a) the CoL has very little knowledge about its users and hardly any knowledge about its current API users.
  b) data providers for the CoL frequently asked for better usage reports that need to include API usage, not just analytics for the website.
  c) we need and already have user registration and authorization for writes.

IUCN, BHL, ARKive, GeoNames and Artdatabanken are examples of biodiversity APIs that require registration. It is not uncommon.

Quite a few people (also outside this thread) seem to fear we increase complexity and lose users by requiring registration. I don't really share that opinion, but I guess we can always change our policy later and start out with a truly open API initially.

orrellt commented 4 years ago

Hey Markus - and all. I was one of those that commented outside this thread.
I hadn't considered API services where someone would be "actually writing data", which "definitely need proper authorization."

I agree for someone that is writing data back into CoL+, registration would be necessary.

mdoering commented 4 years ago

I have collected a few more arguments why basic reads should be open and accessible without any authentication:

Together with the fact that it is always simpler to use an API without bothering about authentication, we should not mandate user registration to use the API. Only certain advanced aspects like writing to the API and bulk downloads will require user accounts.
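
The resulting policy, open reads with accounts required only for writes and bulk downloads, could be expressed roughly as below. The `/export` path prefix is a hypothetical stand-in for wherever bulk downloads live; the real CoL routes may differ.

```python
# HTTP methods that only read data and can stay open to anonymous users.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

# Hypothetical path prefix for bulk downloads; the real CoL routes may differ.
AUTH_ONLY_PREFIXES = ("/export",)


def requires_account(method: str, path: str) -> bool:
    """Return True if a request should be rejected without a user account."""
    if method.upper() not in SAFE_METHODS:
        return True  # anything that writes data needs authorization
    # Reads are open, except for bulk download routes.
    return any(path == p or path.startswith(p + "/") for p in AUTH_ONLY_PREFIXES)
```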