CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

api login & download #1075

Closed — dgasl closed this issue 2 years ago

dgasl commented 2 years ago

I can log in to https://data.catalogueoflife.org/ with my GBIF credentials and then download a given dataset. But how can I programmatically log in to the API to do the same?

I saw a user/login request but it takes no parameters.

Also, I am not sure whether the export options are working. I tried both the GET and POST export methods using the API web interface, but they failed.

GET: https://api.catalogueoflife.org/dataset/3/export (I just passed the mandatory dataset key through the input box and left the other default options untouched). Error: TypeError: Failed to execute 'fetch' on 'Window': Request with GET/HEAD method cannot have body.

POST: https://api.catalogueoflife.org/dataset/3/export I edited the body to match the dataset key (odd to me that it is passed in both the body and the URL). REQUEST BODY:

  "datasetKey": 3,
  "format": "DWCA",
  "excel": true,
  "root": {
    "id": "string",
    "name": "string",
    "authorship": "string",
    "phrase": "string",
    "rank": "DOMAIN",
    "code": "BACTERIAL",
    "status": "ACCEPTED",
    "parent": "string",
    "label": "string",
    "labelHtml": "string"
  },
  "synonyms": true,
  "bareNames": true,
  "minRank": "DOMAIN",
  "force": true
} 

Error: response status is 400 RESPONSE BODY:

{
  "code": 400,
  "message": "Unable to process JSON"
}
mdoering commented 2 years ago

You can authenticate either with Basic Auth or JSON Web Token (JWT). If you inspect our UI calls, you will see it uses JWT. To get a token you have to use the /login route of the API with BasicAuth. Using BasicAuth for every request is probably the simplest. Just make sure to use https.
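For illustration, here is a minimal Python sketch of the two options, using only the standard library. The header construction is standard HTTP; the exact login route (/user/login, as mentioned above) and the shape of its response are assumptions, so inspect the UI's network calls to confirm them.

```python
# Minimal sketch of the two auth options (Python standard library only).
# The login route and its response shape are assumptions; check the
# UI's network calls to confirm them.
import base64

API = "https://api.catalogueoflife.org"

def basic_auth_header(username: str, password: str) -> dict:
    """Basic Auth header, usable on every request (over https only)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def bearer_header(jwt: str) -> dict:
    """JWT header as used by the UI; the token comes from the login
    route (assumed to be GET {API}/user/login with Basic Auth)."""
    return {"Authorization": f"Bearer {jwt}"}
```

With the requests library this reduces to `requests.get(url, auth=(user, pw))`, which builds the same Basic header internally.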

A POST against https://api.catalogueoflife.org/dataset/{DATASET_KEY}/export is indeed the way to request a new export. The ExportRequest object you are submitting does not need to include the datasetKey, as it is given in the URL already. In your example above the JSON is not valid; it is missing the opening bracket. It also makes no sense to post the default placeholder values such as "string". Here is a simple request using curl:

curl -s --user username:password -X POST -H "Content-Type: application/json" --data-binary @request.json "https://api.catalogueoflife.org/dataset/2349/export"

The file request.json could look like this to download Chordata (ID=CH2) from the latest release of the COL checklist (datasetKey= 2349) in ColDP:

{
  "format": "COLDP",
  "root":  {
    "id": "CH2"
  }
}
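The same POST can be built in Python as well; here is a sketch using only the standard library (the credentials are placeholders), constructing the request object first so it can be inspected before sending. With the requests library the equivalent one-liner is `requests.post(url, json=payload, auth=("username", "password"))`.

```python
# Build the export POST from the curl example above using the Python
# standard library. Username/password are placeholders; send the
# request with urllib.request.urlopen(req) when ready.
import base64
import json
import urllib.request

payload = {"format": "COLDP", "root": {"id": "CH2"}}
auth = base64.b64encode(b"username:password").decode()  # placeholder credentials

req = urllib.request.Request(
    "https://api.catalogueoflife.org/dataset/2349/export",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {auth}",
    },
    method="POST",
)
```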

Instead of supplying the explicit datasetKey for the latest release, you should be able to use 3LR instead, which means the LatestRelease (LR) of dataset 3. Dataset 3 is the master project for COL, but it uses temporary keys and might not always be in a consistent state; it is the working draft. There seems to be something wrong when using 3LR in export POST requests, though: I get a 404. It works fine with other requests, e.g. https://api.catalogueoflife.org/dataset/3LR/taxon/CH2

The export request returns the key of the download that is being prepared. It is asynchronous, so you need to wait for the email or poll the API to see if your download is ready, for example:

To download (it will use a redirect, hence the -L flag): curl -s -L -o download.zip "https://api.catalogueoflife.org/export/5daea8ca-14aa-4ee6-b2ee-8be36653d239"

This returns your request metadata: curl -s -H "Accept: application/json" "https://api.catalogueoflife.org/export/5daea8ca-14aa-4ee6-b2ee-8be36653d239"
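Automating the wait could look like the sketch below: poll the export metadata with exponential backoff until it reports completion. The metadata field and value used to detect completion ("status" == "finished") are assumptions, so check the JSON your own export request actually returns; the fetch function is injected so the loop is easy to test offline.

```python
# Polling sketch: poll /export/{key} until the archive is ready.
# The "status"/"finished" field and value are ASSUMED; inspect the
# metadata JSON returned for your own export to confirm.
import time

API = "https://api.catalogueoflife.org"

def poll_delays(first=30, factor=2, cap=300):
    """Exponential backoff schedule in seconds: 30, 60, 120, 240, 300, ..."""
    delay = first
    while True:
        yield delay
        delay = min(delay * factor, cap)

def wait_for_export(key, fetch, delays=None):
    """Call fetch(url) (which should return the metadata dict) until
    the export looks finished, sleeping between attempts."""
    delays = delays if delays is not None else poll_delays()
    url = f"{API}/export/{key}"
    for delay in delays:
        meta = fetch(url)
        if meta.get("status") == "finished":  # assumed field/value
            return meta
        time.sleep(delay)
```

`fetch` could be a small wrapper around urllib or requests that GETs the URL with an `Accept: application/json` header, as in the curl metadata example above.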

dgasl commented 2 years ago

Thanks for your prompt answer @mdoering (I guess #1004 is related to this)

You can authenticate either with Basic Auth or JSON Web Token (JWT).

I have only used the GBIF API in the past (unauthenticated), so that was not trivial to me. I couldn't find any links to login information in the COL API web interface. Would this be the appropriate documentation to look at? https://swagger.io/docs/specification/authentication/basic-authentication/

A POST against https://api.catalogueoflife.org/dataset/{DATASET_KEY}/export is indeed the way to request a new export.

But the API shows a GET method, too, which confused me.

The ExportRequest object you are submitting does not need to include the datasetKey, as it is given in the URL already. In your example above the JSON is not valid; it is missing the opening bracket. It also makes no sense to post the default placeholder values such as "string".

Understood. I suggest changing the example a bit, so it shows some default values that work when the user presses "execute". Why does the example request body show datasetKey=0 if it has to be passed in the URL?

I used the API web interface (so the opening bracket was in the request; I must have made a copy/paste mistake when I wrote the issue). But I still think something is wrong with the API web interface:

I first used the login button (the open padlock icon became closed):

[screenshot]

Then I tried your example request:

[screenshot]

But I got a 400 response:

[screenshot]

Should I stop using the API web interface and try your curl examples directly instead? I'd prefer to use the Python requests library (EDIT: its documentation and this example using sessions show how to do it).

How long do you recommend making the script wait before first trying to download the export? How long between further tries? Will these download files be deleted soon?

I wonder if for full datasets (no root taxon) you have stable download urls.

Thanks a lot

mdoering commented 2 years ago

Hi, the API documentation is very basic and 100% auto-generated at this stage. We are fully aware this is not enough to clearly document its use, but as the API is still changing in some areas and we have very limited resources, we need to postpone better documentation for now. Please enroll in our user mailing list, which is the best place to ask for help with API use.

Currently we still keep all export files, but this cannot go on forever. I would imagine that in the future we will remove downloads after a month by default, but we will provide a way to request longer archival, together with a DOI for that file.

For full dataset downloads there is indeed a simpler way to request a download: the GET request. For immutable COL checklist releases these are pre-prepared and readily available. For other datasets this is not the case.

The optional query parameter format=COLDP indicates the format to be returned. If no format is given, this will be the original archive used for importing, as it was submitted. This does not exist for COL, obviously, as it is assembled. But for external datasets like FishBase you can request the original data file that was successfully imported.
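The description above maps to URLs like the following; a small Python sketch (dataset key 2349 is the COL release used earlier in this thread):

```python
# Build the simple GET download URL, with an optional format parameter
# as described above (no format => the original imported archive,
# where one exists).
from urllib.parse import urlencode

API = "https://api.catalogueoflife.org"

def export_url(dataset_key, fmt=None):
    """GET URL for a full-dataset download."""
    url = f"{API}/dataset/{dataset_key}/export"
    if fmt:
        url += "?" + urlencode({"format": fmt})
    return url
```

Fetching the resulting URL (following redirects, as with curl's -L flag) then downloads the prepared archive.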

mdoering commented 2 years ago

And yes, please use curl or Python requests for your trials. The Swagger interface is nice for playing with simple GET requests, but not really for POSTs and authentication.

dgasl commented 2 years ago

Please enroll in our user mailinglist which is the best place to ask for help on API use.

Thanks a lot @mdoering ... I did so (yesterday, before opening the issue). I still haven't received an answer email, but I thought it would be OK to use GitHub anyway, so other users looking for the same info could find it here as well.

For full dataset downloads there is indeed a simpler way to request a download which is the GET request. For immutable COL checklist releases these are preprepared and readily available. For other datasets this is not the case.

Good to know. Since you mentioned the different formats available (when using POST) ... which default format would be returned when using a GET dataset/{id}/export request?

If no format is given this will be the original archive used for importing as it was submitted.

When using the dataset download web interface, I see some datasets (e.g. FishBase, which you mentioned) offer two different "Prepared downloads" links ("original archive" / "external source archive"). What is the difference between them, and what is the API way to request one or the other?

I'm also wondering ... is there a way to request a specific version of a dataset? (e.g., to compare what has changed between two releases). Following the FishBase example "about" page:

Created: November 20th 2019, 11:07:46 am by markus
Modified: July 27th 2021, 9:37:22 am by markus
mdoering commented 2 years ago

We only keep the latest version, so you cannot request older versions. Releases of the COL checklist are kept as different datasets, so for those you actually can.

dgasl commented 2 years ago

Thanks for the info about versions.

Can you tell me about the formats? (I still can't ask on the mailing list: I subscribed twice, but for some reason I am not getting any email answers. I guess somebody has to read and approve requests after the weekend.)

mdoering commented 2 years ago

If no format is given, this will be the original archive used for importing, as it was provided by the source.