gbif / gbif-api

GBIF API
Apache License 2.0

Registry API dataset endpoint has degraded pagination performance #118

Closed nickynicolson closed 5 months ago

nickynicolson commented 5 months ago

I need to access metadata for all datasets through the registry API, so I have to page through the data at most 1000 records at a time. Requests at high offsets seem to be much slower than the first page (each timing below is the total in seconds for two requests):

>>> import requests, timeit
>>> timeit.timeit('_ = requests.get("https://api.gbif.org/v1/dataset?limit=1000&offset=0")', 'import requests', number=2)
9.584714983124286
>>> timeit.timeit('_ = requests.get("https://api.gbif.org/v1/dataset?limit=1000&offset=50000")', 'import requests', number=2)
243.2884748019278
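
To see where the slowdown begins, it may help to time every page rather than two sample offsets. The following is only a sketch: it uses the standard library instead of requests so it has no dependencies, it assumes the paged response carries `results` and `endOfRecords` fields, and the names `fetch_page` and `timed_pages` are made up for illustration.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Tuple

API_URL = "https://api.gbif.org/v1/dataset"  # endpoint used in the timings above


def fetch_page(offset: int, limit: int = 1000) -> dict:
    """Fetch one page of dataset metadata and return the parsed JSON body."""
    query = urllib.parse.urlencode({"limit": limit, "offset": offset})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=600) as response:
        return json.load(response)


def timed_pages(
    fetch: Callable[[int, int], dict] = fetch_page,
    limit: int = 1000,
) -> Iterator[Tuple[int, float, list]]:
    """Yield (offset, seconds, results) for each page, in offset order."""
    offset = 0
    while True:
        start = time.perf_counter()
        page = fetch(offset, limit)
        elapsed = time.perf_counter() - start
        results = page.get("results", [])
        yield offset, elapsed, results
        # Assumption: the API sets endOfRecords on the final page. Stop on a
        # missing flag or an empty page too, so the loop cannot run forever.
        if page.get("endOfRecords", True) or not results:
            break
        offset += limit
```

Looping with `for offset, seconds, results in timed_pages(): print(offset, round(seconds, 2), len(results))` would print one timing per page and make any trend with offset visible.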
mdoering commented 5 months ago

You might be better off with the CSV export routine for all datasets: https://techdocs.gbif.org/en/openapi/v1/registry#/Datasets/searchDatasetsExport
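
A rough sketch of what using the export could look like from Python. The export URL and the `format` parameter below are assumptions and should be checked against the documentation linked above; `download_export` and `parse_export` are hypothetical helper names, and the column names in the export are not assumed here.

```python
import csv
import io
import urllib.parse
import urllib.request
from typing import Dict, List

# Assumption: path and `format` parameter of the export endpoint.
# Verify both against the linked API documentation before relying on them.
EXPORT_URL = "https://api.gbif.org/v1/dataset/search/export"


def download_export(fmt: str = "CSV", timeout: float = 600) -> str:
    """Download the dataset export in one request and return it as text."""
    query = urllib.parse.urlencode({"format": fmt})
    with urllib.request.urlopen(f"{EXPORT_URL}?{query}", timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_export(text: str) -> List[Dict[str, str]]:
    """Parse CSV text into one dict per dataset, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))
```

Calling `rows = parse_export(download_export())` would then give one dict per dataset without any paging, provided the export contains the metadata fields needed.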

MattBlissett commented 5 months ago

It's currently fine:

for o in `seq 0 1000 92199`; do echo -n $o ' ' && time curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null; done
0  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.04s system 2% cpu 6.641 total
1000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.02s system 2% cpu 6.375 total
2000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 1% cpu 6.628 total
3000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.03s system 2% cpu 6.660 total
4000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.01s system 1% cpu 6.505 total
5000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.03s system 2% cpu 6.414 total
6000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.04s system 0% cpu 13.951 total
7000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.237 total
8000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 2% cpu 5.933 total
9000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.01s system 1% cpu 6.092 total
10000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 2% cpu 6.038 total
11000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 2% cpu 6.704 total
12000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.05s system 2% cpu 6.826 total
13000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 2% cpu 6.416 total
14000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 2% cpu 6.207 total
15000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 2% cpu 6.463 total
16000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 0% cpu 13.268 total
17000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.03s system 1% cpu 6.818 total
18000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 1% cpu 6.701 total
19000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.611 total
20000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 2% cpu 6.636 total
21000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.04s system 1% cpu 7.097 total
22000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.01s system 1% cpu 6.517 total
23000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.03s system 2% cpu 6.735 total
24000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 1% cpu 6.420 total
25000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.04s system 1% cpu 14.944 total
26000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.04s system 1% cpu 6.783 total
27000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.01s system 1% cpu 6.766 total
28000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.426 total
29000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.626 total
30000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 1% cpu 6.672 total
31000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 1% cpu 6.645 total
32000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 1% cpu 9.194 total
33000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.769 total
34000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.03s system 1% cpu 10.996 total
35000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 1% cpu 6.696 total
36000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.588 total
37000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.03s system 1% cpu 6.877 total
38000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.02s system 2% cpu 6.648 total
39000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.17s user 0.05s system 2% cpu 8.362 total
40000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.00s system 1% cpu 6.384 total
41000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 2% cpu 6.628 total
42000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 1% cpu 6.457 total
43000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 0% cpu 14.244 total
44000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 1% cpu 7.449 total
45000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 1% cpu 6.220 total
46000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.02s system 2% cpu 6.310 total
47000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 1% cpu 7.631 total
48000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 2% cpu 6.442 total
49000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.02s system 2% cpu 6.486 total
50000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 2% cpu 6.544 total
51000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 1% cpu 6.849 total
52000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.15s user 0.03s system 1% cpu 14.946 total
53000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 7.081 total
54000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.04s system 1% cpu 8.237 total
55000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 7.450 total
56000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 1% cpu 6.685 total
57000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.02s system 2% cpu 6.606 total
58000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.378 total
59000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.08s user 0.02s system 1% cpu 6.226 total
60000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 2% cpu 6.165 total
61000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.04s system 1% cpu 10.039 total
62000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 7.063 total
63000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.40s user 0.14s system 4% cpu 13.401 total
64000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 1% cpu 6.881 total
65000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.08s user 0.03s system 1% cpu 6.454 total
66000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.02s system 1% cpu 6.452 total
67000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 1% cpu 6.663 total
68000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.446 total
69000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.03s system 0% cpu 14.656 total
70000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.01s system 1% cpu 6.750 total
71000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.409 total
72000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.02s system 1% cpu 6.173 total
73000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 5.992 total
74000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 2% cpu 6.153 total
75000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.01s system 1% cpu 6.474 total
76000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.02s system 1% cpu 6.228 total
77000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 1% cpu 6.116 total
78000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.02s system 0% cpu 13.136 total
79000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.02s system 1% cpu 7.507 total
80000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 1% cpu 6.752 total
81000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.01s system 1% cpu 6.215 total
82000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 1% cpu 6.682 total
83000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.26s user 0.09s system 3% cpu 10.529 total
84000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.03s system 1% cpu 6.195 total
85000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.01s system 1% cpu 6.590 total
86000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 2% cpu 6.128 total
87000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.12s user 0.02s system 0% cpu 13.680 total
88000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 2% cpu 5.970 total
89000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.11s user 0.02s system 2% cpu 5.887 total
90000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.09s user 0.02s system 2% cpu 5.550 total
91000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.10s user 0.03s system 2% cpu 6.027 total
92000  curl -Ss 'https://api.gbif.org/v1/dataset?limit=1000&offset='$o > /dev/null  0.08s user 0.01s system 5% cpu 1.796 total

but maybe we should change our monitoring for this query to look at a high offset, rather than the first page.
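
For illustration only, a check of that kind could look like the sketch below. It is not the actual GBIF monitoring. It assumes the paged response includes a `count` field, it picks the last full page so the check still returns 1000 records (the final partial page at offset 92000 above came back much faster than the rest), and the 30 second threshold and the function names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Tuple

API_URL = "https://api.gbif.org/v1/dataset"  # endpoint timed above


def fetch_page(offset: int, limit: int = 1000) -> dict:
    """Fetch one page of dataset metadata and return the parsed JSON body."""
    query = urllib.parse.urlencode({"limit": limit, "offset": offset})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=600) as response:
        return json.load(response)


def high_offset(count: int, limit: int = 1000) -> int:
    """Offset of the last full page, so the request still returns `limit` records."""
    return max(0, count - limit)


def check_high_offset(
    fetch: Callable[[int, int], dict] = fetch_page,
    limit: int = 1000,
    max_seconds: float = 30.0,  # hypothetical alert threshold
) -> Tuple[int, float, bool]:
    """Time one request at a high offset; return (offset, seconds, within_threshold)."""
    count = fetch(0, 1)["count"]  # small request, only to read the total (assumed field)
    offset = high_offset(count, limit)
    start = time.perf_counter()
    fetch(offset, limit)
    elapsed = time.perf_counter() - start
    return offset, elapsed, elapsed <= max_seconds
```

Reading the total from the API on each run keeps the check at the far end of the listing as the number of datasets grows, instead of hard-coding an offset that slowly becomes a middle page.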

nickynicolson commented 5 months ago

"but maybe we should change our monitoring for this query to look at a high offset, rather than the first page."

Yes, that sounds sensible.

MattBlissett commented 5 months ago

Done (in a private repository).