gbif / occurrence

Occurrence store, download, search
Apache License 2.0

Adapt the Occurrence API to work with multiple taxonomies #342

Open fmendezh opened 8 months ago

fmendezh commented 8 months ago

Draft proposal of changes to the Occurrence API to support: filter by checklist, add additional ranks, full classification, checklists dataset key, data types changes.

Decide if this change can be applied to API v1 or if a V2 is needed.

djtfmartin commented 2 months ago

Draft proposal for Occurrence API changes

Below is a set of additional web services and additional parameter options for existing web services. These changes are intended as additions to the existing v1 API and are backwards compatible, i.e. there are no breaking changes: they add to the output format but leave existing fields and nested structures in place.


Search response - including multiple classifications

The search response includes a classifications array, which contains 0..n classifications associated with the occurrence record. Example JSON below (shortened for brevity). The existing gbifClassification will remain in place with integer keys.

{
    "offset": 0,
    "limit": 20,
    "endOfRecords": false,
    "count": 412833,
    "results": [
        {
            "key": 462028,
            "datasetKey": "9bd520e3-00fa-4955-a554-924ea440862c",
            "publishingOrgKey": "d2b97690-bfd6-11de-b279-d52977ace833",
            "installationKey": "99672740-f762-11e1-a439-00145eb45e9a",
            "hostingOrganizationKey": "d2b97690-bfd6-11de-b279-d52977ace833",
            "publishingCountry": "IE",
            "protocol": "DWC_ARCHIVE",
            "lastCrawled": "2024-09-05T18:36:01.493+00:00",
            "lastParsed": "2024-09-12T14:10:38.809+00:00",
            "crawlId": 176,
            "extensions": {},
            "basisOfRecord": "HUMAN_OBSERVATION",
            "occurrenceStatus": "PRESENT",
            "sex": "MALE",
            "lifeStage": "Adult",
            "classifications": [
                {
                    "datasetKey": "7ddf754f-d193-4cc9-b351-99906754a03b",
                    "usage": {
                        "key": "8C2QW",
                        "name": "Episyrphus (Episyrphus) balteatus (De Geer, 1776)",
                        "rank": "SPECIES"
                    },
                    "acceptedUsage": {
                        "key": "8C2QW",
                        "name": "Episyrphus (Episyrphus) balteatus (De Geer, 1776)",
                        "rank": "SPECIES"
                    },
                    "classification": [
                        {
                            "key": "RT",
                            "name": "Arthropoda",
                            "rank": "PHYLUM"
                        },
                        {
                            "key": "CHP6G",
                            "name": "Hexapoda",
                            "rank": "SUBPHYLUM"
                        },
                        {
                            "key": "D2P",
                            "name": "Diptera",
                            "rank": "ORDER"
                        },
                        {
                            "key": "BXZTG",
                            "name": "Episyrphus",
                            "rank": "SUBGENUS"
                        },
                        {
                            "key": "BXZTD",
                            "name": "Episyrphus",
                            "rank": "GENUS"
                        },
                        {
                            "key": "B7XFC",
                            "name": "Syrphini",
                            "rank": "TRIBE"
                        },
                        {
                            "key": "8C2QW",
                            "name": "Episyrphus balteatus",
                            "rank": "SPECIES"
                        },
                        {
                            "key": "N",
                            "name": "Animalia",
                            "rank": "KINGDOM"
                        },
                        {
                            "key": "5T6MX",
                            "name": "Biota",
                            "rank": "UNRANKED"
                        },
                        {
                            "key": "H6",
                            "name": "Insecta",
                            "rank": "CLASS"
                        },
                        {
                            "key": "9H6NG",
                            "name": "Syrphinae",
                            "rank": "SUBFAMILY"
                        },
                        {
                            "key": "GVS",
                            "name": "Syrphidae",
                            "rank": "FAMILY"
                        }
                    ]
                }
            ],
            "type": "Occurrence"
        }
    ],
    "facets": []
}
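For client code, one way to read the classifications array is to pick the entry for a given checklist by its datasetKey and then look taxa up by rank. The following is only a sketch against the example response above: the helper names are made up for illustration, and it looks up by rank because the classification array in the example is not listed in hierarchical order.

```python
from typing import Optional


def classification_for(record: dict, checklist_key: str) -> Optional[dict]:
    """Return the classifications entry whose datasetKey matches, or None."""
    for entry in record.get("classifications", []):
        if entry.get("datasetKey") == checklist_key:
            return entry
    return None


def names_by_rank(entry: dict) -> dict:
    """Map rank -> name so callers do not depend on array order."""
    return {taxon["rank"]: taxon["name"] for taxon in entry.get("classification", [])}


# Trimmed record shaped like the example response (values copied from it).
record = {
    "key": 462028,
    "classifications": [
        {
            "datasetKey": "7ddf754f-d193-4cc9-b351-99906754a03b",
            "usage": {"key": "8C2QW", "rank": "SPECIES"},
            "classification": [
                {"key": "RT", "name": "Arthropoda", "rank": "PHYLUM"},
                {"key": "GVS", "name": "Syrphidae", "rank": "FAMILY"},
                {"key": "N", "name": "Animalia", "rank": "KINGDOM"},
            ],
        }
    ],
}

col = classification_for(record, "7ddf754f-d193-4cc9-b351-99906754a03b")
ranks = names_by_rank(col) if col else {}
print(ranks.get("FAMILY"))
```

A record with no match for the requested checklist simply yields None, which fits the 0..n wording above.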

Searches with checklistKey

Searches with the new request parameter checklistKey will allow users to retrieve records associated with a checklist. This is possibly only of real use for smaller thematic checklists. The checklistKey is the GBIF dataset key for the checklist, e.g. 7ddf754f-d193-4cc9-b351-99906754a03b for Catalogue of Life.

https://api.gbif-dev2.org/v1/occurrence/search?checklistKey=7ddf754f-d193-4cc9-b351-99906754a03b

This only returns occurrence results when the specified checklist is one of the checklists supported by multi-taxonomy matching. Occurrence records that have been matched to a taxon in the specified checklist will be returned.


Searches with taxonKey and checklistKey

Searches with the new request parameter checklistKey together with taxonKey will allow users to specify the checklist the taxonKey belongs to. The following would be a query with a taxon from Catalogue of Life:

https://api.gbif-dev2.org/v1/occurrence/search?taxonKey=CB2MR&checklistKey=7ddf754f-d193-4cc9-b351-99906754a03b

The result of this query would be to find records associated with the supplied taxonKey from the checklist specified by the checklistKey. This only returns occurrence results when the specified checklist is one of the checklists supported by multi-taxonomy matching.


Searches with scientificName and checklistKey

Searches with the new request parameter checklistKey and scientificName will allow users to specify the taxonomy in use when matching the scientificName provided.

https://api.gbif-dev2.org/v1/occurrence/search?scientificName=Episyrphus%20(Episyrphus)%20balteatus&checklistKey=7ddf754f-d193-4cc9-b351-99906754a03b

This will use name usage matching using the checklist with the specified checklistKey. The checklist will resolve the name to a taxonKey in the checklist, and this will be used for occurrence searching.

The result of this query would be to find records associated with the matched taxonKey from the checklist specified by the checklistKey. This only returns occurrence results when the specified checklist is one of the checklists supported by multi-taxonomy matching.
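When building these URLs from code, the scientific name needs percent-encoding because it contains spaces and parentheses. A minimal sketch using only the Python standard library; the `search_url` helper is hypothetical, and the base URL is the dev endpoint used in the examples above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.gbif-dev2.org/v1/occurrence/search"


def search_url(**params: str) -> str:
    """Build an occurrence search URL, encoding spaces as %20 rather than '+'."""
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


url = search_url(
    scientificName="Episyrphus (Episyrphus) balteatus",
    checklistKey="7ddf754f-d193-4cc9-b351-99906754a03b",
)
print(url)
```

Note that this also encodes the parentheses (as %28 and %29), which is stricter than the hand-written URL above but should decode to the same name.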

Facet on checklistKey

Faceting on checklistKey can be combined with any query to retrieve the list of checklists relevant to that search:

https://api.gbif-dev2.org/v1/occurrence/search?facet=checklistKey&limit=0

Will return:

{
  "offset": 0,
  "limit": 0,
  "endOfRecords": false,
  "count": 100,
  "results": [ ],
  "facets": [
    {
      "field": "CHECKLIST_KEY",
      "counts": [
        {
          "name": "2d59e5db-57ad-41ff-97d6-11f5fb264527",
          "count": 100
        },
        {
          "name": "7ddf754f-d193-4cc9-b351-99906754a03b",
          "count": 100
        },
        {
          "name": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
          "count": 100
        }
      ]
    }
  ]
}

The UUIDs returned here are datasetKey values in the GBIF registry.
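A client could turn the facet block into a simple lookup of checklist datasetKey to record count. This is a sketch that assumes the response shape shown above; `facet_counts` is an illustrative helper name, not part of any GBIF client.

```python
def facet_counts(response: dict, field: str) -> dict:
    """Return {name: count} for the facet with the given field, or {} if absent."""
    for facet in response.get("facets", []):
        if facet.get("field") == field:
            return {c["name"]: c["count"] for c in facet.get("counts", [])}
    return {}


# Response trimmed from the example above.
response = {
    "count": 100,
    "results": [],
    "facets": [
        {
            "field": "CHECKLIST_KEY",
            "counts": [
                {"name": "2d59e5db-57ad-41ff-97d6-11f5fb264527", "count": 100},
                {"name": "7ddf754f-d193-4cc9-b351-99906754a03b", "count": 100},
            ],
        }
    ],
}

checklists = facet_counts(response, "CHECKLIST_KEY")
for dataset_key, count in checklists.items():
    print(dataset_key, count)
```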

Facets with checklistKey filter

The facets for taxonKey and higher rank taxon keys e.g. kingdomKey, genusKey will return values based on the GBIF taxonomy by default. If a checklistKey is specified, then results will be from that checklist. For example:

https://api.gbif-dev2.org/v1/occurrence/search?checklistKey=2d59e5db-57ad-41ff-97d6-11f5fb264527&facet=familyKey

Returns facets for familyKey values for WoRMS:

{
    "offset": 0,
    "limit": 0,
    "endOfRecords": false,
    "count": 497984,
    "results": [],
    "facets": [{
        "field": "FAMILY_KEY",
        "counts": [{
            "name": "urn:lsid:marinespecies.org:taxname:235102",
            "count": 64908
        }, {
            "name": "urn:lsid:marinespecies.org:taxname:147429",
            "count": 23931
        }, {
            "name": "urn:lsid:marinespecies.org:taxname:196044",
            "count": 18861
        }, {
            "name": "urn:lsid:marinespecies.org:taxname:234449",
            "count": 18357
        }]
    }]
}

Search by any rank

Support search by any taxonomic rank. Applications using the web services can retrieve a list of the checklists indexed. With a checklist ID, a list of rank key field names can be retrieved:

https://api.gbif-dev2.org/v1/occurrence/search/checklist/2d59e5db-57ad-41ff-97d6-11f5fb264527/rankKeys

Rank keys can be used to search occurrences by ranks other than the major Linnaean ranks, such as subphylum or suborder:

https://api.gbif-dev2.org/v1/occurrence/search?checklistKey=2d59e5db-57ad-41ff-97d6-11f5fb264527&subphylumKey=urn:lsid:marinespecies.org:taxname:886369

This example is searching subphylum using the WoRMS checklist.

Search by taxonDepth

With Catalogue of Life and other taxonomic sources such as WoRMS, we need to support searching across different ranks. To aid UI development, particularly taxonomic tree browsing components, we can add support for taxonDepth. This allows querying the taxonomic tree by numerical depth within the tree, as opposed to by a specific taxonomic rank (e.g. kingdom).

This URL will return root taxa (regardless of rank) for the specified checklist.

https://api.gbif-dev2.org/v1/occurrence/search?checklistKey=7ddf754f-d193-4cc9-b351-99906754a03b&facet=taxonDepth0&limit=0

This URL will return child taxa (regardless of rank) of the taxon with taxonKey=5T6MX for the specified checklist.

https://api.gbif-dev2.org/v1/occurrence/search?checklistKey=7ddf754f-d193-4cc9-b351-99906754a03b&facet=taxonDepth1&limit=0&taxonDepth0=5T6MX

Example output

{
  "offset": 0,
  "limit": 0,
  "endOfRecords": false,
  "count": 450926,
  "results": [ ],
  "facets": [
    {
      "field": "TAXON_DEPTH_1",
      "counts": [
        {
          "name": "P",
          "count": 314969
        },
        {
          "name": "N",
          "count": 135686
        },
        {
          "name": "c2ce3656-5b6e-46ea-b042-2056011ddb30",
          "count": 188
        },
        {
          "name": "B6LM6",
          "count": 78
        },
        {
          "name": "F",
          "count": 4
        },
        {
          "name": "C",
          "count": 1
        }
      ]
    }
  ]
}
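For a tree-browsing component, the two URLs above can be generated from the path of taxon keys the user has expanded so far. The sketch below reproduces both example URLs; the helper name is hypothetical, and sending every ancestor as a taxonDepthN filter at deeper levels is an assumption generalised from the depth 0 and depth 1 examples, not something the proposal states.

```python
from typing import Sequence
from urllib.parse import urlencode

BASE_URL = "https://api.gbif-dev2.org/v1/occurrence/search"


def children_facet_url(checklist_key: str, path: Sequence[str] = ()) -> str:
    """URL that facets on the depth just below `path` (taxon keys from the root down).

    An empty path asks for root taxa (facet=taxonDepth0).
    """
    params = {
        "checklistKey": checklist_key,
        "facet": f"taxonDepth{len(path)}",
        "limit": "0",
    }
    # Assumption: each expanded ancestor is passed as a taxonDepthN filter.
    for depth, taxon_key in enumerate(path):
        params[f"taxonDepth{depth}"] = taxon_key
    return f"{BASE_URL}?{urlencode(params)}"


COL = "7ddf754f-d193-4cc9-b351-99906754a03b"
roots_url = children_facet_url(COL)
children_url = children_facet_url(COL, ["5T6MX"])
print(roots_url)
print(children_url)
```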

Predicate search API

In the predicate API, the EqualsPredicate and InPredicate have been extended with a checklistKey field, allowing the user to specify the checklist that should be used for taxonomic key fields and taxon depth fields. The predicate API supports searching with multiple taxonomies in a single query, e.g. users can combine a taxonKey from WoRMS and a taxonKey from Catalogue of Life in one search.

Example with single SPECIES_KEY

{
    "predicate": {
        "type": "and",
        "predicates": [
            {
                "type": "equals",
                "key": "SPECIES_KEY",
                "value": "6HQ2Y",
                "checklistKey": "7ddf754f-d193-4cc9-b351-99906754a03b"
            }
        ]
    }
}

Example with TAXON_DEPTH_0

{
    "predicate": {
        "type": "and",
        "predicates": [
            {
                "type": "equals",
                "key": "TAXON_DEPTH_0",
                "value": "5T6MX",
                "checklistKey": "7ddf754f-d193-4cc9-b351-99906754a03b"
            }
        ]
    }
}

Example with multiple SPECIES_KEY values with taxa from different checklists (WoRMS and CoL in this example):

{
    "predicate": {
        "type": "or",
        "predicates": [
            {
                "type": "equals",
                "key": "SPECIES_KEY",
                "value": "5T6MX",
                "checklistKey": "7ddf754f-d193-4cc9-b351-99906754a03b"
            },
            {
                "type": "equals",
                "key": "SPECIES_KEY",
                "value": "urn:lsid:marinespecies.org:taxname:159142",
                "checklistKey": "2d59e5db-57ad-41ff-97d6-11f5fb264527"
            }
        ]
    }
}
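The predicate bodies above can also be assembled programmatically. The following is a small sketch that produces the same JSON shape as the examples; the helper functions (`equals`, `in_values`, `any_of`) are illustrative names only, and nothing is sent to the API here.

```python
import json
from typing import Optional, Sequence


def equals(key: str, value: str, checklist_key: Optional[str] = None) -> dict:
    """Build an 'equals' predicate, optionally scoped to a checklist."""
    predicate = {"type": "equals", "key": key, "value": value}
    if checklist_key:
        predicate["checklistKey"] = checklist_key
    return predicate


def in_values(key: str, values: Sequence[str], checklist_key: Optional[str] = None) -> dict:
    """Build an 'in' predicate, optionally scoped to a checklist."""
    predicate = {"type": "in", "key": key, "values": list(values)}
    if checklist_key:
        predicate["checklistKey"] = checklist_key
    return predicate


def any_of(*predicates: dict) -> dict:
    """Combine predicates with 'or'."""
    return {"type": "or", "predicates": list(predicates)}


COL = "7ddf754f-d193-4cc9-b351-99906754a03b"
WORMS = "2d59e5db-57ad-41ff-97d6-11f5fb264527"

payload = {
    "predicate": any_of(
        equals("SPECIES_KEY", "5T6MX", COL),
        equals("SPECIES_KEY", "urn:lsid:marinespecies.org:taxname:159142", WORMS),
    )
}
print(json.dumps(payload, indent=4))
```

Keeping checklistKey on each leaf predicate is what lets one query mix taxa from different checklists.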

For testing with curl:

curl --request POST \
  --header "Content-Type: application/json" \
  --data '{
    "predicate": {
      "type": "and",
      "predicates": [
        {
          "type": "equals",
          "key": "TAXON_DEPTH_0",
          "value": "5T6MX",
          "checklistKey": "7ddf754f-d193-4cc9-b351-99906754a03b"
        }
      ]
    }
  }' \
  https://api.gbif-dev2.org/v1/occurrence/search/predicate 

Example with curl, using WoRMS and multiple species key values from WoRMS:

curl --request POST \
  --header "Content-Type: application/json" \
  --data '{
    "predicate": {
      "type": "and",
      "predicates": [
        {
          "type": "in",
          "key": "SPECIES_KEY",
          "values": [
              "urn:lsid:marinespecies.org:taxname:159142",
              "urn:lsid:marinespecies.org:taxname:159037"
          ], 
          "checklistKey": "2d59e5db-57ad-41ff-97d6-11f5fb264527"
        }
      ]
    }
  }' \
  https://api.gbif-dev2.org/v1/occurrence/search/predicate

MortenHofft commented 1 month ago

This is a large piece of new functionality, so I suppose large changes are expected. Here are the things that surprised me.

Searches with taxonKey and checklistKey in GET API

The current version only allows for one checklist, which is okay I suppose. It might be unlikely that anyone wants to use more than one.

I'm not dead against this, but it is slightly puzzling because it changes the behaviour of what taxonKey refers to. I get 10 results for taxonKey=3, then I add an additional filter for checklistKey=123 and get more results. And I can only add it once, which is a bit unusual but not crazy (the same goes for flags), but given that this uses keys I expected to be able to add multiple.

Ideas: for species search we have flags that indicate changed behaviour (verbose=true, strict, qField=SCIENTIFIC). We could have something like matchMultipleChecklists=true which indicates changed behaviour. Once I add that flag, taxonKey would match against all checklists. I can then decide to narrow that by adding one or more checklistKey values.

Another version: checklistTaxonKey=[datasetKey]:[taxonKey]
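To make the composite form concrete, here is a sketch of how a value like that could be parsed. The checklistTaxonKey parameter is only an idea from this thread, so everything here is hypothetical. One detail worth noting: taxon keys such as the WoRMS LSIDs contain colons themselves, so the split has to happen on the first colon only (dataset keys are UUIDs and contain none).

```python
from typing import Tuple


def parse_checklist_taxon_key(value: str) -> Tuple[str, str]:
    """Split '[datasetKey]:[taxonKey]' on the first colon only."""
    checklist_key, separator, taxon_key = value.partition(":")
    if not separator or not checklist_key or not taxon_key:
        raise ValueError(f"expected '[datasetKey]:[taxonKey]', got {value!r}")
    return checklist_key, taxon_key


col_pair = parse_checklist_taxon_key("7ddf754f-d193-4cc9-b351-99906754a03b:CB2MR")
worms_pair = parse_checklist_taxon_key(
    "2d59e5db-57ad-41ff-97d6-11f5fb264527:urn:lsid:marinespecies.org:taxname:159142"
)
print(col_pair)
print(worms_pair)
```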

Search using predicates

If I understand the conversation elsewhere correctly, then the predicate approach is

{
  type: 'and',
  predicates: [
    {
      type: 'equals',
      key: 'checklistKey',
      value: '123-123-123'
    },
    {
      type: 'equals',
      key: 'taxonKey',
      value: '5dX'
    }
  ]
}

And to only allow the checklist predicate once.

This is confusing to me. First, it isn't clear to me how the two predicates in the AND influence each other. Second, it is odd that it can only be used once. And lastly, it is unclear what part of the tree it applies to in that case (I imagine a more complex predicate with multiple AND/OR/NOT).

If it is only allowed once, then it isn't a predicate in my mind, but belongs on the same level as the q param: outside the predicate structure.

Something like {type: equals, key: taxonKey, checklist: '123-123-123', value: 5dX} is easier to understand and more expressive I would think.

Or {type: equals, key: checklistTaxonKey, checklist: '123-123-123', value: 5dX} or even a new type like {type: checklistContext, checklistKey: '123-123-123', predicates: []} which then specifies the taxon scope for anything beneath.
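To illustrate how the checklistContext idea could relate to the per-predicate form, here is a rough sketch that rewrites a tree containing checklistContext nodes into one where each taxon predicate carries its own checklistKey. All of it is hypothetical: the checklistContext type is only a suggestion from this comment, the TAXON_KEYS set is an illustrative subset, and treating multiple children of a context as an implicit AND is my assumption.

```python
from typing import Optional

# Illustrative subset of keys that would be checklist-scoped (assumption).
TAXON_KEYS = {"TAXON_KEY", "SPECIES_KEY", "GENUS_KEY", "FAMILY_KEY"}


def resolve_context(node: dict, inherited: Optional[str] = None) -> dict:
    """Push checklistKey from checklistContext wrappers down to taxon leaves."""
    if node["type"] == "checklistContext":
        children = [resolve_context(c, node["checklistKey"]) for c in node["predicates"]]
        # Assumption: several children under one context are ANDed together.
        return children[0] if len(children) == 1 else {"type": "and", "predicates": children}
    if "predicates" in node:  # and / or
        return {**node, "predicates": [resolve_context(c, inherited) for c in node["predicates"]]}
    if "predicate" in node:  # not
        return {**node, "predicate": resolve_context(node["predicate"], inherited)}
    if inherited and node.get("key") in TAXON_KEYS and "checklistKey" not in node:
        return {**node, "checklistKey": inherited}
    return dict(node)


tree = {
    "type": "checklistContext",
    "checklistKey": "123-123-123",
    "predicates": [
        {
            "type": "or",
            "predicates": [
                {"type": "equals", "key": "TAXON_KEY", "value": "5dX"},
                {"type": "equals", "key": "COUNTRY", "value": "DK"},
            ],
        }
    ],
}

resolved = resolve_context(tree)
print(resolved)
```

Under these assumptions the two forms carry the same information, so the choice between them is mostly about which one is easier to read and write.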

djtfmartin commented 1 month ago

I've updated the main "Draft proposal" a bit to include predicates. I think where I've landed thus far is:

fmendezh commented 1 month ago

Another option to consider is to use the checklist explicitly as part of the services that support multiple taxonomies, for example:

https://api.gbif-dev2.org/v1/occurrence/search/checklist/{checklistKey}?....

https://api.gbif-dev2.org/v1/occurrence/search/checklist/7ddf754f-d193-4cc9-b351-99906754a03b?....

MattBlissett commented 1 month ago

https://api.gbif-dev2.org/v1/occurrence/search/checklist=7ddf754f-d193-4cc9-b351-99906754a03b?....

Another option.