facets in REST API not working as advertised

gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓


facets in REST API not working as advertised #2068

Closed skybristol closed 5 years ago

skybristol commented 5 years ago

Documentation states...

facet: "A list of facet names used to retrieve the most frequent values for a field. Facets are allowed for all the parameters except for: last interpreted date, event date and geometry"

...but I can't get a number of the fields to work, and a comma-separated list of fields that do work individually results in an empty facets array in the response.

Works...

http://api.gbif.org/v1/occurrence/search?scientificName=Rhyacotriton%20cascadae&country=US&limit=0&facet=basisOfRecord
http://api.gbif.org/v1/occurrence/search?scientificName=Rhyacotriton%20cascadae&country=US&limit=0&facet=year
http://api.gbif.org/v1/occurrence/search?scientificName=Rhyacotriton%20cascadae&country=US&limit=0&facet=institutionCode

A list doesn't work...

http://api.gbif.org/v1/occurrence/search?scientificName=Rhyacotriton%20cascadae&country=US&limit=0&facet=year,basisOfRecord

...along with lots of other individual properties that I'd like access to...

http://api.gbif.org/v1/occurrence/search?scientificName=Rhyacotriton%20cascadae&country=US&limit=0&facet=datasetName

Did I manage to catch the system in the middle of reindexing, or am I missing something about facets?

MattBlissett commented 5 years ago

Hi Sky,

Very briefly (as it's the evening here):

For multiple facets, repeat the facet parameter: http://api.gbif.org/v1/occurrence/search?scientificName=Rhyacotriton%20cascadae&country=US&limit=0&facet=year&facet=basisOfRecord

http://api.gbif.org/v1/occurrence/search?scientificName=Rhyacotriton%20cascadae&country=US&limit=0&facet=datasetKey (you would need to use the registry API to get the titles).

The parameter names should be the same as the keys in the occurrence response. I see that datasetKey, dataset_key and DATASET_KEY all work, so I don't think we've finalized the API. The documentation still says the facet parameters are experimental and may change...
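As a minimal sketch of the repeated-parameter form, the following Python builds such a URL from a list of facet names. The helper name `facet_search_url` is hypothetical, and the `https` base URL is an assumption taken from the later links in this thread; nothing here calls the API.

```python
from urllib.parse import quote, urlencode

# Assumed base URL; the thread uses both http and https forms of this endpoint.
BASE_URL = "https://api.gbif.org/v1/occurrence/search"


def facet_search_url(filters, facets, limit=0):
    """Build a search URL that repeats `facet` once per requested field.

    Hypothetical helper for illustration only.
    """
    params = list(filters.items())
    params.append(("limit", limit))
    # One ("facet", name) pair per field, rather than a comma-separated list.
    params.extend(("facet", name) for name in facets)
    # quote_via=quote encodes spaces as %20, matching the URLs in this thread.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


url = facet_search_url(
    {"scientificName": "Rhyacotriton cascadae", "country": "US"},
    ["year", "basisOfRecord"],
)
print(url)
```

Passing the parameters as a list of pairs, instead of a dict, is what allows the same key to appear more than once.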

skybristol commented 5 years ago

Sweet! Thanks for being so quick! That helps on the multiple facets issue. I'll provide a specific list of the fields I'm interested in that do not seem to be showing facet totals for me, but this helps quite a bit.

skybristol commented 5 years ago

For the most part, a query like this on taxonKey or scientificName and country is going to generate the basic statistics I need to work with in evaluating availability of observation data for species of interest.

http://api.gbif.org/v1/occurrence/search?scientificName=Rhyacotriton%20cascadae&country=US&limit=0&facet=institutionCode&facet=year&facet=basisOfRecord&facet=datasetKey

The two other text fields that I would like to get facet information on but which do not seem to be returning anything at the moment are the following:

stateProvince
datasetName

I can certainly follow datasetKey to the registry API to pull lots of detail, but I'm really just after the simple dataset name text parameter as a quick visual look at what datasets comprise the bulk of records for a given species. Inclusion of our US States via stateProvince (which seems to be populated in a lot of the data) would be a really nice convenience vs. doing anything spatially with the full set of occurrence records.

MattBlissett commented 5 years ago

stateProvince isn't available for faceting, unfortunately. I should check with @fmendezh, but I think that decision was probably made because we don't interpret the field, so the values will be the original values from the data provider. You would see South Dakota, S Dakota, S.Dak., S. Dak., SD, etc.

Having said that, the American-published data seems remarkably clean here. Interpreting this field is one of the things we intend to improve with the new pipelines-based system.
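Until the field is interpreted, one possible workaround is to tally stateProvince on the client from downloaded occurrence records. The sketch below is only an illustration: the alias table is hand-written from the variants listed above, the helper names are hypothetical, and it assumes each record is a dict with an optional `stateProvince` key.

```python
from collections import Counter

# Hand-written aliases for illustration; a real table would need every state.
ALIASES = {
    "southdakota": "South Dakota",
    "sdakota": "South Dakota",
    "sdak": "South Dakota",
    "sd": "South Dakota",
}


def _key(value):
    """Lower-case the value and drop spaces and punctuation for lookup."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def tally_states(records):
    """Count records per state, folding known spelling variants together."""
    counts = Counter()
    for record in records:
        raw = record.get("stateProvince")
        if not raw:
            continue  # the provider did not supply a value
        counts[ALIASES.get(_key(raw), raw.strip())] += 1
    return counts


# Made-up records, used only to exercise the function.
records = [
    {"stateProvince": "South Dakota"},
    {"stateProvince": "S Dakota"},
    {"stateProvince": "S.Dak."},
    {"stateProvince": "S. Dak."},
    {"stateProvince": "SD"},
    {"stateProvince": "Oregon"},
    {},
]
state_counts = tally_states(records)
print(state_counts)
```

Values with no alias are kept as supplied, so unexpected spellings stay visible instead of being silently merged.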

datasetName (or http://rs.tdwg.org/dwc/terms/datasetName) is a Darwin Core term, but it's not used in our occurrence API.

The query https://api.gbif.org/v1/occurrence/search?country=US&facet=datasetKey&limit=0&facetLimit=50000 will retrieve the keys of all datasets with occurrences in the US. You then need to query https://api.gbif.org/v1/dataset/4fa7b334-ce0d-4e88-aaae-2e0c138d049e (etc) for each UUID. There are 2537 right now.
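The two-step lookup can be sketched roughly as follows. This assumes the facet response contains a `facets` list whose entries have a `field` name and a `counts` list of `name`/`count` pairs; that shape and the `DATASET_KEY` field label are assumptions here, the helper name is hypothetical, and the sample count is a placeholder, not real data.

```python
import json

# Registry endpoint pattern taken from the dataset URL quoted above.
DATASET_URL = "https://api.gbif.org/v1/dataset/{key}"


def dataset_counts(response, field="DATASET_KEY"):
    """Return {datasetKey: count} from a parsed occurrence search response.

    Assumes the response shape described in the lead-in.
    """
    for facet in response.get("facets", []):
        if facet.get("field", "").upper() == field:
            return {c["name"]: c["count"] for c in facet.get("counts", [])}
    return {}


# Hand-written sample in the assumed shape; the count of 1 is a placeholder.
sample = json.loads("""
{
  "facets": [
    {
      "field": "DATASET_KEY",
      "counts": [
        {"name": "4fa7b334-ce0d-4e88-aaae-2e0c138d049e", "count": 1}
      ]
    }
  ]
}
""")

counts = dataset_counts(sample)
# One registry URL per dataset key; fetching each would give the dataset record.
lookups = [DATASET_URL.format(key=key) for key in counts]
print(lookups)
```

Each registry response would then need to be read for its title; caching results by key would avoid repeating the same lookup across species queries.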

skybristol commented 5 years ago

Great! Thanks for the additional info. I didn't know exactly what might be happening to stateProvince in the processing steps. I'm curious about your "new pipelines-based system." I talked with @timrobertson100 a little bit when I visited you guys in May about the sort of localization processing that might harmonize something like the stateProvince concept through a spatial indexing mechanism, potentially looping in other value-added indexing to something like watershed identifiers. Through something like the GBIC/GBIO mechanism, different communities could contribute specific processors for the pipeline that would handle and inject value-added properties, not as changes to source Darwin Core fields but as new fields within a particular context or domain.

Place-based indexing is one that almost all of us end up doing in some fashion. In the US, we are interested in working against our National Hydrography Dataset (a couple of different versions/resolutions) to index taxa observations in different ways to watershed boundaries and stream reach codes to aid in national-scale analyses and assessments. In the marine world, we are indexing OBIS observations to a set of marine areas (EEZs, Large Marine Ecosystems, Ecologically and Biologically Significant Areas, etc.).

Is that the kind of framework you all are setting up already? It would be great to have the ability to build processing algorithms based on some set of rules to hang them on that framework and add them to the powerful engine GBIF has built.

timrobertson100 commented 5 years ago

@skybristol - that is the intention, to allow communities to provide the reference catalogues and then for us to effectively enrich data (e.g. pre-join spatially) as it is processed.

The project is https://github.com/gbif/pipelines and the current focus is to replace Solr with Elasticsearch built using this pipeline, while maintaining a V1-compatible API. Once that is in production (it is in testing now; see this screencast) we'll look to a V2 API and enrichment. The first enrichments will likely be to use a simple vocabulary server we have built (think SKOS and it's close) to allow communities to provide synonyms for controlled terms and to add things like ORCID support. Following that I expect 1) multiple taxonomies such as ITIS for all US records, and 2) spatial catalogues such as you describe.

If you could put some thought into some of the catalogues you would like to see data organised to, that would be appreciated. The catalogue building is the most time-consuming part.