MortenHofft commented 3 months ago

We could add facets to collection search

Two types of metrics would be possible:

collection facets - counting collections with a given filter - this is the standard behaviour for facets
specimen facets - counting specimens with a given filter.

Specimen facets are the only thing that makes sense within a single collection. Both types make sense for institutions and GRSciColl generally, but currently there isn't any data for it.

examples of collection facet questions:

How many collections have data in Spain?
How many collections have data about taxon x?
How many collections have type specimens of taxon x?
Which are the most prevalent preservation types for this collection?
Breakdowns across collections: how many collections per kingdom, preservation type, country, type specimens, types/country, types/kingdom

examples of specimen facet questions:

Which orders does this collection mainly deal with?
Breakdown of phyla per country for a collection/institution/total
Breakdowns for all: specimens per kingdom, preservation type, country, type specimens, types/country, types/kingdom

We could start with collection facets?

E.g. ?country=ES&country=FR&facet=kingdomKey, with the same behaviour as facets normally have.
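
A minimal sketch of how a client could build such a request. The base URL is a placeholder and the `limit` parameter is added only for illustration; the filter and facet names follow the example above and are not a confirmed API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.org/grscicoll/collection/search"


def build_facet_query(filters, facets, limit=0):
    """Return a search URL with repeated filter params and facet params."""
    params = []
    # Each filter value becomes its own key=value pair (country=ES&country=FR).
    for field, values in filters.items():
        for value in values:
            params.append((field, value))
    # Each requested facet is passed as a repeated "facet" parameter.
    for facet in facets:
        params.append(("facet", facet))
    params.append(("limit", limit))
    return f"{BASE_URL}?{urlencode(params)}"


url = build_facet_query({"country": ["ES", "FR"]}, ["kingdomKey"])
print(url)
```

Repeating the `country` parameter keeps the usual filter semantics, so the facet counts would be computed within the combined ES/FR result set.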

These collection facets are what I'm guessing would be useful: descriptorCountry, country, kingdomKey, phylumKey, ...other taxonGroupKeys..., typeStatus, preservationType, contentType, personalCollection, institutionKey, active

Ideally we would add something new to the API, namely the cardinality of those facets: an option to get not only the top 10 orders, but also the number of unique orders. This makes it easier to build the UI. Examples where cardinality is used: https://grscicoll.hp.gbif-staging.org/specimen/search?layout=W1t7ImlkIjoiYm1tNW8iLCJwIjp7fSwidHJhbnNsYXRpb24iOiJkYXNoYm9hcmQuc3RhdGlzdGljcyIsInQiOiJvY2N1cnJlbmNlU3VtbWFyeSJ9XSxbeyJpZCI6IjE4NGhxIiwicCI6eyJ2aWV3IjoiVEFCTEUifSwidHJhbnNsYXRpb24iOiJmaWx0ZXJzLmNvbGxlY3Rpb25LZXkubmFtZSIsInQiOiJjb2xsZWN0aW9uS2V5In1dXQ%3D%3D&view=DASHBOARD (distinct species and distinct taxa in the statistics chart, plus the number of results in the collection chart)

MortenHofft commented 3 months ago

If we go for cardinality, we might want to discuss with the rest of the team what the API should look like.

Ideas: just include it in the normal facet response, e.g. ?facet=type&facetLimit=2

"facets": [
  {
    "field": "TYPE",
    "cardinality": 4, <==== new field that lists the total number of distinct facet values, not just those in the response
    "counts": [
      {
        "name": "CHECKLIST",
        "count": 53833
      },
      {
        "name": "OCCURRENCE",
        "count": 49485
      }
    ]
  }
]
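
To make the difference between the truncated counts and the cardinality concrete, here is a rough sketch that produces the response shape above from a flat list of values. The function name and the sample values are made up for illustration; a real implementation would presumably use a cardinality aggregation in the search backend rather than counting in memory.

```python
from collections import Counter


def facet_with_cardinality(values, field, facet_limit=10):
    """Build one facet entry: top-N counts plus the total number of distinct values."""
    counts = Counter(values)
    return {
        "field": field,
        # Distinct values overall, independent of facet_limit.
        "cardinality": len(counts),
        # Only the facet_limit most frequent values are returned.
        "counts": [
            {"name": name, "count": count}
            for name, count in counts.most_common(facet_limit)
        ],
    }


# Made-up sample data: four distinct types, only two returned.
sample = ["CHECKLIST"] * 5 + ["OCCURRENCE"] * 4 + ["METADATA"] * 2 + ["SAMPLING_EVENT"]
facet = facet_with_cardinality(sample, "TYPE", facet_limit=2)
print(facet)
```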

Another approach: use ?facet=something&cardinality=publisherKey&limit=0&offset=0 and return a separate cardinality section in the response

{
  "count": 1000,
  "limit": 0,
  "offset": 0,
  "results": [],
  "facets": ...,
  "cardinality": {
    "PUBLISHER_KEY": 1234 <==== distinct publisherKeys within the given search filter
  }
}
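
A small sketch of this second shape, assuming the records matching the filter are already in hand. The helper names are hypothetical, and the camelCase to UPPER_SNAKE key conversion is only inferred from the publisherKey / PUBLISHER_KEY pair in the example above.

```python
import re


def to_enum_name(param):
    """Convert a camelCase parameter name to UPPER_SNAKE, e.g. publisherKey -> PUBLISHER_KEY."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", param).upper()


def cardinality_block(records, fields):
    """Count distinct non-null values of each requested field within the filtered records."""
    block = {}
    for field in fields:
        distinct = {r[field] for r in records if r.get(field) is not None}
        block[to_enum_name(field)] = len(distinct)
    return block


def search_response(records, cardinality_fields, limit=0, offset=0):
    """Assemble a paged response with a separate cardinality section."""
    return {
        "count": len(records),
        "limit": limit,
        "offset": offset,
        "results": records[offset:offset + limit],
        "cardinality": cardinality_block(records, cardinality_fields),
    }


# Made-up records: two distinct publishers, one record without a publisher.
records = [
    {"publisherKey": "a"},
    {"publisherKey": "a"},
    {"publisherKey": "b"},
    {"publisherKey": None},
]
response = search_response(records, ["publisherKey"])
print(response)
```

Keeping cardinality in its own section means a caller can ask for a distinct count without requesting the facet itself.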

MortenHofft commented 3 months ago

Specimen facets seem more difficult

E.g. counting the number of specimens per kingdom. Quick thoughts on the subject: it could be nice within collections once some collections are richly described. But it seems more difficult, both for the API and to present in a meaningful and fair way.

facets: [
  {kingdomKey: 1, count: 123456} // (from 2 csv rows. one with 123000 individuals and another with 456 individuals)
]

The count is the sum of individualCount across all descriptors that have kingdomKey=1. So you would have to get the distinct kingdoms within the filter and, for each one, sum the individual counts of all the matching descriptors.
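
The summing step could look roughly like this. It is a sketch only: descriptors are modelled as plain dicts, the function name is made up, and skipping descriptors without a count is an assumption about how missing data would be handled.

```python
from collections import defaultdict


def specimen_facet(descriptors, field="kingdomKey"):
    """Sum individualCount per distinct value of `field` across matching descriptors."""
    totals = defaultdict(int)
    for descriptor in descriptors:
        key = descriptor.get(field)
        count = descriptor.get("individualCount")
        # Assumption: descriptors lacking the key or a count do not contribute.
        if key is None or count is None:
            continue
        totals[key] += count
    # Largest totals first, mirroring how facet counts are usually ordered.
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [{field: key, "count": total} for key, total in ordered]


# The first two rows mirror the example above; the others are made up.
rows = [
    {"kingdomKey": 1, "individualCount": 123000},
    {"kingdomKey": 1, "individualCount": 456},
    {"kingdomKey": 6, "individualCount": 10},
    {"kingdomKey": 6},  # no specimen count, so it is skipped
]
facets = specimen_facet(rows)
print(facets)
```

The skipped row is exactly the kind of case that would need to be surfaced as a caveat in the UI.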

Presentation-wise, the UI would probably have to show caveats like these for e.g. a kingdom breakdown:

Only 10% of collections have CDs
Only 50% of those with matching CDs have marked them as "no double counting" [?]
Only 20% of collections that provide CDs have marked them as exhaustive (describes their entire collection)
Only 40% of the provided CDs have a scientific name
Only 30% of the matched CDs have a specimen count
That means that this chart is based on 130 descriptors from 4 collections.

ManonGros commented 3 months ago

Thanks Morten! The collection facets and proposed implementation make a lot of sense.

The specimen facets are much more complicated and yes we would have to display a lot of caveats. We would also have to add some other fields to know if the people uploading records have double counted, are exhaustive, etc.

gbif / hp-grscicoll

add facets to collection search #166
