gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Thematic analytics #4505

Open gbif-portal opened 1 year ago

gbif-portal commented 1 year ago

Thematic analytics

As discussed with @tobiasgf and @thomasstjerne, it would be nice to have regular analytics, similar to the country reports, for the key thematic segments in GBIF, especially DNA, soil, freshwater, etc. @kingenloff we will need this for health, too.


Github user: @dschigel
System: Chrome 108.0.0 / Windows 10.0.0
Referer: https://www.gbif.org/health
Window size: width 1847 - height 913
System health at time of feedback: OPERATIONAL

dschigel commented 1 year ago

Implementation options vary from regular static PDF summaries to an on-portal thematic section to hosted portals. Key elements: 1) numbers, all kinds 2) map 3) taxonomy, e.g. pie or box chart plus a tree (if dynamic, like metrics)

ManonGros commented 1 year ago

@dschigel, for any type of statistics, someone would probably have to identify and tag all the relevant datasets

dschigel commented 1 year ago

Yes. Until BoR is fixed, this will have to be done at the i) dataset (tagging), ii) extension use and sometimes iii) publisher level. Example: everything from the MGnify publisher should contribute to the DNA analytics; we don't need to bother with dataset tagging there, and the extension was not used (confirmed by @thomasstjerne).

ManonGros commented 1 year ago

I would expect that for this kind of statistics, we probably want everything to be accessible in one query. We wouldn't want something like "this publisher and those three datasets and these three records"; we probably want something like "all the datasets with tag A". Maybe when we have categories for datasets (https://github.com/gbif/registry/issues/247), some things will be easier. But in any case, someone will have to identify the datasets that you want to make metrics on. That will probably be the most time-consuming part. Do you already have a list of health and soil datasets?

dschigel commented 1 year ago

Not yet, but this can and should be done once we hear from INF.

MattBlissett commented 1 year ago

Would soil, freshwater etc be dataset-based metrics, or taxon-based?

dschigel commented 1 year ago

Needs thinking and testing, but the quick answer is dataset-based (multiple datasets per theme), with some cases where tagging by theme can be sped up, verified or automated by detecting the use of a known extension or an origin from certain publishers.

thomasstjerne commented 1 year ago

Would soil, freshwater etc be dataset-based metrics, or taxon-based?

For DNA, this is currently something like: publisherKey in [key1, key2, ...] OR has DNA derived data extension OR has GGBN amplification extension
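To make the rule above concrete, here is a minimal sketch of it as a predicate over dataset records. It is not the code GBIF actually runs: the `Dataset` class, the placeholder publisher keys and the extension labels are all hypothetical and only illustrate the "publisher OR extension OR extension" logic.

```python
from dataclasses import dataclass, field

# Hypothetical placeholders: the real publisher keys and extension
# identifiers would come from the GBIF registry, not from this sketch.
DNA_PUBLISHER_KEYS = {"publisher-key-1", "publisher-key-2"}
DNA_EXTENSIONS = {"dna-derived-data", "ggbn-amplification"}


@dataclass
class Dataset:
    """Minimal stand-in for a registry dataset record (assumed shape)."""
    key: str
    publisher_key: str
    extensions: set = field(default_factory=set)


def is_dna_dataset(ds, publisher_keys=DNA_PUBLISHER_KEYS, extensions=DNA_EXTENSIONS):
    """True if the publisher is on the list OR any DNA-related extension is used."""
    return ds.publisher_key in publisher_keys or bool(ds.extensions & extensions)


# Example records, invented for illustration only.
datasets = [
    Dataset("a", "publisher-key-1"),                      # matches by publisher
    Dataset("b", "other-publisher", {"dna-derived-data"}),  # matches by extension
    Dataset("c", "other-publisher", {"measurement"}),      # no match
]

dna_keys = [ds.key for ds in datasets if is_dna_dataset(ds)]
```

The same shape would presumably work for other themes by swapping in a different publisher list and extension set, with manually tagged datasets added as a third condition.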

dschigel commented 1 year ago

Can we test this with DNA, to start with? The WP2023 says: Projected outcomes for 2027: Knowledge gaps are reduced by consolidating data coverage across the thematic areas of relevance. How do we know we did? By doing regular analytics, the before and after, theme by theme. Would the country analytics code combined with the criteria offered by @thomasstjerne do the trick?

dschigel commented 1 year ago

Maybe we can think about this with a country report at hand as a model: https://analytics-files.gbif-uat.org/country/DK/GBIF_CountryReport_DK.pdf. We can even go over a printout and mark which elements are applicable to thematic analytics and which are not. Even if the analytics as a visualisation will not be ready soon, a capture of the January state of the data per theme is necessary. DNA is the most compact and straightforward to try; once we know it works, we can proceed to tag soil, health and freshwater datasets (2023 priorities), then eventually all the rest.
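As a rough illustration of the "before and after, theme by theme" idea, the sketch below compares two per-theme snapshots of record counts. The function name, the snapshot format (a plain dict of theme to count) and all numbers are invented placeholders, not real GBIF figures or an existing analytics API.

```python
def compare_snapshots(before, after):
    """Return per-theme absolute and percentage change between two snapshots.

    Both arguments map a theme name to a record count. A theme missing from
    one snapshot is treated as 0; the percentage is None when there is no
    baseline to compare against.
    """
    report = {}
    for theme in sorted(set(before) | set(after)):
        old = before.get(theme, 0)
        new = after.get(theme, 0)
        pct = None if old == 0 else round(100 * (new - old) / old, 1)
        report[theme] = {"before": old, "after": new, "change": new - old, "pct": pct}
    return report


# Placeholder numbers for illustration only.
january = {"dna": 1000, "soil": 200}
later = {"dna": 1500, "soil": 200, "freshwater": 50}

report = compare_snapshots(january, later)
for theme, row in report.items():
    print(theme, row)
```

Storing the January snapshot as a small file per theme would be enough to run this kind of comparison later, even before any visualisation exists.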