gbif / hosted-portals

Support material for establishing the GBIF Hosted Portals
Apache License 2.0

How to tag datasets to create a collection of datasets? #129

Closed siwelisabeth closed 3 years ago

siwelisabeth commented 3 years ago

The Living Norway team would like to have 'data collection pages' in our hosted portal. A data collection page is a collection of datasets on a certain topic, e.g. 'Education' or 'Nature index'.

In order to get the datasets that belong to a collection, we need to tag the datasets in some way so that we can query the GBIF API for the datasets of a given collection in the Living Norway network. Do you have any suggestions on how to do that?

It would also be nice to list all the collection tags, so we can see which data collections exist in the Living Norway network. Is there a way to achieve this?

MortenHofft commented 3 years ago

I'm not entirely sure I understand, but there is currently no dataset search. Once available, it will be based on the existing APIs, which are documented here: https://www.gbif.org/developer/registry#datasetSearch

The topics you mention sound like they could be the same as keywords in the dataset search API.
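To make the keyword idea concrete, here is a minimal sketch of how a keyword-filtered query against the dataset search API might be built. It assumes the public base URL `https://api.gbif.org/v1` and the `keyword`, `limit` and `offset` query parameters; the helper name `dataset_search_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: public GBIF API base URL.
GBIF_API = "https://api.gbif.org/v1"


def dataset_search_url(keyword, limit=20, offset=0):
    """Build a dataset/search URL filtered by a single keyword.

    Hypothetical helper: it only constructs the URL and does not call the API.
    """
    params = {"keyword": keyword, "limit": limit, "offset": offset}
    return f"{GBIF_API}/dataset/search?{urlencode(params)}"


# Example: the 'Nature index' collection mentioned in the question.
print(dataset_search_url("Nature index"))
```

Note that this query is not restricted to a network, so it would match datasets with that keyword from anywhere in GBIF.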

Unfortunately there is no search support for datasets grouped by networks. https://github.com/gbif/registry/issues/338

So for networks we have a few options:

  1. implement https://github.com/gbif/registry/issues/338
  2. have a client/middleware/custom solution for filtering (double the code to maintain, and slower if there are many entries). For this I see the following options:
     2.1. Pull down all network datasets to the browser and populate a data table of some sort for client-side filtering.
     2.2. Use a proxy dataset search that consults the network and then sets a root scope of x datasets (could be thousands in VertNet's case). The normal API could then be used, but it would probably not perform well.
     2.3. Create a GraphQL endpoint that does server-side filtering instead of using the proper API. This would spare the frontend from extra code, but would be a half-baked solution that only allowed for a subset of filters.
  3. you write your own solution based on APIs
  4. manually keep a simple markdown table up to date (no search beyond ctrl+f)
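As a rough illustration of option 2.1, the sketch below pages through a network's datasets and filters them by keyword on the client. It is written in Python rather than browser code, and it rests on several assumptions: that the registry exposes `/network/{key}/constituents` with `limit`/`offset` paging, that each page carries `results` and `endOfRecords`, and that dataset records list their keywords under `keywordCollections`. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: public GBIF API base URL.
GBIF_API = "https://api.gbif.org/v1"


def fetch_json(url):
    """Default fetcher: GET a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def network_datasets(network_key, fetch=fetch_json, page_size=100):
    """Yield every constituent dataset of a network, page by page.

    Assumes paged responses shaped like {"results": [...], "endOfRecords": bool}.
    """
    offset = 0
    while True:
        page = fetch(
            f"{GBIF_API}/network/{network_key}/constituents"
            f"?limit={page_size}&offset={offset}"
        )
        results = page.get("results", [])
        yield from results
        # Stop on the last page, or on an empty page as a safety net.
        if page.get("endOfRecords", True) or not results:
            break
        offset += page_size


def has_keyword(dataset, keyword):
    """Return True if the dataset lists the keyword (case-insensitive).

    Assumes keywords live in dataset["keywordCollections"][i]["keywords"].
    """
    wanted = keyword.casefold()
    return any(
        wanted == kw.casefold()
        for collection in dataset.get("keywordCollections", [])
        for kw in collection.get("keywords", [])
    )


# Usage (performs network calls; the network key is a placeholder):
# education = [d for d in network_datasets("<network-uuid>")
#              if has_keyword(d, "Education")]
```

The fetcher is passed in as a parameter so the paging logic can be exercised without touching the network. As noted in the option itself, this gets slower as the number of datasets in the network grows.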

I would far prefer option 1, but that is easy for me to say as I wouldn't be the one implementing it.

siwelisabeth commented 3 years ago

Thank you @MortenHofft for your reply and the different approaches to this. One question - could it be a good idea to rely on machine tags for this, and query the API for datasets with a certain machine tag? We would then have to manually maintain a list of the machine tags we are using for our collections.

MortenHofft commented 3 years ago

It could, but there are two dataset search APIs: one based on PostgreSQL, /v1/dataset, and one based on Elasticsearch, /v1/dataset/search. The former supports machineTags; the latter does not. But the latter supports facets and free-text queries. Even if we extended the dataset/search endpoint with machineTag search, you still wouldn't be able to edit the machineTags. That would require new functionality in the registry backend. So using machineTags isn't a simple option, but other than that you are right - it is also an option.
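For reference, a machineTag lookup against the PostgreSQL-backed endpoint might look roughly like the sketch below. It assumes the `machineTagNamespace`, `machineTagName` and `machineTagValue` query parameters on /v1/dataset; the namespace, name and value in the example are invented placeholders, not tags that exist in the registry.

```python
from urllib.parse import urlencode

# Assumption: public GBIF API base URL.
GBIF_API = "https://api.gbif.org/v1"


def machine_tag_query_url(namespace, name=None, value=None, limit=20):
    """Build a /v1/dataset URL that filters on a machineTag.

    Hypothetical helper: it only constructs the URL and does not call the API.
    """
    params = {"machineTagNamespace": namespace}
    # Name and value are optional refinements within the namespace.
    if name is not None:
        params["machineTagName"] = name
    if value is not None:
        params["machineTagValue"] = value
    params["limit"] = limit
    return f"{GBIF_API}/dataset?{urlencode(params)}"


# Placeholder namespace/name/value, for illustration only.
print(machine_tag_query_url("livingnorway.no", "collection", "education"))
```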

siwelisabeth commented 3 years ago

Can a machine tag be added to a dataset by us if we have the correct permissions in the registry? Or are machine tags something that is generated? I see that we can add keywords/tags to the datasets.

MortenHofft commented 3 years ago

MachineTags can only be created/edited/deleted by admins I'm afraid.

Don't keywords fit your use case quite nicely? The examples 'Education' and 'Nature index' sound like keywords. What is your concern with using them as such?

siwelisabeth commented 3 years ago

@MortenHofft Thank you for your reply. I also think that keywords will fit our case. I was just asking about machine tags to ensure we choose the best solution. I was not fully aware of what a machine tag is. Sorry for that :-)

MortenHofft commented 3 years ago

Honestly, I'm not sure what the best solution is either, but keywords sound like they would work for what you describe. And best of all, they don't require any extra development, so you can start using them today.

MachineTags were introduced before my time at GBIF, but I believe the original plan was to be able to give access (to, say, you as a developer) to edit any machineTag within a namespace. But that was never implemented, and it is currently admin only. The suggestion to scope machineTags was brought up in another issue recently.

The challenge I can see with keywords is that they require you to edit the dataset archive, whereas tags and machineTags can be edited in the registry/API, which might be convenient in some cases. But frustratingly, tags and machineTags are not exposed in dataset/search. Hence my suggestion to use keywords: it seems the simplest solution right now.
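On the earlier question of listing all collection tags: since dataset/search supports facets, a keyword facet could in principle give that list. The sketch below assumes `facet=keyword`, `facetLimit` and `limit=0` are accepted by dataset/search, and that the response carries a `facets` list whose entries have `field` and `counts` (each count with `name` and `count`). The helper names are hypothetical, and the counts would not be scoped to a network.

```python
from urllib.parse import urlencode

# Assumption: public GBIF API base URL.
GBIF_API = "https://api.gbif.org/v1"


def keyword_facet_url(facet_limit=100):
    """Build a dataset/search URL that asks only for keyword facet counts."""
    # limit=0 is assumed to suppress the result list and return facets only.
    params = {"facet": "keyword", "limit": 0, "facetLimit": facet_limit}
    return f"{GBIF_API}/dataset/search?{urlencode(params)}"


def keyword_counts(response):
    """Extract {keyword: dataset count} from a decoded search response.

    Assumes response["facets"] is a list of {"field": ..., "counts": [...]}.
    """
    for facet in response.get("facets", []):
        if facet.get("field", "").upper() == "KEYWORD":
            return {c["name"]: c["count"] for c in facet.get("counts", [])}
    return {}


print(keyword_facet_url())
```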