gbif / hp-colombian-biodiversity

https://biodiversidad.co

New publisher not "filtering" #39

Open camiplata opened 2 years ago

camiplata commented 2 years ago

We published this checklist from a new publisher on June 30. When we search for the checklist name or the publisher using the free-text filter in our portal, we get the expected result, but when we search using the publisher filter we don't get any hits:

Free text search:

[Screenshot, 2022-07-05, 9:44:11 a.m.]

Publisher search:

[Screenshot, 2022-07-05, 9:44:29 a.m.]

PS: this probably belongs in another issue, but is there a dataset title filter available for the dataset view? When I go to "More" I can't see one.

MortenHofft commented 2 years ago

That is a bug that I introduced recently, and I believe only you see it because you are the only ones using dataset search. I recently changed the default so that publisher suggestions are scoped to what is in the occurrence data scope. But since this publisher has no occurrences, it doesn't show. I've changed that suggest to use the normal publisher suggest. Unfortunately that means it will suggest all publishers in GBIF, just like it did 3 months ago. I will look into that, but it requires some more work, so I just wanted to fix this now.

I've created an issue that might help us scope the suggest https://github.com/gbif/registry/issues/427
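To make the bug described above concrete, here is a minimal sketch of what "scoping publisher suggestions by occurrence data" means. The records, publisher names, and function names are all hypothetical and only illustrate the idea; this is not the actual hosted-portal or GBIF API code.

```javascript
// Hypothetical in-memory data standing in for the portal's occurrence scope.
const occurrences = [
  { publisher: 'Publisher A', dataset: 'Dataset 1' },
  { publisher: 'Publisher A', dataset: 'Dataset 2' },
  { publisher: 'Publisher B', dataset: 'Dataset 3' },
];

// Hypothetical list of every publisher known to the registry.
const allPublishers = [
  'Publisher A',
  'Publisher B',
  'New Publisher (checklist only)',
];

// Case-insensitive substring filter shared by both suggest variants.
function filterByQuery(names, query) {
  const q = query.toLowerCase();
  return names.filter((name) => name.toLowerCase().includes(q));
}

// Scoped suggest: only publishers that appear in the occurrence data.
function suggestScopedByOccurrences(query) {
  const inScope = new Set(occurrences.map((o) => o.publisher));
  return filterByQuery([...inScope], query);
}

// Unscoped suggest: every known publisher, with or without occurrences.
function suggestAll(query) {
  return filterByQuery(allPublishers, query);
}

console.log(suggestScopedByOccurrences('new')); // []
console.log(suggestAll('new')); // [ 'New Publisher (checklist only)' ]
```

A publisher whose only dataset is a checklist never shows up in the occurrence records, so the scoped variant cannot suggest it, while the unscoped variant suggests it along with every other publisher.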

MortenHofft commented 2 years ago

This issue https://github.com/gbif/registry/issues/421 becomes a real nuisance in this specific case.

camiplata commented 2 years ago

Some ideas:

  1. For the dataset scope, and especially for checklists, other criteria could be added to avoid these issues, for example adding all datasets from Colombian publishers; this would add all datasets regardless of their core. I thought that was already taken into account in config.js with the predicate:

    {
      "key": "publishingCountry",
      "type": "equals",
      "value": "CO"
    }

    But from the behavior we are seeing, I realize this criterion is based on the occurrences and not on the publisher.

  2. For checklists, the scope could also be built on other DwC elements, such as countryCode on the speciesProfile taxon core extension. This would allow us, for example, to include important national checklists such as GRIIS, which are created by Colombian experts but published by international organizations, or Plazi checklists related to Colombia. For Plazi, though, I think they are adding the distribution extension, but it may be empty for some checklists; I'll open an issue for that on the GBIF portal as a suggestion.

I hope these ideas help; if not, let me know and I will try to look for other strategies.

MortenHofft commented 2 years ago

    {
      "key": "publishingCountry",
      "type": "equals",
      "value": "CO"
    }

This is part of the configuration for occurrences and will include all datasets with occurrences, no matter the core. Dataset search can be configured very differently. Your dataset search is configured to show datasets published from Colombia (https://github.com/gbif/hp-colombian-biodiversity/blob/master/_includes/js/config.js#L49), unlike your occurrence search, which includes data from any publisher as long as the point has coordinates in Colombia.
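As a rough illustration of why the same `publishingCountry` condition behaves differently depending on what it is applied to, the sketch below evaluates the predicate against occurrence records and against dataset records. The `matches` function is a toy evaluator written for this example, and all records and keys (`d1`, `d2`, `d3`) are hypothetical; it is not how the GBIF API evaluates predicates.

```javascript
// Toy evaluator for a small subset of predicate types (assumption: plain
// objects with "equals", "and", and "or" are enough for illustration).
function matches(predicate, record) {
  switch (predicate.type) {
    case 'equals':
      return record[predicate.key] === predicate.value;
    case 'and':
      return predicate.predicates.every((p) => matches(p, record));
    case 'or':
      return predicate.predicates.some((p) => matches(p, record));
    default:
      throw new Error(`Unsupported predicate type: ${predicate.type}`);
  }
}

const publishedFromColombia = {
  key: 'publishingCountry',
  type: 'equals',
  value: 'CO',
};

// Hypothetical occurrence records; each one points back to its dataset.
const occurrenceRecords = [
  { datasetKey: 'd1', publishingCountry: 'CO' },
  { datasetKey: 'd2', publishingCountry: 'US' },
];

// Hypothetical dataset records; d3 is a checklist with no occurrences.
const datasetRecords = [
  { key: 'd1', publishingCountry: 'CO' },
  { key: 'd2', publishingCountry: 'US' },
  { key: 'd3', publishingCountry: 'CO' },
];

// Datasets reachable by filtering occurrences.
const viaOccurrences = new Set(
  occurrenceRecords
    .filter((o) => matches(publishedFromColombia, o))
    .map((o) => o.datasetKey)
);

// Datasets matched by filtering dataset records directly.
const viaDatasets = new Set(
  datasetRecords
    .filter((d) => matches(publishedFromColombia, d))
    .map((d) => d.key)
);

console.log([...viaOccurrences]); // [ 'd1' ]
console.log([...viaDatasets]); // [ 'd1', 'd3' ]
```

Applied to occurrences, the predicate can only ever surface datasets that have at least one occurrence, which is why a checklist-only dataset such as `d3` is invisible there even though it is published from Colombia.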

For dataset search we rely on the features of that API, and it does not allow us to include all datasets that either have, e.g., a single occurrence point in Colombia or are published from Colombia (including datasets without occurrences).

Another challenge is suggest. When suggesting publishers, for example, we rely on optimised suggest endpoints that respond quickly and have some degree of fuzzy matching on the text input. But those do not allow us to do any filtering of the suggestions. That is the same problem we had in occurrence search, where a dataset suggest would suggest all possible datasets known to the GBIF API. I found a way to circumvent that issue by creating APIs specific to this project and by accepting worse matching, but at least they ensure that the suggestions are part of the data for that website.

So previously a search for "Nacional Herbario" would return a suggestion for "Herbario Nacional Colombiano (COL)", but now it will not. It now requires the text to be more similar (you will have to use the correct ordering, "Herbario Nacional"). That is a shame, but that was the cost of limiting the suggestions to the datasets that are actually in scope for the individual sites. So on the positive side, you will no longer see suggestions for "CUBA:Herbario del Jardín Botánico Nacional, La Habana, Cuba: HAJB-Pteridophyta".
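The trade-off between the two matching behaviours can be sketched as follows. Both functions are simplified stand-ins written for this example (an order-insensitive token match versus an ordered substring match); neither is the actual matching logic of the GBIF suggest endpoints or of the project-specific APIs.

```javascript
// Lowercase the text and split on anything that is not a letter or digit.
function tokens(text) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(Boolean);
}

// Looser match: every query token must be a prefix of some title token,
// regardless of word order.
function looseMatch(query, title) {
  const titleTokens = tokens(title);
  return tokens(query).every((q) => titleTokens.some((t) => t.startsWith(q)));
}

// Stricter match: the normalized query must appear in the normalized title
// as one contiguous run, so word order matters.
function strictMatch(query, title) {
  return tokens(title).join(' ').includes(tokens(query).join(' '));
}

const title = 'Herbario Nacional Colombiano (COL)';

console.log(looseMatch('Nacional Herbario', title)); // true
console.log(strictMatch('Nacional Herbario', title)); // false
console.log(strictMatch('Herbario Nacional', title)); // true
```

With the stricter match the swapped word order no longer finds the dataset, which mirrors the behaviour described above: less forgiving matching in exchange for suggestions that can be limited to the site's own data.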

camiplata commented 2 years ago

Thank you for the explanation.

In response to:

For dataset search we rely on the features of that API, and it does not allow us to include all datasets that either have, e.g., a single occurrence point in Colombia or are published from Colombia (including datasets without occurrences).

I understand it is not possible now, but could it be something to work on during the second phase of the hosted portals? I think this is something that other portals will also need.

As for the second idea, which I now see more as a need, do you have any thoughts?

allow us for example to include important national checklists such as GRIIS that are created by Colombian experts but published by international organizations, or Plazi checklist related to Colombia.