gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Expanding LifeStage vocabulary #488

Open djtfmartin opened 3 years ago

djtfmartin commented 3 years ago

ALA would like to add a number of terms to expand the LifeStage vocabulary. There are over 1,700 values in use in ALA, many of which possibly can't be mapped, but we'd like to know where to start with updating the vocab to include commonly used terms such as "Pup". No one at ALA currently has editing rights for the vocabulary hosted in registry.gbif.org.

timrobertson100 commented 3 years ago

The way I'd recommend approaching this is similar to how we created the first edition.

In the original spreadsheet you will find several sheets.

The verbatim values sheet describes the original data values seen used on a significant number of records, or by 5 datasets (chosen to keep volumes manageable).

I'd suggest processing your data against the existing GBIF vocabulary which presumably captures most things, and then for things that do not map to anything (perhaps limited by N datasets using it so you have some idea it is a standard use) go through the process of identifying why. As you go through that list there are 2 things you need to capture:

  1. Concepts that you find are missing completely, for which we need a new term and definition
  2. Synonyms missing (e.g. pup is presumably some alternative of Juvenile?) which can be captured in a 2 column spreadsheet
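
The triage described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `VOCABULARY` excerpt, the `(dataset_key, verbatim_value)` record shape, and the function names are all hypothetical and are not taken from the pipelines code or the registry. It normalises verbatim values, drops anything the vocabulary already covers, and keeps unmapped values used by at least `min_datasets` datasets as candidates for review.

```python
from collections import defaultdict

# Hypothetical excerpt of a vocabulary: concept -> known labels/synonyms.
# The real LifeStage vocabulary is not reproduced here.
VOCABULARY = {
    "Juvenile": {"juvenile", "juv", "immature"},
    "Adult": {"adult", "ad"},
    "Larva": {"larva", "larvae"},
}


def normalise(value):
    """Lower-case the value and collapse runs of whitespace."""
    return " ".join(value.strip().lower().split())


def build_lookup(vocabulary):
    """Map every normalised concept name and label to its concept."""
    lookup = {}
    for concept, labels in vocabulary.items():
        lookup[normalise(concept)] = concept
        for label in labels:
            lookup[normalise(label)] = concept
    return lookup


def unmapped_values(records, vocabulary, min_datasets=5):
    """Return {value: dataset_count} for values the vocabulary cannot map.

    records is an iterable of (dataset_key, verbatim_value) pairs
    (an assumed shape). Values used by fewer than min_datasets
    distinct datasets are dropped to keep the review list manageable.
    """
    lookup = build_lookup(vocabulary)
    datasets_by_value = defaultdict(set)
    for dataset_key, verbatim in records:
        value = normalise(verbatim)
        if value and value not in lookup:
            datasets_by_value[value].add(dataset_key)
    candidates = {
        value: len(datasets)
        for value, datasets in datasets_by_value.items()
        if len(datasets) >= min_datasets
    }
    # Most widely used values first, then alphabetical.
    return dict(sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0])))
```

Each surviving value would then be reviewed by hand and recorded either as a missing concept (new term plus definition) or as a row in the two-column synonym spreadsheet; the code only produces the candidate list and does not make that decision.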

We had an external review last time from the TDWG group and we'd probably review changes again before applying. We can then coordinate an import and release process which is something we haven't yet tested and why the vocabulary server is not open for editing.

Is that doable do you think?

tucotuco commented 3 years ago

@pzermoglio is also following up on this process. We're in a similar situation with the outcome of the North American Ornithological Conference vocabulary resolution exercise with sex, lifestage, and preparations (which doesn't have controlled vocabulary as a recommendation, but they feel a great need for it). Do we need to queue the lifestage reconciliations?

timrobertson100 commented 3 years ago

Thanks @tucotuco

@djtfmartin - we have a meeting tentatively next week to start looking at the outcomes of that with @pzermoglio. Is there a data manager at ALA who could explore what I outline above in the next few days and perhaps join that meeting so we can discuss how this process should work going forward?

djtfmartin commented 3 years ago

Thanks @timrobertson100 @tucotuco

We'll have a look at the spreadsheet and I'm sure one of us can join the call. The time of the call will dictate who can join.

Apologies if this is covering old ground, but I thought we planned to use registry.gbif.org to manage community-driven edits/updates to vocabularies. Is this no longer the case, and do we need to do this outside of this tool?

I also wondered whether, if we do use registry.gbif.org, we (ALA, Living Atlases) would be able to add values and have them marked as "draft" initially. GBIF could then choose to avoid using draft synonyms/concepts, but other installations of pipelines could use them with certain config settings. This might be a good way of sourcing terms for consideration.
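
To make the "draft" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the `Label` record, the `draft` flag, the `include_drafts` setting, and the mapping of "pup" to Juvenile are assumptions for illustration only and do not describe anything the registry or pipelines provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """A hypothetical vocabulary label; draft marks unreviewed additions."""
    text: str
    concept: str
    draft: bool = False


# Illustrative data only; the pup -> Juvenile mapping is an assumption.
LABELS = [
    Label("juvenile", "Juvenile"),
    Label("adult", "Adult"),
    Label("pup", "Juvenile", draft=True),
]


def build_lookup(labels, include_drafts=False):
    """Build a label -> concept lookup, skipping drafts unless enabled."""
    return {
        label.text.lower(): label.concept
        for label in labels
        if include_drafts or not label.draft
    }


def interpret(verbatim, lookup):
    """Return the concept for a verbatim value, or None if unmapped."""
    return lookup.get(verbatim.strip().lower())


released_only = build_lookup(LABELS)                      # ignores drafts
with_drafts = build_lookup(LABELS, include_drafts=True)   # opts in to drafts

print(interpret("Pup", released_only))  # None
print(interpret("Pup", with_drafts))    # Juvenile
```

With a single switch like this, one installation could ignore draft terms while another opts in, without maintaining a separate local vocabulary.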

timrobertson100 commented 3 years ago

Thanks - that is the plan but we're just not quite there yet with the process. It is still the approach I'd recommend with the pipeline adoption. It would be good if a data-manager/domain-specialist from the ALA team could be identified in the coming weeks to participate in shaping the process.

javier-molina commented 3 years ago

cc @charvolant

djtfmartin commented 3 years ago

> I also wondered if we do use registry.gbif.org if we (ALA, Living Atlases) would be able to add values and have them marked as "draft" initially.

Does the registry support a notion of a value being "draft"? And if so, could we expose these draft terms in the WS for vocabularies?

timrobertson100 commented 3 years ago

It doesn't at the moment I'm afraid. At this point, it's the released version of the vocabulary or you should drop in a different transformer for LifeStage.

timrobertson100 commented 3 years ago

Where is your current vocab or mapping file, please @djtfmartin? I presume it is not src/main/resources/lifeStage.txt.

It would be useful to know how different we are already, and the frequency of changes you're used to seeing in this. I'd expect GBIF could turn around a modest set of changes within days with a data review (coordinated by the TDWG group) but without knowing how different they are it's difficult to know what's a sensible option to recommend.

djtfmartin commented 3 years ago

We aren't doing much more than this in production for lifeStage. I was asking the questions with other vocabularies in mind, and trying to get an idea of the current state of play so we can set expectations about what we can do to update vocabs.

We had planned on sourcing all (present and future) vocabularies from registry.gbif.org and hoped to use the collaborative editing functionality of registry.gbif.org. I'd like us to move away from having separately configured vocabs in each of the LA installations, but if it's deemed to be too hard or too slow to get changes through, people will resort to local configs.