ManonGros commented 3 years ago

Here is a file to edit: https://drive.google.com/file/d/1wQ21ShfKNRrJ8VJMbNzbgHzNCHEn7laf/view?usp=sharing

It contains:

a list of the values already mapped to some concepts (they are all in the Hidden sheet/tab for now)
the GBIF verbatim values for this field that appear more than 10,000 times or in 5 or more datasets
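
The selection rule for the verbatim values can be sketched as below. This is only an illustration of the thresholds stated above; the `VerbatimCount` record and the `is_listed` helper are hypothetical names and not part of any GBIF tooling.

```python
from typing import NamedTuple


class VerbatimCount(NamedTuple):
    """Hypothetical summary row for one verbatim value of the field."""
    value: str
    occurrences: int  # number of records carrying this verbatim value
    datasets: int     # number of datasets in which the value appears


def is_listed(row: VerbatimCount,
              min_occurrences: int = 10_000,
              min_datasets: int = 5) -> bool:
    # "more than 10,000 times" is a strict bound,
    # "5 or more datasets" is an inclusive one.
    return row.occurrences > min_occurrences or row.datasets >= min_datasets


print(is_listed(VerbatimCount("English", 12, 5)))  # True: appears in 5 datasets
```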

NB: For this vocabulary, please add the concepts by using the language enumeration: https://api.gbif.org/v1/enumeration/language

Please check the instructions here: https://github.com/gbif/vocabulary/issues/70

ahahn-gbif commented 3 years ago

Could you give me access to this one, please? Thanks!

ahahn-gbif commented 3 years ago

@timrobertson100: I assume that we primarily use this in the publisher and dataset metadata. Are there other uses of this vocabulary that need to be taken into account?

ahahn-gbif commented 3 years ago

32 (closed issue): what is the scope of the language vocabulary?

timrobertson100 commented 3 years ago

@timrobertson100: I assume that we primarily use this in the publisher and dataset metadata. Are there other uses of this vocabulary that need to be taken into account?

I don't think so, no

tucotuco commented 3 years ago

Do you not intend for this to be used beyond GBIF's needs? Codes for vernacular name languages were already identified as a major use case. In addition, the Occurrence Core, Event Core, and Audubon Core have dc:language at the record level. Audubon Core also has dcterms:language, metadataLanguageLiteral, and metadataLanguage at the record level.

timrobertson100 commented 3 years ago

the Occurrence Core, Event Core, and Audubon Core have dc:language at the record level. Audubon Core also has dcterms:language, metadataLanguageLiteral, and metadataLanguage at the record level.

Good point. I'd assume it drives those as any existing dictionary file likely does.

(The immediate priority is on interpretation needs in GBIF/ALA pipelines)

ahahn-gbif commented 3 years ago

Some assumptions to verify before starting:

ISO 639-1 (https://api.gbif.org/v1/enumeration/language: ISO 639-1 and 639-2) provides sufficient granularity to provide the concepts
we would not want English-language language names as concepts, but rather neutral entities (ISO 2-letter codes)
English-language names serve as labels, not as concepts, just like their Spanish etc. equivalents (and native titles?)
national/regional variants like "es-AR" (http://www.lingoes.net/en/translator/langcode.htm) are mapped as hidden values and interpreted to the 2-letter code
multi-value verbatim data ("en | ru"), e.g. for dataset descriptions containing text in both languages: handling unclear. For GBIF use cases (finding a description in Russian) it might be best to allow explicit mixed-content Concept definitions in a standardized syntax

@tucotuco, @timrobertson100, does any of this already raise alarm around use cases you are aware of?
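
To make the assumptions about regional variants and multi-value data concrete, here is a minimal sketch of how such verbatim values might be interpreted. It is an assumption-laden illustration, not the GBIF/ALA pipeline code: `KNOWN_CODES` is a hypothetical hand-picked subset of 2-letter codes, `interpret_language` is a made-up helper name, and the accepted separators are a guess.

```python
import re
from typing import List

# Hypothetical subset of 2-letter concept codes, for illustration only.
KNOWN_CODES = {"de", "en", "es", "fr", "ru"}


def interpret_language(verbatim: str) -> List[str]:
    """Map a verbatim value to zero or more 2-letter codes.

    Regional variants such as "es-AR" are reduced to their primary
    subtag; multi-value strings such as "en | ru" yield several codes.
    """
    codes: List[str] = []
    for part in re.split(r"[|;,]", verbatim):
        tag = part.strip().lower().replace("_", "-")
        if not tag:
            continue
        primary = tag.split("-", 1)[0]  # "es-ar" -> "es"
        if primary in KNOWN_CODES and primary not in codes:
            codes.append(primary)
    return codes


print(interpret_language("es-AR"))    # ['es']
print(interpret_language("en | ru"))  # ['en', 'ru']
```

Returning a list keeps the mixed-content case open: whether "en | ru" ends up as two concepts or as one explicit mixed-content concept is exactly the question left unclear above.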

tucotuco commented 3 years ago

It looks good except that I suspect ISO 639-2 is insufficient for all known purposes in our community, especially ethnobiology. If you want, I can try to get a confirmation of that from Jonathan Amith, linguist and progenitor of DEMCA (https://demca.mesolex.org/portal/).

(In reply to Andrea Hahn's comment above, dated Tue, Apr 20, 2021 at 12:27 PM.)

MattBlissett commented 3 years ago

ISO 639-3 might be needed, but I haven't investigated myself.

https://en.wikipedia.org/wiki/ISO_639-3#Usage has some links to other language-related systems, several depending on ISO 639.

ahahn-gbif commented 3 years ago

Thanks, both! For practical purposes, that sounds as though we will eventually need multiple levels of granularity in the concepts list, with explicit parent declarations, rather than a single-level flat list. @marcos-lg, are hierarchical vocabularies something already covered, or would that add more complexity than we want to handle in the first phase, please?

ahahn-gbif commented 3 years ago

(from TimR via Skype): "LifeStage is in production and is an example of a hierarchical vocabulary. It’s intended for 1 (maybe 2) levels deep. I’d advise anything more complex needs thought."
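
For discussion, a one-level hierarchy with explicit parent declarations could look roughly like the sketch below. This is not how the vocabulary server models concepts; the `Concept` class and the `lineage` and `depth` helpers are hypothetical, and the zh/cmn/yue entries are just an illustrative example of ISO 639-3 codes placed under a 2-letter parent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Concept:
    name: str                     # neutral code used as the concept name
    parent: Optional[str] = None  # explicit parent declaration, if any


# Illustrative entries only: Mandarin and Cantonese under Chinese.
CONCEPTS: Dict[str, Concept] = {
    c.name: c
    for c in (
        Concept("zh"),
        Concept("cmn", parent="zh"),
        Concept("yue", parent="zh"),
    )
}


def lineage(name: str, concepts: Dict[str, Concept]) -> List[str]:
    """Return the path from a concept up to its top-level ancestor."""
    path: List[str] = []
    current: Optional[str] = name
    while current is not None:
        if current in path:
            raise ValueError(f"cycle in parent declarations at {current!r}")
        path.append(current)
        current = concepts[current].parent
    return path


def depth(name: str, concepts: Dict[str, Concept]) -> int:
    """Number of levels between a concept and its top-level ancestor."""
    return len(lineage(name, concepts)) - 1


print(lineage("cmn", CONCEPTS))  # ['cmn', 'zh']
```

A check such as `depth(name, CONCEPTS) <= 1` would be one way to keep the list within the 1 (maybe 2) levels mentioned above.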

Language - curation before uploading first vocabulary version #77