`glosario` is an open source glossary of terms used in data science
that is available online and also as a library in both R and Python.
By adding glossary keys to a lesson's metadata,
authors can indicate what the lesson teaches,
what learners ought to know before they start,
and where they can go to find that knowledge.
Authors can also use the library's functions
to insert consistent hyperlinks for terms and definitions in their lessons
in any of several (human) languages.
We developed Glosario to advance data science knowledge and accessibility for our diverse community. You do not need to know any programming language to contribute: anyone with a basic familiarity with the GitHub web interface can get involved.

We have prepared a detailed and accessible guide for contributing, which has been translated into several languages. Contributions are welcome in any language, not only those represented in that guide. If you need help with your contribution, feel free to ask questions on the #glosario Slack channel (if you are not a member of The Carpentries Slack, you can join by filling out this form).
R Markdown and Jupyter Notebooks allow authors to place structured metadata in files. We propose the following metadata (written as YAML):

    glossary:
      sources:
        - http://some_glossary.org/something/
      language: fr
      requires:
        - aggregation_function
        - call_stack
      defines:
        - closure
        - name_collision

- The `sources` key is required.
- The `language` key is required and must be a single ISO 639 language code (e.g., `fr` for French).
- `requires` and `defines` are optional.
  - Terms listed under `requires` may be used without being defined in this lesson (i.e., the lesson author assumes users already know them).
  - Terms listed under `defines` must be hyperlinked in the lesson.
- Link URLs must have the form `GLOSSARY_SITE#glossary_key`, where `GLOSSARY_SITE` is one of the sites listed under the `sources` key and `glossary_key` is an exact match for one of the `defines` keys.

We will provide simple tools so that all of the terms listed in a lesson's metadata are linked correctly in its body. We will also provide shortcuts that make it easy to create correctly formatted links, so that authors can write things like:
The computer uses a `r link('call stack', 'call_stack')` to keep track of function calls.
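The metadata rules and link format described above can be sketched in Python. This is a minimal illustration, not the project's implementation: it assumes the YAML header has already been parsed into a dictionary, and the names `validate_metadata` and `make_link` are hypothetical. Treating a missing `glossary` key as valid and requiring `sources` to be non-empty are also assumptions.

```python
# Hypothetical helpers; `meta` is assumed to be a lesson's YAML header
# already parsed into nested dictionaries and lists.

def validate_metadata(meta):
    """Return a list of problems found in a lesson's `glossary` metadata."""
    problems = []
    glossary = meta.get("glossary")
    if glossary is None:
        return problems  # assumption: lessons without a glossary key are fine
    if not isinstance(glossary, dict):
        return ["'glossary' must be a mapping"]
    sources = glossary.get("sources")
    if not isinstance(sources, list) or not sources:
        problems.append("'sources' is required and must be a non-empty list")
    language = glossary.get("language")
    if not isinstance(language, str) or not language:
        problems.append("'language' is required and must be a single language code")
    for key in ("requires", "defines"):
        if not isinstance(glossary.get(key, []), list):
            problems.append(f"'{key}' must be a list when present")
    return problems


def make_link(text, key, meta, site_index=0):
    """Build a Markdown link of the form GLOSSARY_SITE#glossary_key."""
    site = meta["glossary"]["sources"][site_index]
    return f"[{text}]({site}#{key})"


lesson = {
    "glossary": {
        "sources": ["http://some_glossary.org/something/"],
        "language": "fr",
        "requires": ["aggregation_function", "call_stack"],
        "defines": ["closure", "name_collision"],
    }
}

print(validate_metadata(lesson))
print(make_link("call stack", "call_stack", lesson))
```

Returning a list of problems rather than raising on the first one lets a checking tool report everything wrong with a header in a single pass.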
Any site where glossary URLs resolve can be used as a glossary. As a working model, this project implements a glossary of terms used in data science and data engineering.

- The glossary is stored in `glossary.yml`. Its format is described below.
- The glossary is also available as a Python package called `glosario` and an R package with the same name.

A glossary entry is structured like this:

    - slug: cran
      ref:
        - base_r
        - tidyverse
      en:
        term: "Comprehensive R Archive Network"
        acronym: "CRAN"
        def: >
          A public repository of R [packages](#package).

- The `slug` key identifies the entry.
- An entry may have a `ref` key. If it is present, its value must be a list of identifiers of related terms in this glossary.
- Every other top-level key is a language code such as `en` or `fr`. Under each language:
  - `term` is the term being defined. This key must be present.
  - `acronym` is optional. If present, its value is the acronym for this term.
  - `def` is the definition. This key must be present, and the value may contain local links to other terms in this glossary (i.e., links starting with `#`) and/or links to outside sources.

Should we provide one function for interactive definition lookup that searches keys and terms, a separate function for each, or some kind of keyword arguments to control the scope of search?
Should we integrate definition lookup with existing help systems?
For example,
should `define('something')`
in RStudio put the definition in the help pane
(and if so, should it hyperlink to terms that the definition depends on)?
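
One of the options raised above, a single lookup function with a keyword argument that controls the search scope, might look like the following Python sketch. The name `define` comes from the question above; its signature, the `scope` values, and the in-memory glossary are assumptions made for illustration, not a decided design.

```python
# A tiny in-memory glossary shaped like a parsed glossary file (assumed).
GLOSSARY = [
    {
        "slug": "cran",
        "ref": ["base_r", "tidyverse"],
        "en": {
            "term": "Comprehensive R Archive Network",
            "acronym": "CRAN",
            "def": "A public repository of R [packages](#package).",
        },
    },
]


def define(query, language="en", scope="both", glossary=GLOSSARY):
    """Return definitions whose key and/or term matches `query`.

    scope="keys" compares against slugs, scope="terms" searches inside
    terms, and scope="both" accepts either kind of match.
    """
    if scope not in ("keys", "terms", "both"):
        raise ValueError(f"unknown scope: {scope!r}")
    needle = query.lower()
    matches = []
    for entry in glossary:
        translation = entry.get(language)
        if translation is None:
            continue  # this entry has no definition in the requested language
        key_hit = scope in ("keys", "both") and needle == entry["slug"].lower()
        term_hit = scope in ("terms", "both") and needle in translation["term"].lower()
        if key_hit or term_hit:
            matches.append(translation["def"])
    return matches


print(define("cran"))
print(define("archive", scope="terms"))
```

A single function keeps the interactive interface small; separate functions for keys and terms would make each call more explicit at the cost of a larger API.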
Linking to a definition.

- An author writing a lesson in Spanish has set the `glossary/language` key in the YAML header, but has not changed any other settings.
- She adds `r gdef('linear-model', 'Linear models')` to her lesson.
- The rendered lesson contains `<a href="http://carpentries.org/glossary/es/#linear-model" class="glossary-definition">Linear Models</a>`.
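
The link generation in this use case can be sketched in Python as follows. This `gdef` is a hypothetical analogue of the R call above, not the package's actual function; the default site is taken from the example output, and treating it as a default is an assumption.

```python
from html import escape

# Site taken from the example output above; using it as a default is an assumption.
DEFAULT_SITE = "http://carpentries.org/glossary"


def gdef(key, text, language="en", site=DEFAULT_SITE):
    """Return an HTML link to the definition of `key` in the given language."""
    url = f"{site}/{language}/#{key}"
    # Escape both values so that unusual characters cannot break the markup.
    return f'<a href="{escape(url)}" class="glossary-definition">{escape(text)}</a>'


print(gdef("linear-model", "Linear Models", language="es"))
```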
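Checking a lesson.

- An author has listed the terms her lesson defines under the `glossary/defines` key.
- She has linked to those definitions in the lesson body using `gdef(...)`.
- She runs a check to confirm that every term listed in `glossary/defines` is referenced in the document body, and that every term referenced in the document body is mentioned in `glossary/defines`.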
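
The two-way check described here amounts to comparing two sets. The sketch below is one possible approach, assuming that references appear in the body as `gdef('key', ...)` calls and that the `glossary/defines` list has already been read from the YAML header; `check_lesson` is a hypothetical name.

```python
import re

# Captures the first argument of calls such as gdef('linear-model', 'Linear models').
GDEF_CALL = re.compile(r"gdef\(\s*['\"]([^'\"]+)['\"]")


def check_lesson(defines, body):
    """Compare the terms under glossary/defines with gdef(...) calls in the body."""
    referenced = set(GDEF_CALL.findall(body))
    defined = set(defines)
    return {
        "defined_but_unreferenced": sorted(defined - referenced),
        "referenced_but_undefined": sorted(referenced - defined),
    }


body = (
    "A `r gdef('closure', 'closure')` captures variables; "
    "see also the `r gdef('call_stack', 'call stack')`."
)
report = check_lesson(["closure", "name_collision"], body)
print(report)
```

A lesson passes when both lists in the report are empty.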
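Finding lessons.

- An author writes a lesson, adds a `glossary` key to its YAML metadata, and indicates that the lesson requires the term `correlation` and defines the term `regression`.
- Later, an instructor wants to find lessons that teach `regression`.
- She uses `rmarkdown::yaml_front_matter(filename)` to read metadata from all of the lessons she has archived.
- She selects the lessons whose metadata defines `regression`.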
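
Once the metadata has been read, the search itself is a simple filter. The Python sketch below assumes each lesson's YAML header has already been parsed into a dictionary keyed by file name; `find_lessons` and the file names are hypothetical.

```python
def find_lessons(lessons, term, field="defines"):
    """Return the names of lessons whose glossary metadata lists `term` under `field`."""
    return sorted(
        name
        for name, meta in lessons.items()
        if term in meta.get("glossary", {}).get(field, [])
    )


# Metadata as it might look after reading each lesson's YAML header (hypothetical files).
archive = {
    "stats-intro.Rmd": {"glossary": {"requires": ["correlation"], "defines": ["regression"]}},
    "plotting.Rmd": {"glossary": {"defines": ["scatter_plot"]}},
    "notes.Rmd": {},  # a lesson without glossary metadata
}

print(find_lessons(archive, "regression"))
print(find_lessons(archive, "correlation", field="requires"))
```

Passing `field="requires"` answers the complementary question of which lessons assume a term is already known.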
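Summarizing a lesson.

- An author has written a lesson that defines the terms `correlation` and `causation`.
- She adds a call to `glosario::summarize_terms()`.
- When the lesson is rendered, the call inserts a definition list (`dl`) at that point. Its entries are the definitions of all of the terms listed under the `glossary/defines` key in the page's YAML header, in alphabetical order by term according to the rules for `glossary/language`.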
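
A rough Python sketch of this summary step follows. It is not the behavior of `glosario::summarize_terms()` itself: the function signature and the glossary layout are assumptions, the definitions are placeholders, and `str.casefold` stands in for the language-specific sorting rules.

```python
from html import escape


def summarize_terms(defines, glossary, language="en"):
    """Build an HTML definition list for the keys in `defines`.

    Sorting by str.casefold is a simplification; real language-specific
    collation would need locale-aware comparison.
    """
    entries = [glossary[key][language] for key in defines]
    entries.sort(key=lambda entry: entry["term"].casefold())
    rows = [
        f"  <dt>{escape(entry['term'])}</dt>\n  <dd>{escape(entry['def'])}</dd>"
        for entry in entries
    ]
    return "<dl>\n" + "\n".join(rows) + "\n</dl>"


# Placeholder definitions keyed by slug, for illustration only.
glossary = {
    "correlation": {"en": {"term": "correlation", "def": "Placeholder definition of correlation."}},
    "causation": {"en": {"term": "causation", "def": "Placeholder definition of causation."}},
}

html_list = summarize_terms(["correlation", "causation"], glossary)
print(html_list)
```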
Why not just link to Wikipedia? We expect that many glossary definitions will do so. However, Wikipedia articles provide explanations, not definitions.
YAML is hard for people to edit, so why not use something else for the glossary file? Because other formats are just as hard to edit (e.g., JSON) or make one-to-many relationships hard to express (e.g., CSV).
Why use Jekyll for the online version? It is the default for GitHub Pages.
SADiLaR is one of the collaborators in the finalisation and expansion of the Glosario Project to African Languages. SADiLaR is a research infrastructure established by the Department of Science and Innovation of the South African government as part of the South African Research Infrastructure Roadmap (SARIR).
We are pleased to share that the Andrew W. Mellon Foundation approved a grant for use over 12 months (November 2023 through October 2024) to support an upgrade to Glosario.