carpentries / glosario

A multilingual glossary for computing and data science terms.
https://glosario.carpentries.org

Local sync of glossary.yml #27

Open zkamvar opened 4 years ago

zkamvar commented 4 years ago

In https://github.com/carpentries/glosario-r/pull/11, the glossary is kept current by a GitHub Action that updates the package's internal copy daily.

In https://github.com/carpentries/glosario-py/issues/1, it is suggested to remove the glossary.yml file from the repo and have it dynamically built.

I think we should work out how to let people synchronize the glossary locally so that we can decouple the data from the API.

My thoughts on this are largely centered around the patterns I see from R users and reproducibility (I admit that I do not know much about the Python side of things):

  1. people don't update their packages that often or don't know how to update their packages.
  2. there is no clear indicator of what version of the glossary people have on their machines, so if it defaults to the one in the package, then definitions that exist in the global glossary may be missing in their local version.
  3. if {glosario} is released on CRAN, it will be updated at most every two months (as per CRAN's policy), but the main glosario repository will be constantly updating.

These situations mean that if Belle installs the package on March 4th and Sebastian installs the package on July 17th, they will have two different versions of the glossary on their machines. Let's say they contribute a few new definitions to the main glossary on July 16th. Neither of them sees these definitions in their lessons, because the package was last pushed to CRAN on July 1st.

I think it would be good to consider these situations before we release this to CRAN and coordinate with the Python implementation so that we can reduce the friction that users see.

_Modified from a comment originally posted by @zkamvar in https://github.com/carpentries/glosario-r/pull/11#discussion_r460106889_

gvwilson commented 4 years ago

The glossary (file/package) needs to include a version number or, more readably, a publish timestamp to make work fully reproducible.

zkamvar commented 4 years ago

An idea: if one of the purposes of this repo is to serve as a tool for teaching Git, we could use the commit hash as the ID. When the user asks to update the file, we grab the latest commit hash from GitHub and compare it to the one the user has (this uses minimal bandwidth). If it matches, we do nothing because they are up-to-date. If it doesn't match, then we download the new version and replace the hash.

Because this is dynamically generated, we can just prepend the hash to the source file as a YAML entry, which also means that we can include a timestamp for that commit to improve readability.
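
A minimal Python sketch of the hash comparison described above, under a few assumptions that are not part of the proposal: the hash and timestamp are stored as a leading YAML comment line (rather than a real glossary entry, so the file still parses as the same list), and the two network steps are passed in as hypothetical callables `fetch_latest` and `fetch_glossary` instead of being tied to a specific GitHub endpoint. The names `HEADER_PREFIX`, `read_local_hash`, and `sync_glossary` are illustrative only.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple

# Hypothetical header format: "# glossary-commit: <sha> <timestamp>"
HEADER_PREFIX = "# glossary-commit: "


def read_local_hash(path: Path) -> Optional[str]:
    """Return the commit hash recorded on the first line, or None if absent."""
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as fh:
        first_line = fh.readline().rstrip("\n")
    if not first_line.startswith(HEADER_PREFIX):
        return None
    parts = first_line[len(HEADER_PREFIX):].split()
    return parts[0] if parts else None


def sync_glossary(
    path: Path,
    fetch_latest: Callable[[], Tuple[str, str]],
    fetch_glossary: Callable[[str], str],
) -> bool:
    """Update the local glossary only when the remote commit hash differs.

    fetch_latest() returns (sha, timestamp) for the newest remote commit.
    fetch_glossary(sha) returns the glossary text at that commit.
    Returns True if the local file was replaced, False if it was up to date.
    """
    sha, timestamp = fetch_latest()  # cheap request: metadata only
    if read_local_hash(path) == sha:
        return False  # hashes match, nothing to download
    body = fetch_glossary(sha)  # full download only when needed
    path.write_text(f"{HEADER_PREFIX}{sha} {timestamp}\n{body}", encoding="utf-8")
    return True
```

In a real package, `fetch_latest` would ask GitHub for the most recent commit that touched `glossary.yml` and `fetch_glossary` would download the file at that commit. The recorded timestamp also serves as the readable version marker suggested earlier in the thread.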