Versioning - Githubissues

coreylawrence commented 5 years ago

In response to discussions regarding versioning, particularly as it pertains to our description of the dataset in the in-prep manuscript, we have agreed to use at least two counting systems. First, official "stable" releases will be indexed with whole numbers and will correspond to the splitting of a new stable branch in the repository AND the issuing of a new dataset DOI. Second, interim dataset changes will be indexed using the GitHub repository commit identifier.

Here is how I have tentatively described this in the text of the manuscript:

In the interim between official releases, updates to the dataset (including ingestion of new data) and database infrastructure are tracked through the GitHub repository commit identifier, which is a unique alphanumeric string reissued any time the repository is modified. When accessing the dataset, users should record the most recent stable version number as well as the commit identifier.

To make this work as stated, we need to provide the version and commit identifier in several locations, including (1) a print-out on the web interface page where files are downloaded, (2) within the files that are downloaded, and (3) within the data object in the Git repository.

My thought is that we mostly need to track the commit identifier associated with new builds of the database. So in other words, we would need to add code to the build function that reads the current commit identifier and stamps the appropriate files.
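As a rough illustration of the build-time stamping idea, here is a minimal Python sketch. It is not the ISRaD build function (which lives in the R package); the function names and the VERSION.txt file name are hypothetical. It assumes git is on the PATH, and the string it records describes the commit that is checked out when the build runs.

```python
import subprocess
from pathlib import Path


def git_describe(repo_dir="."):
    """Return the output of `git describe --tags --always` for repo_dir.

    --always makes git fall back to the abbreviated commit hash when the
    repository has no tags yet.
    """
    result = subprocess.run(
        ["git", "describe", "--tags", "--always"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def stamp_version(version, out_dir):
    """Write the version string to a (hypothetical) VERSION.txt in out_dir."""
    out_path = Path(out_dir) / "VERSION.txt"
    out_path.write_text(version + "\n", encoding="utf-8")
    return out_path
```

A build step could then call stamp_version(git_describe(), "path/to/build/output"), where the output path is a placeholder.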

This is a little beyond my capacity so hopefully Grey or someone else can spend some time working on this in the near future. Any additional suggestions are also welcomed.

crlsierra commented 5 years ago

Here are a few thoughts on how to achieve this. First, what I mentioned in our meeting yesterday is a feature of Git, not GitHub. Corey, correct the second line of the text with the following: ... through Git's commit identifier

I assume we are using tags to mark the release version. If I use the command git describe --tags on the master branch, I get for the current version:

0.0.3-160-g568fdd2

where 0.0.3 is the most recent tag (the latest stable release), 160 is the number of commits made since that tag, and 568fdd2 (after the g prefix) is the abbreviated hash of the latest commit. I think this is the complete identifier that should be used, because it contains both the latest stable release version number and the latest commit hash. If you want to see more detail about the latest commit, you can type the command git show --summary, which right now outputs:

commit 568fdd2ba1931591ebcc006f4168b8d422fc8e29 (HEAD -> master, origin/master, origin/HEAD)
Merge: bfad9dd d184ce1
Author: Jeff B <jbeemmil@gmail.com>
Date:   Thu Jan 24 10:28:58 2019 +0100

    Merge pull request #157 from AuHau/master

    Correcting metadata for the XLSX viewer (#156)
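For anyone who wants to record the pieces separately, the git describe string quoted above follows the pattern tag-N-gHASH: the most recent tag, the number of commits since that tag, and the abbreviated commit hash prefixed with g. A minimal Python sketch for splitting it follows; the names DescribeInfo and parse_describe are made up for illustration.

```python
import re
from typing import NamedTuple


class DescribeInfo(NamedTuple):
    tag: str                # most recent tag, e.g. the stable release
    commits_since_tag: int  # commits made after that tag
    short_hash: str         # abbreviated commit hash, without the "g" prefix


# Greedy tag match, so tags that themselves contain hyphens still parse.
_DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<hash>[0-9a-f]+)$")


def parse_describe(text):
    """Split a `git describe --tags` string into its three parts."""
    text = text.strip()
    match = _DESCRIBE_RE.match(text)
    if match is None:
        # When HEAD sits exactly on a tag, git prints only the tag name.
        return DescribeInfo(text, 0, "")
    return DescribeInfo(match["tag"], int(match["count"]), match["hash"])


print(parse_describe("0.0.3-160-g568fdd2"))
# DescribeInfo(tag='0.0.3', commits_since_tag=160, short_hash='568fdd2')
```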

The difficult part is how to print this information on the website and in the files that people download. Corey mentions that we could use the build function for this, but the problem is that the build is probably run before you make the latest commit. If you run the function, update versions, and commit, the print-out from the build function would be one commit behind. So I think this won't be a good solution.

We would probably need to use a different tool that adds this information after the commit. I honestly don't know how to do it, and it may be difficult to do. I would simply recommend that if users want to know what specific version they are using, the git commands describe and show would do the job.

coreylawrence commented 5 years ago

Thanks Carlos, your response is extremely helpful. Based on your explanation, I agree that the best way forward for advanced users who access the Git repository directly is to give them the information they need to apply the git commands you mention and identify the version of the database they are using.

In addition, it seems like it should be easy enough to include a function in the ISRaD-R package that returns the same information. That just leaves the users who are downloading data directly from the web interface. For those folks, maybe there is a way to regularly update the html files with the output string from a git describe --tags call?
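One possible way to refresh the html files, sketched in Python under stated assumptions: the page contains a placeholder element such as <span id="israd-version">...</span>. That id is hypothetical and not taken from the actual ISRaD website, and the function name is made up as well.

```python
import html
import re

# Matches the hypothetical placeholder element and whatever it currently holds.
_VERSION_SPAN = re.compile(
    r'(<span id="israd-version">)(.*?)(</span>)', re.DOTALL
)


def update_version_in_html(page, version):
    """Return page with the placeholder's content replaced by version."""
    safe_version = html.escape(version)
    new_page, count = _VERSION_SPAN.subn(
        lambda m: m.group(1) + safe_version + m.group(3), page
    )
    if count == 0:
        raise ValueError("version placeholder not found in page")
    return new_page


page = '<p>Database version: <span id="israd-version">unknown</span></p>'
print(update_version_in_html(page, "0.0.3-160-g568fdd2"))
# <p>Database version: <span id="israd-version">0.0.3-160-g568fdd2</span></p>
```

The version argument would come from a git describe --tags call made after the latest commit, so the page has to be regenerated whenever the repository changes.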

greymonroe commented 5 years ago

I agree with Carlos. I think automating this would be difficult, if not impossible. Someone will have to change the website each time, which is not that much work.

Alternative idea: I know that we decided on the commit version as the means of indicating the database version, but another easy solution is to just ask (require) users to record the date they accessed the data. This has been standard for a long time with web-based resources (i.e., when citing a website you indicate the date accessed). Another example is GBIF.org, which is constantly growing. People just state the date they downloaded the data in their methods section.

greymonroe commented 5 years ago

@crlsierra can you link some examples where people have used this approach for versioning datasets? I seem to remember you mentioned that it is common for certain journals. It would be helpful to see how they executed this.

greymonroe commented 5 years ago

This post explains why tagging the datasets with their most recent commit can't really be done: https://stackoverflow.com/questions/14208272/know-git-hash-before-committing

crlsierra commented 5 years ago

I agree with Grey and the post he shares. It is impossible to tag the datasets with the current commit. This can only be done one commit behind.
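The reason can be shown with a small Python sketch. Git identifies a file's content by the SHA-1 of a short header plus the content, and a commit identifier is derived in turn from those content hashes, so writing a hash into a tracked file changes the very value being recorded. The sample bytes below are made up for illustration.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the SHA-1 object id git assigns to file content (a blob)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Made-up file content standing in for a dataset file.
data = b"example dataset contents\n"
before = git_blob_hash(data)

# Stamping the identifier into the file changes the file...
stamped = data + b"# version: " + before.encode("ascii") + b"\n"
after = git_blob_hash(stamped)

# ...and therefore changes the identifier itself.
print(before != after)
# True
```

This only demonstrates the blob level; the same circularity applies to the commit hash, since it depends on every file in the commit.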

Here's a guide on how to prepare a release and get the DOI from Zenodo. An example of how to cite and link this information is this paper. Check the code availability section.

greymonroe commented 5 years ago

see https://international-soil-radiocarbon-database.github.io/ISRaD/database/ for information about versioning

International-Soil-Radiocarbon-Database / ISRaD

Versioning #161