benmarwick / rrtools

rrtools: Tools for Writing Reproducible Research in R

[feature] Warn of known bugs in used software (bug-for-bug compatibility) #107

Closed joukewitteveen closed 2 years ago

joukewitteveen commented 4 years ago

Word of warning: This issue came up at an interesting talk by @annakrystalli. I have no time to help out, but she encouraged me to post this regardless.

Consider a hypothetical library X that, in version 1.0.0, contains an obscure bug where 0.5683/0 evaluates to -infinity, in violation of IEEE 754. From the perspective of the library developers, this is a silly bug, and a new version is released quickly in which 0.5683/0 evaluates to +infinity. Despite https://xkcd.com/1172/, this is seen as an improvement in adhering to documented behavior, so the new release gets version number 1.0.1 (bug fix) and not 2.0.0 (breaking change).

If we look at https://github.com/benmarwick/rrtools/blob/58e842ee6bc2067fd4046fa50766b5691a66655e/inst/templates/Dockerfile#L13, we see that this new version will likely be picked up by our Docker image as soon as it finds its way to the repositories. The result is that we may no longer be able to reproduce something done with X v1.0.0. In other words, reproducibility requires bug-for-bug compatibility with the original environment, and apt-get update breaks this.

The Bug: Ideally, rrtools should refer to frozen repositories or otherwise limit the possible impact of apt-get update.

The Feature Request: Perhaps not as part of rrtools, but it would be nice if there were a tool that could take a compendium, analyze the software and versions used in it, and warn the author(s) if any of that software is known to contain a (numerical) bug. We do not want outcomes to be skewed by software errors, but there is very little to protect us from them.

benmarwick commented 4 years ago

Thanks for sharing your observation. Do you have any suggestions about how we can fix this?

joukewitteveen commented 4 years ago

Looking at some packages in the Ubuntu package repository, it appears that old versions of packages are not removed. Maybe The Bug can be fixed by not running apt-get update and instead reverting to a known set of package versions.

For The Feature Request, I am not so sure. This is probably an entire project in and of itself. The logic you have probably figured out yourself already: scrape the compendium for the versions of all dependencies, check them against some central database of known flaws, and report.
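The scrape, check, report logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `KnownIssue`, `KNOWN_ISSUES`, and `find_known_issues` are hypothetical names, the "central database" is an in-memory list, and the one entry in it is the hypothetical library X from the opening comment. No such database is known to exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    package: str
    affected_versions: frozenset  # exact version strings known to be affected
    summary: str


# Hypothetical stand-in for a central database of known flaws.
KNOWN_ISSUES = [
    KnownIssue("X", frozenset({"1.0.0"}), "0.5683/0 evaluates to -infinity"),
]


def find_known_issues(dependencies, issues=KNOWN_ISSUES):
    """Return (package, version, summary) for each dependency with a known issue.

    `dependencies` maps package names to the exact versions found in the compendium.
    """
    findings = []
    for package, version in sorted(dependencies.items()):
        for issue in issues:
            if issue.package == package and version in issue.affected_versions:
                findings.append((package, version, issue.summary))
    return findings


if __name__ == "__main__":
    # Assumed result of scraping a compendium; Y is another made-up package.
    deps = {"X": "1.0.0", "Y": "2.3.1"}
    for package, version, summary in find_known_issues(deps):
        print(f"WARNING: {package} {version}: {summary}")
```

The hard parts, collecting exact versions from a compendium and maintaining the database, are deliberately left out of this sketch.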

benmarwick commented 4 years ago

Thanks for your reply. It looks like we may be able to do something like this, known as 'version-pinning':

RUN apt-get update && apt-get install -y \
    package-foo=1.2.*
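One caveat with a wildcard pin is worth spelling out. The sketch below assumes the wildcard behaves like a shell-style glob, which is approximated here with Python's `fnmatch` (this is not apt's own matching code), and it uses the made-up version numbers of library X from the opening comment. Under that assumption, a pattern such as `1.0.*` still admits the 1.0.1 bug-fix release, while an exact pin does not.

```python
from fnmatch import fnmatchcase


def matching_versions(pattern, available):
    """Return the versions that a glob-style pin would accept."""
    return [version for version in available if fnmatchcase(version, pattern)]


# Hypothetical versions of library X offered by a repository.
available = ["1.0.0", "1.0.1", "2.0.0"]

print(matching_versions("1.0.*", available))  # ['1.0.0', '1.0.1']
print(matching_versions("1.0.0", available))  # ['1.0.0']
```

For bug-for-bug compatibility, the pin would therefore have to name the exact version, and that version would still need to be available in the repository.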

Another method for tackling this might be in the containerit package. If this method addresses your concern, maybe we should use containerit to generate the Dockerfiles here? What do you think?

nevrome commented 4 years ago

As Matthias Hinz points out in the thread you linked, @benmarwick, a good, general solution of this issue is currently impossible:

The main problem is, that the repositories mostly provide the most recent version of a package only. Even if there were repositories with historic packages, we would still have to match libraries with package names and map between version-tags, which may vary depending on the platform and architecture.

As far as I understand reproducibility with Docker, you should save the resulting image (with docker save) at the end of the project and not rely on the Dockerfile at all. It's a build instruction that only works if external services (software archives) are working as expected, and you can't rely on them for the far future of 5-10 years (pun intended).
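As a rough sketch of that archiving step, the Python helpers below build the `docker save` and `docker load` command lines and, optionally, run the save. The image tag and archive file name in the usage line are placeholders, and `archive_image` assumes the Docker CLI is installed and the image already exists locally.

```python
import subprocess


def docker_save_command(image, archive_path):
    """Command line that writes an image, with all its layers, to a tar archive."""
    return ["docker", "save", "--output", archive_path, image]


def docker_load_command(archive_path):
    """Command line that restores an image from such an archive."""
    return ["docker", "load", "--input", archive_path]


def archive_image(image, archive_path):
    """Run `docker save`; raises CalledProcessError if the command fails."""
    subprocess.run(docker_save_command(image, archive_path), check=True)


if __name__ == "__main__":
    # Placeholder names; print the commands rather than executing them.
    print(" ".join(docker_save_command("my-compendium:final", "my-compendium.tar")))
    print(" ".join(docker_load_command("my-compendium.tar")))
```

The resulting tar file can be deposited alongside the compendium, so a later reader loads the image instead of rebuilding it from the Dockerfile.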

If the version problem you describe, @joukewitteveen, occurs while you're still working on the project -- for example when a new software version introduces a critical bug -- then version-pinning or direct source code download (as introduced for some software in containerit) might be a solution.

But again: This is IMHO only a temporary solution until you finish your work on this project and store everything in a neatly packed virtual machine image.

joukewitteveen commented 4 years ago

Let me restate that I am not currently having any problems. This issue is (to me) conceptual.

Although I agree that saving a complete image is probably the best solution for future reproducibility, this is often not very practical, and many software archives maintain copies of older versions, so there may be hope. Either way, it would be beneficial if a compendium contained a list of dependencies with exact versions, for instance in a yaml file. A helper script can then try to provision the Docker image with the exact versions of these dependencies (relieving the author of this task), while another script may link the dependencies in this file to a database of known technical issues, so that a compendium can be flagged automatically if it relies on a piece of software with known problems.
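A minimal sketch of the provisioning half of this idea, in Python: it reads a flat list of `name: version` lines (a small subset of YAML, parsed by hand to stay within the standard library) and renders a Dockerfile instruction with exact pins. The manifest format, the function names, and the package names `libfoo` and `libbar` are all assumptions made for illustration.

```python
def parse_manifest(text):
    """Parse 'name: version' lines into a dict, ignoring blanks and # comments."""
    deps = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Split at the first colon only, so versions with an epoch (1:2.3-4) survive.
        name, sep, version = line.partition(":")
        if not sep or not name.strip() or not version.strip():
            raise ValueError(f"expected 'name: version', got {raw!r}")
        deps[name.strip()] = version.strip().strip("\"'")
    return deps


def render_install_instruction(deps):
    """Render a Dockerfile RUN instruction that pins every package exactly."""
    pins = " \\\n    ".join(f"{name}={version}" for name, version in sorted(deps.items()))
    return f"RUN apt-get update && apt-get install -y \\\n    {pins}"


if __name__ == "__main__":
    manifest = "libfoo: 1.2.3-1\n# system library used by the analysis\nlibbar: 0.9\n"
    print(render_install_instruction(parse_manifest(manifest)))
```

The generated instruction only works for as long as the archive still serves those exact versions, so this complements saving the image rather than replacing it.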

nevrome commented 4 years ago

Well, there are at least two projects that attempt to provide a catalog of install commands for system requirements: r-system-requirements and sysreqsdb.

A system like the one you describe could be added there as another layer, I guess.

benmarwick commented 4 years ago

I'm going to take a closer look at the containerit package to see if that can help us with this.