R CRAN package binaries for Linux - 2.0

Dear Working Group,

for the past months I have been building on a new project that I internally call "R CRAN package binaries for Linux - 2.0" (2.0 as Posit PM was the 1.0 in my mind). Hereby I'd like to introduce it to the working group and gather feedback and get ideas for a sustainable future.

Background

CRAN does not provide package binaries for Linux. For many years, Posit is doing so for several OS but only one architecture.

Given the rise of arm64 and docker-based actions in recent years, the need emerged to also have package binaries for arm64 and the commonly-used OS for docker-based workloads, Alpine.

I've reached out to Posit a few times whether they have plans to extend their build chain to arm64 with no success - I only got one brief answer from multiple sources that there are "no plans to do this in the foreseeable future".

Concept

I've head the idea in mind for quite some time but was so far lacking time and financial resources (to get started with). Starting a LTD two months ago, I now have both and my motivation was high to finally tackle this issue.

So I started to build R package binaries on Linux for the following matrix:

Arch: arm64 + amd64
OS: Ubuntu 22.04 + 24.04, RHEL 8 + 9, Alpine 3.20

This results in a total of twelve times building CRAN (including all historic package versions), so roughly 12 x 20k x 6 (6 being the average version count of CRAN packages) = 1,4 Mio binaries.

General Details

The binaries are already built and are distributed through a global CDN. The CDN showed download speedup times of up to 10x compared to downloading the same binary from Posit PM (from a European location).

While I initially only had the plan to build arm64 binaries, I realized that adding a new repo for only arm64 binaries is cumbersome: one has to switch repos for arm64/amd64 builds and in addition needs a CRAN src repo to ensure all packages are available (as some binaries are not available due to build failures).

Binaries have been built once off starting on date X. Since "start day" +1, the updated packages are being processed. This includes:

Building binaries for updated packages
Archiving old packages
Removing packages removed from CRAN

This is then followed by a task that updates the PACKAGES index file which is responsible to provide an index for all available packages.

Technical Details

Binaries are built on a k3s-based kubernetes cluster with autoscaling. This means that recurring jobs will execute the tasks mentioned above and shut down the server after that automatically (usually takes 20-50 mins/day depending on the amount of packages).

The backend for all builds are individual docker images I've crafted that contain a robust compiler setup including further components like X-server, Blas, Chromium, and other tools needed to build many packages.

System dependencies are inferred via pak which make use of https://github.com/rstudio/r-system-requirements to automatically infer these from a package DESCRIPTION file.

With respect to storage, I might have opted for a new approach: storing binaries in S3. While this sounds like a logical solution per se, all related tools and helpers (pkgs tools, cranlike and desc) are not compliant to work with S3 at the moment. Hence I forked them and added S3 support. For cranlike e.g. this means supporting updating PACKAGES on a remote S3 location using the etag for the required md5sum hash value.

Sustainability & Quality

While so far I've built around 1,4 million binaries including daily rebuilt jobs, there's more to keep such a project alive.

One point is of course to improve the underlying Containerfiles with respect to their compiler config as there are still (too many) packages which fail to build due to some C-code issues. While my plan is to provide precise statistics for each package (as I am storing the build metadata for every binary in a Postgres DB), Alpine is surely the most complicated one due to the fact that is uses MUSL instead of GLIBC as the C library.

Besides the build and storage costs (which I don't want to share in this post here), there's also the distribution cost. Distributing the binaries through a global CDN with local storage caches on different continents seems to be a great solution to me. I don't want to store the assets in a single location/S3 bucket and then have many requests with a high latency and overall travel time.

All of the above comes with some costs. I didn't mention the person hours yet, but so far I think I am somewhere around 300h that I've invested into the project. Storage and server costs so far are between 1-2k.

My goal is not to maximize profits with this project, even though I am placing it within my recently founded company. I want to help the R community to advance to the "present" of today's possibilities (WRT to architectures and asset delivery) and make use of the binaries myself. Placing it under the umbrella of my company helps me to finance and justify it as a "professional" project. I am aware of the R consortium funds and applying for a grant is definitely in scope. However, I wanted to first share this project with the WG before proceeding with such.

Overall, I am looking for feedback and support with this project to make it sustainable, both in terms of technical and financial support. The source code is not yet public as I still need to document it properly and "clean up" - but I am definitely planning to do so. In contrast to Posit PM, I would like to develop/maintain the project in the open and encourage everyone to contribute.

Patrick Schratz

I think building arm64 binaries is something CRAN maintainers have in mind, as I think they briefly mentioned serving arm64 binaries in previous meetings with them. There is interest also from other Linux distributions to have linux binaries built from CRAN for arm64 (Fedora that I've heard of, but others too probably).

CRAN in their recent presentation at useR!2024 incldes a section: "Help with core CRAN services". This seems like a core CRAN service so it seems you could get support from CRAN. Maybe next meeting, November 11th at 17 CET, you could present this and get more feedback. If you don't get the invitation, I'll send it to you.

More practical ideas, I wouldn't built old package versions on default. From what I read from Posit their server get very few (<5%) request for packages for old versions of R (It might be that users install from source, use docker or already have other ways to deal with it). On the cost side you could keep last n versions of packages to reduce the storage requirements.

I think building arm64 binaries is something CRAN maintainers have in mind, as I think they briefly mentioned serving arm64 binaries in previous meetings with them. There is interest also from other Linux distributions to have linux binaries built from CRAN for arm64 (Fedora that I've heard of, but others too probably).

This is great to hear though I am wondering how this would unfold in practice: until today there are no CRAN binaries, even for amd64 only. Seeing an effort for multiple architectures across different distributions is highly welcome, though I don't have much hope this being tackled in due time with a modern and open approach (judging on the closed binary building system of today). Also I think that when starting something like this, the most commonly used distributions should be supported (including Alpine). I could see that this would only be approached for Ubuntu then (as a start) and others falling behind.

My proposal here is actually not related to "get help from CRAN" (again, this is not a proposal for a new project but a pre-release announcement of something already built) but rather to establish a new approach of building package binaries in the open. Using a modern, distributed underlying architecture with transparent statistics to which everyone could contribute to.

More practical ideas, I wouldn't built old package versions on default. From what I read from Posit their server get very few (<5%) request for packages for old versions of R (It might be that users install from source, use docker or already have other ways to deal with it). On the cost side you could keep last n versions of packages to reduce the storage requirements.

I thought about this when starting out. However, it is hard to judge where the cutoff point should be. In addition, the automation in place simply scrapes all package versions and tries to build them. Many of the old ones will error anyhow due to missing/archived R packages, incompatible compilers or missing sysdep entries in DESCRIPTION.

Yet again, the storage size is actually not that much of an issue due to the use of S3. Even a few TB are not that expensive (in contrast to storing this size on a cloud volume). More costs are spent on building historic versions that compile for many minutes and then fail. However, I plan to address this using the logs acquired during the initial builds and implementing a "smart judge" feature that decides whether a specific tag will be skipped entirely for future builds. With "future builds" meaning rebuilds for newer OS versions, e.g. when Ubuntu 2604 comes out. While Ubuntu and RHEL only have releases every few years, this system is more important for Alpine which releases a new version to build against every 6 months.

Maybe next meeting, November 11th at 17 CET, you could present this and get more feedback. If you don't get the invitation, I'll send it to you.

Sure, sounds like a good option.

My proposal here is actually not related to "get help from CRAN" (again, this is not a proposal for a new project but a pre-release announcement of something already built) but rather to establish a new approach of building package binaries in the open. Using a modern, distributed underlying architecture with transparent statistics to which everyone could contribute to.

Thanks for clarifying my misunderstanding, looking forward to see how you do it. In case it helps, most of CRAN system is public at https://svn.r-project.org/R-dev-web/trunk/. I think there is an issue or a PR in this repo about how they build the binaries or using the pipeline locally.

I thought about this when starting out. However, it is hard to judge where the cutoff point should be. In addition, the automation in place simply scrapes all package versions and tries to build them. Many of the old ones will error anyhow due to missing/archived R packages, incompatible compilers or missing sysdep entries in DESCRIPTION.

There is no need to scrap the data, R provides functionality to access the current ones with tools::CRAN_package_db() for current ones or all old packages with tools::CRAN_archive_db() (only on devel), but they are in CRAN's archive's and could be used to rebuild those that fail.

Looking forward to the learn more about the project.

There is no need to scrap the data, R provides functionality to access the current ones with tools::CRAN_package_db() for current ones or all old packages with tools::CRAN_archive_db() (only on devel), but they are in CRAN's archive's and could be used to rebuild those that fail.

Thanks. I already make use of tools::CRAN_package_db(), not yet of tools::CRAN_archive_db() (just saw it is an ::: which is why I likely didn't see it before). I currently use cranberries to infer the updated pkgs on a daily basis. For the sources I am using the GH mirror of all packages at https://github.com/cran.

WRT to build failures etc: I envision a transparent UI where everyone can see the build status and failures of packages. The idea is that users can check those and provide suggestions/patches to the underlying Dockerfiles (or their own sources) to solve the failures. E.g. during the process of building Alpine packages I've already submitted a few issues to certain R packages containing C code as their packages failed on Alpine. Turns out only a small change was needed due to MUSL/GLIBC and now it works for both. Given that so far nobody had a focus to check for Alpine it is not surprising that many packages fail to build. The new system would provide a platform for such builds and encourage authors to get their package built on Alpine.

👋🏻 Hi, I'm a random person from the internet who helps maintain a couple R packages, hope you don't mind the drive-by post.

Have you seen the r2u project (https://github.com/eddelbuettel/r2u)? It has some overlapping design goals to what you've described here, and might be worth looking at for inspiration.

It would also probably be useful to describe how what you're building differs from conda-forge (https://anaconda.org/r/repo), which is also building aarch64 binaries of many R packages (see the data.table variants here, for example), and has some of the features you've described (like "everyone can see the build status and failures of packages").

System dependencies are inferred via pak which make use of https://github.com/rstudio/r-system-requirements to automatically infer these from a package DESCRIPTION file.

This, in particular, is something that r2u handles differently. Paraphrasing here from a talk I recently saw @eddelbuettel give... I believe that that system does something like "build from source, then run ldd on the built library to determine which other shared libraries it'll need at runtime, then map backwards from those library filenames to package names".

Hi James,

r2u

Sure, r2u isn't a new project and around for some time. I got asked about it several times in my professional work (which evolves around R infrastructure since several years).

I don't think r2u solves any issues related to R packaging within the R community and (hot take) even makes it more complicated. Here's why:

Only built for amd64
Only available for one Linux family (Debian)
Updates must be done on the admin side/root as normal users can't use apt (users therefore don't even know when packages got updated)
You can't use any project versioning approaches (like renv or others) therefore as you can't install historic versions on your own
A well maintained OS updates system packages on a regular base (daily, weekly) which means R packages would change constantly

Therefore I always discourage the use of r2u when somebody asks me about my opionion. In addition, given the full availability of R package binaries through Posit PM for Ubuntu, I don't see any benefit of r2u even for Ubuntu users.

anaconda

Anacondas package compatibility is not well described. I.e. they provide a single "linux-64" build. However, this cannot work for all distros as each needs different versions during runtime and must be linked to local sysdeps. Yet I haven't tried it in practice on different distros and inspected it in close detail.

In addition, they don't built full CRAN but only a subset - see also here:

Many Comprehensive R Archive Network (CRAN) packages are available as conda packages. Anaconda does not provide builds of the entire CRAN repository, so there are some packages in CRAN that are not available as conda packages.

has some of the features you've described (like "everyone can see the build status and failures of packages").

This is "just" some additional niceness of the overall approach, not something that stands out. What matters to me is:

arch-agnostic packages
distribution-agnostic packages
historic versions
usage with install.packages()
a globally low latency for downloads and installation

@llrs Would you mind sending me an invitation for the upcoming meeting?

RConsortium / r-repositories-wg