berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Install most packages from conda-forge, instead of pypi #2934


yuvipanda commented 2 years ago

Currently, we get most of our Python packages from pypi.org, installed via pip. A lot of scientific Python packages have C extensions, and installing them from PyPI has been simple enough thanks to manylinux wheels. However, some packages - particularly in the geosciences - are still a pain to install this way.

https://github.com/berkeley-dsep-infra/datahub/issues/2824 is one such case. The cartopy project does not ship manylinux wheels, so we need to install its C dependencies - proj, geos, gdal, etc. - from apt. This also has knock-on effects for other packages that depend on proj, like shapely. Shapely does ship binary wheels, but because cartopy and shapely must link against the same proj library, shapely must be built from source too - or you run into problems like https://github.com/berkeley-dsep-infra/datahub/issues/1796.

This becomes even more complicated when we add R to the mix. The sf R package also needs proj, and since we're installing it from packagemanager.rstudio.com, it's linked against the version of proj that is available in apt.

So to recap, the following package managers are involved:

  1. apt to get C libraries (proj)
  2. pip to get Python packages that link against those C libraries (shapely, cartopy)
  3. packagemanager.rstudio.com to get R packages that also link against the C libraries from apt (sf)
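To make the current tangle concrete, here is a minimal Dockerfile-style sketch of how those three package managers interact in a single image. This is illustrative only - package names, the base image, and the repo URL are assumptions, not the actual datahub image config:

```dockerfile
# Illustrative sketch of the current three-package-manager setup (not the real datahub image).
FROM ubuntu:20.04

# 1. C libraries from apt
RUN apt-get update && apt-get install -y libproj-dev libgeos-dev libgdal-dev

# 2. Python packages from pip; shapely built from source so it links against apt's proj,
#    the same library cartopy links against
RUN pip install --no-binary shapely shapely cartopy

# 3. R packages from packagemanager.rstudio.com, whose binaries are also
#    compiled against the apt-provided proj
RUN Rscript -e 'install.packages("sf", repos = "https://packagemanager.rstudio.com/all/__linux__/focal/latest")'
```

The fragility is that steps 2 and 3 both silently depend on step 1 providing exactly the proj version they were built against.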

This was already a tenuous situation, but the need to upgrade cartopy for https://github.com/berkeley-dsep-infra/datahub/issues/2824 made it unworkable. Cartopy 0.20 needed a newer version of proj than what was available in apt. With https://github.com/berkeley-dsep-infra/datahub/pull/2826, we tried to install a newer proj from conda (adding yet another package manager to the mix!), but this required us to remove the proj installed via apt - otherwise pip-installed packages would still link against it, which doesn't work. And once we removed proj from apt, the R sf package broke, as it required proj from apt!

I think the core of the problem is that both pip and R are dependent on apt for some C libraries, and this can conflict. I propose instead that we:

  1. Use conda to get most scientific Python packages, especially any that have C dependencies. This mostly removes the need to get C libraries from apt.
  2. Use apt to get C libraries needed by R packages.

The scientific Python ecosystem has a lot of good support for conda, so I think this will also simplify our lives a bit. We'll still be getting some Python packages from pip, but as long as we're getting most packages that link against C libraries from conda, I think we're ok.
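Under this proposal, a hub's environment.yml might look something like the sketch below. The package names and versions are illustrative (and `some-pure-python-package` is a hypothetical placeholder), not a real hub's requirements:

```yaml
# Illustrative environment.yml: C-linked packages from conda-forge, pure-Python from pip.
name: hub
channels:
  - conda-forge
dependencies:
  - python=3.9
  - cartopy   # pulls in proj/geos from conda-forge, so apt is no longer needed
  - shapely   # linked against the same conda-forge proj as cartopy
  - gdal
  - pip
  - pip:
      - some-pure-python-package  # hypothetical; packages without C deps can stay on pip
```

The key property is that cartopy and shapely now resolve their shared proj dependency inside one solver, instead of across apt and pip.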

Let's move these one hub image at a time, starting with the easiest.

If we get similar versions from conda to what we get from pip right now, I think this would work out ok. Builds should also be faster.

agoose77 commented 2 years ago

@yuvipanda Hi, just dropping in here! Is there a reason that you can't get the R packages from conda-forge? e.g. r-sf?

yuvipanda commented 2 years ago

Basically, for Python packages: install them with conda via environment.yml if they exist on conda-forge, and use pip otherwise.
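As a sketch of that policy, here is a small helper that splits a requirements list into conda-forge installs vs pip installs. The availability set is hard-coded for illustration (in practice you'd query conda-forge itself), and `my-course-helper` is a hypothetical pip-only package:

```python
def partition_packages(requested, on_conda_forge):
    """Split packages per the policy: conda-forge when available, pip otherwise."""
    conda = sorted(p for p in requested if p in on_conda_forge)
    pip = sorted(p for p in requested if p not in on_conda_forge)
    return conda, pip

# Hypothetical example: assume these three geo packages are on conda-forge.
conda_pkgs, pip_pkgs = partition_packages(
    {"cartopy", "shapely", "gdal", "my-course-helper"},
    on_conda_forge={"cartopy", "shapely", "gdal"},
)
print(conda_pkgs)  # ['cartopy', 'gdal', 'shapely']
print(pip_pkgs)    # ['my-course-helper']
```

The conda list would go under `dependencies:` in environment.yml and the pip list under its `pip:` subsection.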

yuvipanda commented 2 years ago

@agoose77 Most of the R community I know of would like to use install.packages() or devtools to install packages and manage them from CRAN, and I don't want to redirect them to a different method instead. From an optics perspective, conda is often (fairly or unfairly!) seen as Python-centric, and given we're already fighting the perspective that JupyterHub is Python-centric (even though we offer RStudio in our hubs), I want to do everything I can to not have R users learn a different package management solution.

agoose77 commented 2 years ago

I see what you're saying. I'm not familiar with the R toolchain - is it possible to use different conda environments for RStudio vs the Python kernels?

yuvipanda commented 2 years ago

@agoose77 most of our R users use R via RStudio, so conda and Jupyter kernels are completely uninvolved there.

agoose77 commented 2 years ago

@yuvipanda sure, let me clarify!

My understanding of your situation is:

I am wondering whether it makes sense to drop the need for apt packages entirely by installing RStudio itself in a separate environment from Python, and then arranging your entry point so that this is invisible to the user. This is just so that there is clearer isolation between the system and the application environments (RStudio, Python). Functionally this wouldn't be much different from using apt for the R dependencies, except that it keeps everything on conda.

yuvipanda commented 2 years ago

Both Python and R are installed in the same environment

Ah, so they're installed in the same Docker image, but R doesn't know anything about conda at all, so they aren't in the same 'conda environment'. The proposal in this issue uses conda for all Python, and R's native package installation (from CRAN) + apt for R. The scripts that R users distribute often have install.packages() commands in them, and I don't want them to have to do something special instead. Hence avoiding getting anything R from conda for now.

agoose77 commented 2 years ago

Right! My suggestion was only that putting R inside a separate Conda env would allow you to avoid using APT for R, because install.packages should still work within a Conda environment. The only difference is that you provide the necessary dependencies e.g. proj via Conda inside that environment instead of from APT. It's mainly an idea to simplify the ergonomics so that your "environments" are distinct from the host :)
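A sketch of that idea: a dedicated conda environment providing R plus the C libraries that install.packages() would otherwise pull from apt. The environment name and exact dependency list are assumptions for illustration:

```yaml
# Illustrative: a separate conda env for R. install.packages() still works
# inside it, but source builds compile against proj/geos/gdal from
# conda-forge instead of apt.
name: r-env
channels:
  - conda-forge
dependencies:
  - r-base
  - proj
  - geos
  - libgdal
```

Users would keep running install.packages() from CRAN as before; only the C toolchain underneath changes.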

yuvipanda commented 2 years ago

@agoose77 ah, ok - I'll consider that :) I'm quite reluctant to use conda for R, as I feel the general R community is much more focused on CRAN and apt than on conda. packagemanager.rstudio.com offers prebuilt binary packages for all of CRAN, while only a subset of packages is available on conda-forge. In my ideal world, I'd not use conda for Python packages either (so I don't have to mix them!) - and at least for now it looks like I can do that (avoid mixing!) with R.

agoose77 commented 2 years ago

In my ideal world, I'd not use conda for python packages either (so I don't have to mix them!) - and at least for now it looks like I can do that (avoid mixing!) with R.

Yeah, I don't like mixing pip packages with conda-forge packages (and therefore tend to rely solely on PyPI). It would be nice if there were an abstraction layer for tools like poetry so that PyPI + conda-forge could both be used by the same tool.

felder commented 2 years ago

Submitted PR for julia and asked @yuvipanda to review just to make sure we're on the same page.

felder commented 2 years ago

dlab uses the datahub user image.

felder commented 2 years ago

@yuvipanda looking at eecs hub, would https://anaconda.org/conda-forge/py-opencv be the same as opencv-python?
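For what it's worth, conda-forge's py-opencv exposes the same importable cv2 module as pip's opencv-python, so the swap in environment.yml would look something like the sketch below (build options can differ between the two distributions, so this is an assumed equivalence worth verifying for the eecs image):

```yaml
dependencies:
  - py-opencv   # replaces `pip install opencv-python`; both provide `import cv2`
```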

balajialg commented 2 years ago

@felder Is this issue in scope for Fall '22, should it be moved to Spring '23, or is it irrelevant at this juncture?

felder commented 2 years ago

@balajialg could be in scope for Fall. My understanding is we're also going to update the base image and do some package management. Could be this gets done as part of that work.

balajialg commented 2 years ago

@felder Sounds good. Thanks!