yuvipanda opened this issue 2 years ago
@yuvipanda Hi, just dropping in here! Is there a reason that you can't get the R packages from conda-forge, e.g. `r-sf`?
Basically, for python packages, we should install them with conda via environment.yml if they exist in conda-forge, and use pip otherwise.
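As a sketch of what that policy could look like in practice (the package names here are purely illustrative, not from this thread), an `environment.yml` can mix conda-forge packages with a pip section for anything not available on conda-forge:

```yaml
# Illustrative environment.yml: conda-forge where possible, pip otherwise.
name: datahub-user-env
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy          # available on conda-forge -> install via conda
  - pandas
  - pip
  - pip:
      - some-pypi-only-package   # hypothetical: a package not on conda-forge
```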
@agoose77 Most of the R community I know of would like to use `install.packages()` or `devtools` to install and manage packages from CRAN, and I don't want to redirect them to a different method. From an optics perspective, conda is often (fairly or unfairly!) seen as python-centric, and given we're already fighting the perception that JupyterHub is python-centric (even though we offer RStudio in our hubs), I want to do everything I can to avoid making R users learn a different package management solution.
I see what you're saying. I'm not familiar with the R toolchain - is it possible to use different conda environments for RStudio vs the Python kernels?
@agoose77 most of our R users use R via RStudio, so conda and Jupyter kernels are completely uninvolved there.
@yuvipanda sure, let me clarify!
My understanding of your situation is that both Python and R are installed in the same environment and get `proj` from the system environment. I am wondering whether it makes sense to drop the need for apt packages entirely by installing RStudio itself in a separate environment from Python, and then having your entry point such that this is invisible to the user. This is just so that there is a clearer isolation/separation between the system and the application environments (RStudio, Python). Functionally this wouldn't be much different from using apt for the R dependencies, except that it keeps everything on conda.
> Both Python and R are installed in the same environment

Ah, so they're installed in the same Docker image, but R doesn't know anything about conda at all, so they aren't in the same 'conda environment'. The proposal in this issue uses conda for all Python, and R's native package installation (from CRAN) plus apt for R. The scripts that R users distribute often have `install.packages()` commands in them, and I don't want them to have to do something special instead. Hence avoiding getting anything R-related from conda for now.
Right! My suggestion was only that putting R inside a separate conda env would allow you to avoid using apt for R, because `install.packages()` should still work within a conda environment. The only difference is that you provide the necessary dependencies, e.g. `proj`, via conda inside that environment instead of from apt. It's mainly an idea to simplify the ergonomics so that your "environments" are distinct from the host :)
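Concretely, the suggestion above could look something like this (a sketch; the exact dependency list is an assumption, not from the thread): an `environment.yml` for a dedicated R env that provides the C libraries conda-side, so that `install.packages()` compiles packages like sf against conda's `proj` rather than apt's:

```yaml
# Hypothetical environment.yml for a separate R conda env.
name: rstudio-env
channels:
  - conda-forge
dependencies:
  - r-base         # R itself, so install.packages() works inside the env
  - proj           # C libraries that sf/cartopy-style packages link against
  - geos
  - gdal
  - udunits2
```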
@agoose77 ah, ok - I'll consider that :) I'm quite reluctant to use conda for R, as I feel the general R community is much more focused on CRAN and apt than on conda. packagemanager.rstudio.com offers prebuilt binary packages for all of CRAN, while only a subset of packages is available on conda-forge. In my ideal world, I'd not use conda for python packages either (so I don't have to mix them!) - and at least for now it looks like I can do that (avoid mixing!) with R.
> In my ideal world, I'd not use conda for python packages either (so I don't have to mix them!) - and at least for now it looks like I can do that (avoid mixing!) with R.
Yeah, I don't like mixing my pip with conda-forge packages (and therefore tend to rely solely on PyPI). It would be nice if there were an abstraction layer for tools like poetry such that PyPI + conda-forge could be used by the tool.
Submitted PR for julia and asked @yuvipanda to review just to make sure we're on the same page.
dlab uses the datahub user image.
@yuvipanda looking at eecs hub, would https://anaconda.org/conda-forge/py-opencv be the same as opencv-python?
@felder Is this issue in scope for Fall '22, should it be moved to Spring '23, or is it irrelevant at this juncture?
@balajialg could be in scope for Fall. My understanding is we're also going to update the base image and do some package management. Could be this gets done as part of that work.
@felder Sounds good. Thanks!
Currently, we get most of our python packages from pypi.org, installed via pip. A lot of scientific python packages have C extensions, and installing them from PyPI has been simple enough thanks to manylinux wheels. However, there are some packages - particularly in the geosciences - that are still a pain in the ass to install this way.
https://github.com/berkeley-dsep-infra/datahub/issues/2824 is one such case. The cartopy project does not ship manylinux wheels, so we need to install its C dependencies - proj, geos, gdal, etc - from apt. This also has knock-on effects for other packages that depend on proj, like shapely. Shapely does ship binary wheels, but because cartopy and shapely must link against the same proj library, shapely must be built from source too - or you run into problems like https://github.com/berkeley-dsep-infra/datahub/issues/1796.
This becomes even more complicated when we add R to the mix. The sf R package also needs proj, and since we're installing it from packagemanager.rstudio.com, it's linked against the version of proj that is available in apt.
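One way to see the shared-library coupling concretely (a diagnostic sketch added here, not part of the original discussion) is to ask the dynamic linker which `proj` library it would resolve, using only the Python standard library:

```python
import ctypes.util

# Ask the dynamic linker which libproj it would resolve on this system.
# On a plain Debian/Ubuntu image this is typically the apt-installed copy;
# inside an activated conda env that provides proj, it can resolve to
# conda's copy instead. Returns None if no proj library is found.
libproj = ctypes.util.find_library("proj")
print("proj shared library:", libproj)

# Every consumer (cartopy, shapely, R's sf) must end up linked against
# the *same* copy; if one links apt's libproj and another links conda's,
# you get the conflicts described in this issue.
```

Running this inside and outside the conda environment is a quick way to confirm which copy of proj each context sees.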
So to recap, the following package managers are involved: pip (pypi.org) for python packages, apt for C libraries like proj, and packagemanager.rstudio.com (CRAN) for R packages.
This was a bit of a tenuous situation, but the need to upgrade cartopy for https://github.com/berkeley-dsep-infra/datahub/issues/2824 made it totally unworkable. Cartopy 0.20 needed a newer version of proj than what was available in apt. With https://github.com/berkeley-dsep-infra/datahub/pull/2826, we tried to install a newer version of proj from conda (adding yet another package manager to the mix!), but this required removing the proj installed via apt - otherwise pip would still try to link against it, and that doesn't work. And once we removed proj from apt, this broke the R `sf` package, as it required proj from apt!

I think the core of the problem is that both pip and R depend on apt for some C libraries, and these dependencies can conflict. I propose instead that we:
The scientific python ecosystem has a lot of good support for conda, so I think this will also simplify our lives a bit. We'll still be getting some python packages from pip, but as long as we're getting most packages that link against C libraries from conda, I think we're ok.
Let's move these one hub image at a time, starting with the easiest.
`environment.yml`

If we get similar versions from conda that we get from pip right now, I think this would work out ok. It should also be faster to do builds.