conda-forge / kerchunk-feedstock

A conda-smithy repository for kerchunk.
BSD 3-Clause "New" or "Revised" License
0 stars 4 forks source link

Pull in `h5py` (and other direct deps) at conda package install point #4

Closed valeriupredoi closed 2 years ago

valeriupredoi commented 2 years ago

Hey guys, we're installing kerchunk in our conda/mamba environment, and I was expecting h5py, as a direct dependency of kerchunk, to be pulled in by the package at install point - proves out it's not the case, could you do that please (add it to the reqs/recipe), and make sure that all the direct dependencies are pulled in when you build your conda package so we don't have to add them as direct deps of our package please? See eg a failed test showing that. Also, let me know if I can help with this, I can open a PR, and cheers muchly for your good work on Kerchunk/pkg maintenance overall :beer:

martindurant commented 2 years ago

I would say it's not quite as clear as that. Kerchunk can be useful for a number of file formats, so you might well not need h5py in your workflow.

On the other hand, kerchunk is designed to be used by data providers and people who intimately understand the datasets being processed - whereas users are at the intake/xarray/zarr/fsspec API normally.

On the other other hand, kerchunk, the repo, contains some numcodecs codecs, which might be needed for some specific datasets (e.g., all grib); so normal users would end up installing all the file format drivers after all.

lsterzinger commented 2 years ago

Would it be possible to do something like xarray's dependency sets? We could do something like conda install kerchunk[hdf5] / kerchunk[grib] / kerchunk[geotiff] etc and have a kerchunk[full] that pulls all dependencies required for all kerchunk functionality

valeriupredoi commented 2 years ago

@martindurant cheers, and I understand your point, but I'd argue something along the lines of @lsterzinger 's suggestion (a very elegant solution btw, we used to do that too before we unified all the sub-packages in one, fully deployed on conda-forge) would be necessary - think of this scenario: a user installs kerchunk against their preferred h5py installation, that could be from PyPi (and not from conda-forge), a version or two older than what you tested with, and stuff's starting to break in no time (probably not the case since h5py/HDF5 are pretty solid, but you never know, issues with PyPi wheels vs conda forge compiled code are dime a dozen); another situation is if a package (like ours) needs to pull kerchunk and, in turn, be pulled, itself by another package that needs a certain version of h5py, that you guys have restricted on, that'll result in kaboom too, and under the hood, instead of a nice mamba message saying either conflict bla or, it solving for the common version that both you and that other package can accept. Anyways, you get my point - package dependency resolution is akin an angry honey badger, don't provoke it while you still can :grin:

martindurant commented 2 years ago

For completeness, the CI environment currently includes:

valeriupredoi commented 2 years ago

apologies gents, I lost track of this for a while - nice env @martindurant :+1: So if you're pulling in cfgrib then you can probably add h5py there too, methinks? Up to you guys to decide, am not trying to impose my packaging preferences by no means; we've put in a nice conda lock file in our package so we now should be immune to issues from dependencies (in theory, at least :grin: )

martindurant commented 2 years ago

if you're pulling in cfgrib then you can probably add h5py there too, methinks

This is the set of packages for testing, where obviously it's necessary to test as much of the codebase as we can, including all the bits that need extra packages. It would be a fine template if we decided that this recipe should be an all-inclusive one in terms of deps. In practice, I don't expect cfgrib and astropy to appear in the same environment ever, except mine!

valeriupredoi commented 2 years ago

haha since astronomers don't use grib :grin: (some actually do, but they much prefer HDF5)

martindurant commented 2 years ago

For reference, the py39 test environment takes up 783MB for a fresh install on my system. I suppose that's not too bad.

valeriupredoi commented 2 years ago

no not at all, not too light not too heavy either, actually :+1: For reference, an uncached test env for one of our packages pulls in about 1GB worth of deps

martindurant commented 2 years ago

One suggestion: make three conda packages in this feedstock

martindurant commented 2 years ago

Maybe the first two can be the same for the time being...

lsterzinger commented 2 years ago

If we go down that route then anyone who currently has kerchunk installed will get all the dependencies on an upgrade, no? Maybe kerchunk == kerchunk-minimal and there's a kerchunk-complete that includes all deps

martindurant commented 2 years ago

That's a good point, @lsterzinger . We can make the new package with extra deps, and of course add words to the doc to point people at it. Whether they will find that, though...

lsterzinger commented 2 years ago

How difficult would it be to create a dependency set for each storage format? I imagine most users are interested in 1 or 2 maximum, so I don't know if it really makes sense to make the two main options "all of the dependencies for all storage formats" or "none of the dependencies, figure it out for yourself". At the very least, I see the HDF5/NetCDF being popular enough to warrant it's own dependency set if we go down that path

martindurant commented 2 years ago

Having a load of related packages seems like risky territory - we will eventually have many formats that we support. That's why pip selectors don't really work - by the time you find the docs saying which is for what, you could have installed your specific needs already. Then we're back to the current minimalist install and decent import error messages. Maybe we can improve those error messages at the time each submodule is accessed...

lsterzinger commented 2 years ago

Yeah I think even just a simple "Package X is needed for kerchunking Y format files, install with conda install X" would be good. I assume most users kerchunking datasets are at least moderately experienced with Python and wouldn't be thrown off by a "no module named X" error

martindurant commented 2 years ago

Following https://github.com/fsspec/kerchunk/pull/227, putting this on the back burner - closing for now. We may revisit in the future.

valeriupredoi commented 2 years ago

sounds good, cheers gents! :beer: