conda-forge / matplotlib-feedstock

A conda-smithy repository for matplotlib.
BSD 3-Clause "New" or "Revised" License

Split into meta packages #2

Closed pelson closed 6 years ago

pelson commented 8 years ago

It would be great to split this recipe into several meta-packages. That way, matplotlib doesn't have to depend on tk / qt etc. in the same way that jupyter doesn't pull in all of the dependencies needed for the notebook.

jankatins commented 8 years ago

See also https://github.com/conda/conda/issues/793

pelson commented 8 years ago

Thanks @JanSchulz. I'd seen that before, but I like your recent comment.

frol commented 8 years ago

Here are some discussions on this matter in the conda-recipes repository:

Here is also my recent update for our internal use since we care about not having libX.so on our headless servers/containers: https://github.com/salford-systems/conda-recipes/tree/master/matplotlib-nogui (here is the repo: http://anaconda.org/salford_systems/matplotlib-nogui)

ChrisBarker-NOAA commented 7 years ago

Would it be so bad to not have qt as a dep -- and folks simply need to install qt if they want to use the qt back-end?

tk should be there, as we want SOME default UI.

epruesse commented 7 years ago

Bump....

How would you want to go about this?

ocefpaf commented 7 years ago

mess with multiple outputs

This is probably the way we should go.

epruesse commented 7 years ago

This is probably the way we should go.

sigh

I agree, but don't like it.

ocefpaf commented 7 years ago

I agree, but don't like it.

Please elaborate. We should weigh all our options. (And if possible fix the issues with the outputs options in case there are any.)

epruesse commented 7 years ago

Well, primarily because it's way more work. The way outputs: works in at least conda <3 makes it tedious to use. I'll also have to build matplotlib with all modules enabled and then split it by hand and hope the tests catch all issues. And I'm not familiar enough with it to tell whether there are internal flags set, plus the manual splitting is something that needs to be reviewed with every upstream release.

I do agree that the cleanest way would be to have

(The latter because existing packages may depend on matplotlib coming with everything)

ocefpaf commented 7 years ago

Well, primarily because it's way more work.

I haven't looked into that yet but I believe you. I did see some examples that seemed to have more boilerplate than they should. (CAVEAT I am biased to how things are done with RPMs.)

The way outputs: works in at least conda <3 makes it tedious to use. I'll also have to build matplotlib with all modules enabled and then split it by hand and hope the tests catch all issues.

I guess that this would be true somehow in the other approaches.

I used to package matplotlib for OpenSUSE exactly like you propose above. In my experience it was not worth it. Maybe only the two packages: -core and all backends? What do you think?

jakirkham commented 7 years ago

What about building matplotlib with everything and then using run_constrained to pin them correctly when they are installed? This would result in the following: all of backends are optional, all backends can work, pinnings are correct when backends are installed, and no backends shipped by default. Will require conda-build 3, but we do need to switch at some point.

cc @msarahan
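To make the run_constrained idea a bit more concrete, here is a minimal Python sketch of the semantics as described in this comment: hard run requirements must be present, while constrained packages are optional but must match their pin if they happen to be installed. This is a toy model, not conda's solver. The `check_environment` helper, the package pins, and the prefix-based version matching are all assumptions made for illustration.

```python
# Toy model of "run" vs "run_constrained" semantics (not conda's real solver).
# All names and pins below are hypothetical illustrations.

def matches(version, pin):
    """Rough stand-in for conda version matching: prefix match on the pin."""
    return version == pin or version.startswith(pin + ".")


def check_environment(installed, run, run_constrained):
    """Return a list of problems found in an environment.

    installed:        dict of package name -> installed version
    run:              dict of package name -> pin (package must be installed)
    run_constrained:  dict of package name -> pin (only checked if installed)
    """
    problems = []
    for name, pin in run.items():
        if name not in installed:
            problems.append(f"missing required package: {name}")
        elif not matches(installed[name], pin):
            problems.append(f"{name} {installed[name]} does not match {pin}")
    for name, pin in run_constrained.items():
        # Optional: absence is fine, but a present package must match the pin.
        if name in installed and not matches(installed[name], pin):
            problems.append(f"{name} {installed[name]} violates constraint {pin}")
    return problems


run = {"numpy": "1.13"}
run_constrained = {"pyqt": "5.6"}

headless = {"numpy": "1.13.1"}
with_qt = {"numpy": "1.13.1", "pyqt": "5.6.0"}
wrong_qt = {"numpy": "1.13.1", "pyqt": "4.11.4"}

for env in (headless, with_qt, wrong_qt):
    print(check_environment(env, run, run_constrained))
```

In this model the headless environment and the one with a matching pyqt both pass, and only the mismatched pyqt is reported, which is the "optional but correctly pinned" behaviour being proposed.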

epruesse commented 7 years ago

I used to package matplotlib for OpenSUSE exactly like you propose above. In my experience it was not worth it. Maybe only the two packages: -core and all backends? What do you think?

I concur. matplotlib with everything and matplotlib-nogui would cover all use-cases that come to mind. That's why I seriously considered the "KISS" approach of a second package.

epruesse commented 7 years ago

@jakirkham Can we build with conda-build 3 yet? Will those packages work with conda 2.* installations?

ocefpaf commented 7 years ago

That's why I seriously considered the "KISS" approach of a second package.

We can bypass the conda-build pinning and use conda-build 3 on a case-by-case basis until we move completely to conda-build 3.

I understand your thinking entirely but I believe we should give outputs a shot and send some feedback upstream. If a new package will be indeed always easier then we can drop outputs.

(I am interested in doing that myself, just don't know when I'll have the time to do it.)

epruesse commented 7 years ago

I'm not sure run_constrained helps a lot here. conda install matplotlib-core pyqt5 to get a working backend is anything but straightforward, so a package including the GUI backends is required anyway, even if only as meta-package. The binary parts need to go somewhere too, and for size reasons preferably not as ballast into the non-gui package, so we'd end up with two packages either way, wouldn't we?

jakirkham commented 7 years ago

Was thinking about that too. Though python always pulls in tk. So one would end up with a tk-backend by default anyways.

jakirkham commented 7 years ago

That said, starting with a matplotlib-nogui package would be totally reasonable and could easily be integrated into any of these solutions that we settle on and develop later.

epruesse commented 7 years ago

I think it'll have to be outputs:. Conda has no provides: or conflicts: type dependency rules, so the two packages can't occupy the same namespace within site-libs, plus you wouldn't want packages depending on the GUI backends and those not requiring them to be mutually exclusive.

ChrisBarker-NOAA commented 7 years ago

First -- I apologize for having no idea what "outputs" does...

But -- what is the goal of splitting it up?

possible goals:

1) making the packages smaller and easier to build -- that's going to require really splitting things up -- I'm guessing more work than it's worth. I'd probably only do that if MPL itself wants to better support that.

2) making the dependencies lighter weight: -- i.e. if I am using matplotlib for a headless application, I really shouldn't have to install pyqt, etc.

For that use case, we can still have a single complete package, say matplotlib_nogui, and then have other packages that are only dependencies. so you install:

matplotlib_qt

And it has only dependencies: matplotlib_nogui pyqt ...

or maybe only have a "matplotlib" package that depends on everything - that's what it will have to be in the short term anyway.

Though I'd love to go with conda install "matplotlib" and NOT get qt or anything optional

To some extent, I'm not sure it's wrong for the matplotlib package NOT to have the dependency on qt or GTK, or wx, or... if you are building an application or package with QT, you are going to know that, and can add the dependency yourself.

msarahan commented 7 years ago

outputs are conda-build's way of emitting more than one package from a single recipe. Docs are at https://conda.io/docs/user-guide/tasks/build-packages/define-metadata.html#outputs-section

Conda 4.4 will add some support for optional dependencies. More info at https://github.com/conda/conda/issues/3299 and https://github.com/conda/conda/pull/4982

ChrisBarker-NOAA commented 7 years ago

Conda 4.4 will add some support for optional dependencies

nice! I would like that.

Thinking a bit more -- do the GUI back-ends add much overhead at all anyway? i.e. is there any reason NOT to build MPL with support for all of them?

IIRC, it was a pain back in the day to build MPL with wx support, because you needed the right wx installed, and then it was a hard dependency on that version (and before conda or even wheels, that was a major pain!).

But once we are set up to build the full thing -- it didn't add a big binary or anything -- the big stuff is the GUI package itself.

-CHB

jakirkham commented 7 years ago

Yes, they pull in a lot of extra dependencies. This is painful for applications that have narrow size limits. That said, I would be ok building with all of the GUI dependencies as long as they can be made optional somehow.

jakirkham commented 7 years ago

I think it'll have to be outputs:. Conda has no provides: or conflicts: type dependency rules, so the two packages can't occupy the same namespace within site-libs, plus you wouldn't want packages depending on the GUI backends and those not requiring them to be mutually exclusive.

IDK could have matplotlib depend on matplotlib-nogui and use always_include_files in matplotlib to forcibly overwrite (relevant or all) contents of matplotlib-nogui.

epruesse commented 7 years ago

@ChrisBarker-NOAA

In numbers: 830MB for matplotlib total, of that 400MB for MKL (a dozen binaries specific to various architectures) and 230MB Qt5, sometimes a lot more, depending on how conda resolves the chain.

It's also not just the space occupied that causes issues: The dependency resolver in conda is horribly slow and can easily get stuck trying to find a solution in a large constraint tree. Fewer dependencies lower the risk of making it try to find a needle in a haystack. Another issue is the sheer number of files. On an SSD backed workstation it's hardly noticeable, but linking or copying thousands of files to create an environment takes quite a while on NFS or on CI systems.

Things like diligently splitting packages into very small sub-packages, stripping the debug symbols from libraries and placing them in separate packages, compressing large files, etc. sound unimportant when looking at e.g. Debian from the outside. With a "distro" that doesn't do all this rigorously, though, you install "just the bare necessities" (numpy, pandas, mkl, pip, matplotlib) and you've filled nearly 2GB.

Still not an issue on the laptop or workstation, but in a docker run on a CI service this adds up quickly. I've got a project where installing the conda environments costs almost half an hour. At 1500 free minutes a month that's only 50 builds per month I can do before I need to start spending money.
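The CI budget arithmetic above can be written out as a small sketch. The 1500 free minutes come from the comment; 30 minutes per build is an assumption that rounds "almost half an hour" up, and the `builds_per_month` helper is hypothetical.

```python
FREE_MINUTES_PER_MONTH = 1500  # free CI minutes, as stated in the comment
MINUTES_PER_BUILD = 30         # assumption: "almost half an hour", rounded up


def builds_per_month(free_minutes, minutes_per_build):
    """Number of whole builds that fit into the free monthly budget."""
    return free_minutes // minutes_per_build


print(builds_per_month(FREE_MINUTES_PER_MONTH, MINUTES_PER_BUILD))  # 50
```

Since the number of builds scales inversely with the install time, shaving minutes off environment creation translates directly into more free builds.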

epruesse commented 7 years ago

@jakirkham I'll try to split the package with outputs:. The always_include_files way frightens me a little, it would at the very least have to use the run_constrained mechanism to make sure we don't get mixed versions of matplotlib installed.

jakirkham commented 7 years ago

The always_include_files way frightens me a little, it would at the very least have to use the run_constrained mechanism to make sure we don't get mixed versions of matplotlib installed.

Could you elaborate a bit on why run_constrained would be needed here? To be clear, am proposing that matplotlib would always depend on matplotlib-nogui.

As a side note, matplotlib-nogui is a misnomer given tk is around. Maybe it should be called matplotlib-tk instead?

epruesse commented 7 years ago

Two packages with one essentially being a subset of the other would just add more space consumed ($CONDA/pkgs/...). Also, if it actually is a subset, splitting should not be too difficult. There may be settings regarding available backends stored somewhere. And if that is the case, the order in which the packages are installed matters if they overwrite one another.

epruesse commented 7 years ago

Regarding package names, I see the rationale that tk would always be there, but package authors are users too, and matplotlib vs matplotlib-tk makes the uninitiated choose the former. Whereas matplotlib-nogui is an easy choice for any tool that doesn't offer a gui (even if the gui will work with tk). matplotlib-minimal sounds like it's missing things.

So matplotlib-core maybe, with matplotlib installing both matplotlib-gui-backends and matplotlib-core?

ChrisBarker-NOAA commented 7 years ago

On Mon, Sep 11, 2017 at 5:48 PM, Elmar Pruesse notifications@github.com wrote:

Regarding package names, I see the rationale that tk would always be there,

If we can strip out qt, etc, and it's worth doing, then why not strip out the tk back-end too?

makes the uninitiated choose the former. Whereas matplotlib-nogui is an easy choice for any tool that doesn't offer a gui (even if the gui will work with tk). matplotlib-minimal sounds like it's missing things.

agreed.

So matplotlib-core maybe, with matplotlib installing both matplotlib-gui-backends and matplotlib-core?

matplotlib-core would be OK, too, but that almost sounds incomplete.

-CHB

ChrisBarker-NOAA commented 7 years ago

@epruesse: yes, install size really does matter -- exactly why I don't want f-ing QT installed in most of my environments, even though I never use it!

But how big is the actual back-end, vs the dependencies -- both qt and wx are pretty darn large (and pyGTK?), but I'm not sure how large the back-ends in MPL are. I'll go look...

Now I'm confused, I thought MPL installed qt by default, but just now I tried it, and got:

The following NEW packages will be INSTALLED:

cycler:          0.10.0-py36_0 conda-forge
freetype:        2.7-1         conda-forge
libpng:          1.6.28-1      conda-forge
matplotlib:      2.0.2-py36_2  conda-forge
mkl:             2017.0.3-0               
numpy:           1.13.1-py36_0            
python-dateutil: 2.6.1-py36_0  conda-forge
pytz:            2017.2-py36_0 conda-forge
tornado:         4.5.2-py36_0  conda-forge

and, well, we'll need all those (except maybe tornado -- what is that used for? I know it's used by jupyter, and thus probably the mpl inline, but by itself?? the web backend?)

anyway, that's a lot -- is removing the back-end code going to help in any meaningful way? I just created a new environment, with just matplotlib (and all its deps) on OS-X:

558MB !!

The backends dir is 2.5MB --kinda big, but 0.4% of the whole install, and probably half of that is required anyway.
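As a quick check of that percentage, here is the arithmetic as a short sketch. The two sizes are the ones quoted above; the `share_percent` helper is just an illustrative name.

```python
BACKENDS_MB = 2.5     # size of the backends directory, from the comment
ENVIRONMENT_MB = 558  # size of the fresh matplotlib environment on OS-X


def share_percent(part_mb, total_mb):
    """Share of the total taken up by one part, in percent."""
    return 100.0 * part_mb / total_mb


print(f"{share_percent(BACKENDS_MB, ENVIRONMENT_MB):.1f}%")  # 0.4%
```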

So what problem are we trying to solve??

I add ipython, and it's 603 MB -- so another 45 MB for ipython

and still no qt -- maybe that was fixed -- thanks!

Jupyter, on the other hand, brings in qt and pyqt (and the qtconsole)--anyone know why?!?

qt is by far the biggest part of that (ICU pretty big, too)

now up to 1.23GB -- so another 600MB! a lot of that QT.

So we really need to not install QT by default with jupyter!

-CHB

jakirkham commented 7 years ago

So matplotlib-core maybe, with matplotlib installing both matplotlib-gui-backends and matplotlib-core?

👍

mingwandroid commented 7 years ago

@ChrisBarker-NOAA, are you using hardlinks here? These costs should be only once.

I am not saying I'm opposed to splitting things, but it needs to be done with razor sharp precision and attention to detail.

ChrisBarker-NOAA commented 7 years ago

@ChrisBarker-NOAA, are you using hardlinks here? These costs should be only once.

Sure -- that's the beauty of conda environments :-)

But I was responding to another post that size matters :-) -- and not only absolute size, but number of links.

I am not saying I'm opposed to splitting things, but it needs to be done with razor sharp precision and attention to detail.

I'm not either, but I'm still quite confused as to what we are gaining -- it looks like pretty trivial gains to me. And non-trivial pain.

But once figured out, probably not a big deal to maintain, so if someone wants to do it -- why not?

epruesse commented 7 years ago

Hardlinks don't work on all filesystems. Symlinks don't work with all tools. So if you've got a system conda installation or in some docker cases or if you use things like snakemake which place the environments with the data you're down to copying.

mingwandroid commented 7 years ago

People, please never ever use always_include_files to allow you to create packages with overlapping files.

As soon as you install both then remove one, the overlapping file is removed.

In fact, conda-build errors out if you try to create split packages with overlapping files for this reason (and others).
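The failure mode can be reproduced with a toy model. The sketch below is not conda's implementation; it assumes a package manager that records a file list per package and removes exactly that list on uninstall, without tracking shared ownership. The `ToyPrefix` class and the file paths are hypothetical.

```python
# Toy model of the failure mode described above; not conda's actual code.
# Package names and file paths are hypothetical.

class ToyPrefix:
    def __init__(self):
        self.files = set()    # files currently present in the environment
        self.manifests = {}   # package name -> set of files it shipped

    def install(self, name, files):
        self.manifests[name] = set(files)
        self.files |= set(files)  # overlapping files are silently overwritten

    def remove(self, name):
        # Each package removes everything in its own manifest, without
        # checking whether another installed package ships the same file.
        self.files -= self.manifests.pop(name)

    def broken_packages(self):
        """Installed packages that are now missing some of their files."""
        return sorted(
            name for name, files in self.manifests.items()
            if not files <= self.files
        )


prefix = ToyPrefix()
prefix.install(
    "matplotlib-nogui",
    {"matplotlib/__init__.py", "matplotlib/matplotlibrc"},
)
prefix.install(
    "matplotlib",
    {"matplotlib/matplotlibrc", "matplotlib/backends/qt.py"},
)
prefix.remove("matplotlib")

print(prefix.broken_packages())  # ['matplotlib-nogui']
```

After removing the second package, the first one is still registered as installed but has lost its matplotlibrc, which is the breakage being warned about.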

epruesse commented 7 years ago

@mingwandroid Yes, that one is dangerous. There is a reason why dpkg has a database to know which files belong to which package and refuses to install over existing files from another package, and why Debian has that awkward alternatives symlinking mechanism to deal with cases where packages need to occupy overlapping filesystem namespace.

nanoant commented 6 years ago

Any update on this? Seems we are ready for optional deps as this has been merged https://github.com/conda/conda/pull/4982

marcelm commented 6 years ago

If you simply want to reduce size and installation time, it helps a lot to avoid the MKL packages (@epruesse mentioned these above) by installing the nomkl package/feature along with matplotlib.

On my Linux system, an environment created with conda create -n mpl matplotlib takes up 1.5 GB, but with conda create -n mpl matplotlib nomkl, it occupies 690 MB. I don’t know how to easily check download sizes, but I assume they would also be cut roughly in half. It’s possible this works only on Linux.

jakirkham commented 6 years ago

Maybe we can build with all backends and use run_constrained to ensure correct version constraints are matched if those optional backends are installed. Thoughts?

rth commented 6 years ago

In #157 a setup without PyQt was proposed. Could it be worth submitting that PR as a new feedstock for matplotlib-core as a starting point? This would address this issue for people wanting to use another Qt backend. Then update this feedstock to depend on matplotlib-core (and eventually incorporate other ideas proposed above)?

marcelm commented 6 years ago

I’m actually suggesting to not create a matplotlib-core package, just to drop the pyqt dependency from the normal matplotlib package. The Tk backend would always be available for anyone who wants to do interactive plotting. (As argued above, Tk is a dependency of Python anyway, so it wouldn’t make sense to split it out.)

rth commented 6 years ago

Yes, pip install matplotlib does not pull PyQt, but still, I have a feeling that having a matplotlib-core package would be more consistent with the default conda channel, and other packages in the conda-forge ecosystem that went that way (e.g. https://github.com/conda-forge/dask-feedstock/issues/22#issuecomment-318903133 ).

Having matplotlib not pull PyQt will have as practical effect to change the backend from PyQt5 to Tk for most conda-forge users, and I'm not sure that's desirable.

jakirkham commented 6 years ago

That's why I'm suggesting build in the presence of all backends and ensure their requirements are matched should users install them, but don’t require any of them. Thus users wanting PyQt5 can explicitly request it.

Part of the issue with matplotlib-core is many packages require matplotlib. So breaking out matplotlib doesn’t actually solve the issue if these downstream packages are needed.

marcelm commented 6 years ago

I noticed that the presence of the pyqt module at build time doesn’t actually make a difference to the resulting package. Matplotlib always builds the backend (or rather: installs the necessary .py files). I assume this is the case also for the other interactive backends. This possibly makes things a bit simpler.

I tested this by removing pyqt from the build requirements: The only difference (in non-.so files) was that the default backend in the matplotlibrc file was TkAgg in one case and Qt5Agg in the other.

It probably still makes sense to have the backend(s) in the build environment so matplotlib can check their versions.

marcelm commented 6 years ago

Here’s what the consequences of merging #157 would be:

Downsides are:

I’d like to focus on the Tk and Qt5Agg backends at the moment since that is what the current recipe supports.

rth commented 6 years ago

I noticed that the presence of the pyqt module at build time doesn’t actually make a difference to the resulting package.

Nice, I was wondering about that.

The only difference (in non-.so files)

That was pretty much my question in https://github.com/conda-forge/matplotlib-feedstock/issues/155: are we sure the backend available at build time doesn't impact the generated .so files? Otherwise, why would this feedstock include PyQt at build time (apart from the matplotlibrc generation)? I mean it seems to work in practice but is this the right thing to do?

was that the default backend in the matplotlibrc file was TkAgg in one case and Qt5Agg in the other.

Wouldn't that mean that even if PyQt5 is installed, the default backend will still be Tk, so users would need to modify their code to use PyQt5 backend?

Scripts relying on matplotlib’s default interactive backend being Qt5Agg break. I don’t know how this is even possible. The only scenario I can come up with is a PyQt application that embeds matplotlib widgets, but I guess even then it would use the Qt5Agg backend explicitly.

A rough estimate could be the ratio of the results of the "import+matplotlib.backends+qt" Github search query to the "matplotlib" one. Probably overestimated but that could be of the order of 33k / 2.4M ~ 1.3% -- possibly low enough to ignore.

marcelm commented 6 years ago

The only difference (in non-.so files) That was pretty much my question in #155, are we sure the backend available at build time doesn't impact the generated .so files?

All .so files were different between the two builds, even those not related to backends. I’m assuming that a timestamp is embedded in the files. However, there’s no .so file related to the Qt backends anyway, everything appears to be in pure Python. And I did verify that it works to install PyQt/PySide/PySide2 afterwards, so I think we’re good.

If I interpret the setup routines in matplotlib 2.2.2 correctly, then the only extension for which an .so is compiled is GTKAgg, and that one is deprecated and was removed recently.

Wouldn't that mean that even if PyQt5 is installed, the default backend will still be Tk, so users would need to modify their code to use PyQt5 backend?

Yes, this is correct. I noticed that, too. Note that this problem exists no matter how the package is split: Even a matplotlib-core package would need to come with a matplotlibrc configuration file that sets a backend that is not Qt. A matplotlib-qt that has matplotlib-core as a dependency would not be able to override it.

The python3-matplotlib package in Ubuntu 18.04 sets TkAgg as default backend, by the way.

Probably overestimated but that could be of the order of 33k / 2.4M ~ 1.3% -- possibly low enough to ignore.

Thanks for looking into these statistics! I have to admit I’m biased as I only ever use matplotlib non-interactively or within Jupyter notebooks, but I would also say this is acceptable.

rth commented 6 years ago

Thanks for investigating this @marcelm !

Wouldn't that mean that even if PyQt5 is installed, the default backend will still be Tk Yes, this is correct. I noticed that, too. Note that this problem exists no matter how the package is split.

This is problematic IMO: maybe patching setup.py in

https://github.com/matplotlib/matplotlib/blob/ff6786446953931afe9491a61859f055232d7ca2/setup.py#L232-L232

to use Qt5 by default, then if it's not found let matplotlib fall back to Qt4, PySide, Tk, etc. using its backend detection mechanism could be a solution?
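A rough sketch of that fallback idea, using only the standard library: walk a preference list and pick the first backend whose GUI binding can be located. The preference order, the backend-to-module mapping, and the `pick_backend` helper are assumptions for illustration; this is not matplotlib's own detection code.

```python
import importlib.util

# Preference order and module names are assumptions for illustration only;
# this is not matplotlib's own detection logic.
CANDIDATES = [
    ("Qt5Agg", "PyQt5"),
    ("Qt4Agg", "PyQt4"),
    ("Qt4Agg", "PySide"),
    ("TkAgg", "tkinter"),
]


def module_available(name):
    """True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_backend(candidates=CANDIDATES, available=module_available, fallback="Agg"):
    """Return the first backend whose binding is available, else the fallback."""
    for backend, module in candidates:
        if available(module):
            return backend
    return fallback  # non-interactive backend when no GUI toolkit is found


# The result depends on what is installed in the current environment.
print(pick_backend())
```

With a scheme like this, installing PyQt5 later would change the selected backend without anyone having to edit the matplotlibrc shipped in the package.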

jakirkham commented 6 years ago

The matplotlibrc typically lives in the user's home directory (though it can live in other places as well). So users should be able to easily override the default backend this way. Not to mention, user code can already call matplotlib.use to change this.

ref: https://matplotlib.org/users/customizing.html#the-matplotlibrc-file
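For illustration, a user-level override could look roughly like the sketch below. It writes a one-line matplotlibrc and points the MATPLOTLIBRC environment variable at it. The temporary directory and the `write_matplotlibrc` helper are hypothetical, and the Qt5Agg choice is only an example.

```python
import os
import tempfile
from pathlib import Path


def write_matplotlibrc(directory, backend):
    """Write a minimal matplotlibrc that only sets the default backend."""
    rc_path = Path(directory) / "matplotlibrc"
    rc_path.write_text(f"backend: {backend}\n")
    return rc_path


# A temporary directory is used so this sketch does not touch real config.
config_dir = tempfile.mkdtemp()
rc_file = write_matplotlibrc(config_dir, "Qt5Agg")

# Must be set before matplotlib is imported for the file to be picked up.
os.environ["MATPLOTLIBRC"] = str(rc_file)

print(rc_file.read_text())

# Alternatively, in user code and before importing pyplot:
#   import matplotlib
#   matplotlib.use("Qt5Agg")
```

Either route leaves the packaged default untouched, so the choice of default in the recipe only matters for users who never configure a backend themselves.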

nehaljwani commented 6 years ago

both qt and wx are pretty darn large

I can talk about wx. The reason why those binaries are so large in Conda Forge, is because by default the build is a debug build. While building it, I disable the debug flag. See https://github.com/AnacondaRecipes/wxpython-feedstock/blob/4.0.2/recipe/0003-Don-t-enable-debug-info-for-all-builds.patch . The binaries on defaults are half the size compared to the ones in conda-forge and the tarball size is less than one-third.

Opened an issue: https://github.com/conda-forge/wxpython-feedstock/issues/21