Closed pelson closed 6 years ago
Thanks @JanSchulz. I'd seen that before, but I like your recent comment.
Here are some discussions on this matter in conda-recipes repository:
Here is also my recent update for our internal use since we care about not having libX.so
on our headless servers/containers: https://github.com/salford-systems/conda-recipes/tree/master/matplotlib-nogui (here is the repo: http://anaconda.org/salford_systems/matplotlib-nogui)
Would it be so bad to not have qt as a dep -- and folks simply need to install qt if they want to use the qt back-end?
tk should be there, as we want SOME default UI.
Bump....
How would you want to go about this?
matplotlib-nogui
mess with multiple outputs
This is probably the way we should go.
sigh
I agree, but don't like it.
Please elaborate. We should weigh all our options. (And if possible fix the issues with the outputs option in case there are any.)
Well, primarily because it's way more work. The way `outputs:` works in at least conda <3 makes it tedious to use. I'll also have to build `matplotlib` with all modules enabled and then split it by hand and hope the tests catch all issues. And I'm not familiar enough with it to tell whether there are internal flags set, plus the manual splitting is something that needs to be reviewed with every upstream release.
I do agree that the cleanest way would be to have
- `matplotlib-core` (no backends)
- `matplotlib-qt5`
- `matplotlib-qt4`
- `matplotlib-gtk2`
- `matplotlib-gtk3`
- `matplotlib-tk`
- `matplotlib` depending on all of them (the latter because existing packages may depend on matplotlib coming with everything)
Well, primarily because it's way more work.
I haven't looked into that yet but I believe you. I did see some examples that seemed to have more boilerplate than they should. (Caveat: I am biased toward how things are done with RPMs.)
The way `outputs:` works in at least conda <3 makes it tedious to use. I'll also have to build `matplotlib` with all modules enabled and then split it by hand and hope the tests catch all issues.
I guess that this would be true somehow in the other approaches.
I used to package matplotlib for OpenSUSE exactly like you propose above. In my experience it was not worth it. Maybe only the two packages: `-core` and all backends? What do you think?
What about building `matplotlib` with everything and then using `run_constrained` to pin the backends correctly when they are installed? This would result in the following: all backends are optional, all backends can work, pinnings are correct when backends are installed, and no backends are shipped by default. This will require `conda-build` 3, but we do need to switch at some point.
cc @msarahan
I used to package matplotlib for OpenSUSE exactly like you propose above. In my experience it was not worth it. Maybe only the two packages: -core and all backends? What do you think?
I concur. `matplotlib` with everything and `matplotlib-nogui` would cover all use-cases that come to mind. That's why I seriously considered the "KISS" approach of a second package.
@jakirkham Can we build with conda-build 3 yet? Will those packages work with conda 2.* installations?
That's why I seriously considered the "KISS" approach of a second package.
We can bypass the conda-build pinning and use conda-build 3 on a case-by-case basis until we move completely to conda-build 3.
I understand your thinking entirely but I believe we should give `outputs` a shot and send some feedback upstream. If a new package will indeed always be easier then we can drop `outputs`.
(I am interested in doing that myself, just don't know when I'll have the time to do it.)
I'm not sure `run_constrained` helps a lot here. `conda install matplotlib-core pyqt5` to get a working backend is anything but straightforward, so a package including the GUI backends is required anyway, even if only as meta-package. The binary parts need to go somewhere too, and for size reasons preferably not as ballast into the non-gui package, so we'd end up with two packages either way, wouldn't we?
Was thinking about that too. Though `python` always pulls in `tk`. So one would end up with a `tk`-backend by default anyways.
That said, starting with a `matplotlib-nogui` package would be totally reasonable and could easily be integrated into any of these solutions that we settle on and develop later.
I think it'll have to be `outputs:`. Conda has no `provides:` or `conflicts:` type dependency rules, so the two packages can't occupy the same namespace within `site-libs`, plus you wouldn't want packages depending on the GUI backends and those not requiring them to be mutually exclusive.
First -- I apologize for having no idea what "outputs" does...
But -- what is the goal of splitting it up?
possible goals:
1) making the packages smaller and easier to build -- that's going to require really splitting things up -- I'm guessing more work than it's worth. I'd probably only do that if MPL itself wants to better support that.
2) making the dependencies lighter weight: -- i.e. if I am using matplotlib for a headless application, I really shouldn't have to install pyqt, etc.
For that use case, we can still have a single complete package, say matplotlib_nogui, and then have other packages that are only dependencies. so you install:
matplotlib_qt
And it has only dependencies: matplotlib_nogui pyqt ...
or maybe only have a "matplotlib" package that depends on everything - that's what it will have to be in the short term anyway.
Though I'd love to go with conda install "matplotlib" and NOT get qt or anything optional
To some extent, I'm not sure it's wrong for the matplotlib package NOT to have the dependency on qt or GTK, or wx, or... if you are building an application or package with QT, you are going to know that, and can add the dependency yourself.
outputs are conda-build's way of emitting more than one package from a single recipe. Docs are at https://conda.io/docs/user-guide/tasks/build-packages/define-metadata.html#outputs-section
Conda 4.4 will add some support for optional dependencies. More info at https://github.com/conda/conda/issues/3299 and https://github.com/conda/conda/pull/4982
Conda 4.4 will add some support for optional dependencies
nice! I would like that.
Thinking a bit more -- do the GUI back-ends add much overhead at all anyway? i.e. is there any reason NOT to build MPL with support for all of them?
IIRC, it was a pain back in the day to build MPL with wx support, because you needed the right wx installed, and then it was a hard dependency on that version (and before conda or even wheels, that was a major pain!).
But once we are set up to build the full thing -- it didn't add a big binary or anything -- the big stuff is the GUI package itself.
-CHB
Yes, they pull in a lot of extra dependencies. This is painful for applications that have narrow size limits. That said, I would be ok building with all of the GUI dependencies as long as they can be made optional somehow.
I think it'll have to be `outputs:`. Conda has no `provides:` or `conflicts:` type dependency rules, so the two packages can't occupy the same namespace within `site-libs`, plus you wouldn't want packages depending on the GUI backends and those not requiring them to be mutually exclusive.
IDK could have `matplotlib` depend on `matplotlib-nogui` and use `always_include_files` in `matplotlib` to forcibly overwrite (relevant or all) contents of `matplotlib-nogui`.
@ChrisBarker-NOAA
In numbers: 830MB for matplotlib total, of that 400MB for MKL (a dozen binaries specific to various architectures) and 230MB Qt5, sometimes a lot more, depending on how conda resolves the chain.
It's also not just the space occupied that causes issues: The dependency resolver in conda is horribly slow and can easily get stuck trying to find a solution in a large constraint tree. Fewer dependencies lower the risk of making it try to find a needle in a haystack. Another issue is the sheer number of files. On an SSD-backed workstation it's hardly noticeable, but linking or copying thousands of files to create an environment takes quite a while on NFS or on CI systems.
Things like diligently splitting packages into very small sub-packages, stripping the debug symbols from libraries and placing them in separate packages, compressing large files, etc. sound unimportant when looking at e.g. Debian from the outside. With a "distro" that doesn't do all this rigorously, though, you install "just the bare necessities" (numpy, pandas, mkl, pip, matplotlib) and you've filled nearly 2GB.
Still not an issue on the laptop or workstation, but in a docker run on a CI service this adds up quickly. I've got a project where installing the conda environments costs almost half an hour. At 1500 free minutes a month that's only 50 builds per month I can do before I need to start spending money.
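To make the CI arithmetic above concrete, here is a small sketch. The 1500 free minutes and the half hour per environment install come from the comment; the function name and the 10-minute figure are made up for illustration only.

```python
def builds_per_month(free_minutes, minutes_per_build):
    """Number of whole builds that fit into a monthly CI minute budget."""
    return free_minutes // minutes_per_build


# Figures quoted in the comment: 1500 free minutes, ~30 minutes per install.
print(builds_per_month(1500, 30))

# Hypothetical: if a slimmer environment installed in 10 minutes instead.
print(builds_per_month(1500, 10))
```

The point is only that the budget scales inversely with install time, so every dependency that can be left out buys additional builds.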
@jakirkham I'll try to split the package with `outputs:`. The `always_include_files` way frightens me a little, it would at the very least have to use the `run_constrained` mechanism to make sure we don't get mixed versions of matplotlib installed.
The `always_include_files` way frightens me a little, it would at the very least have to use the `run_constrained` mechanism to make sure we don't get mixed versions of matplotlib installed.
Could you elaborate a bit on why `run_constrained` would be needed here? To be clear, I am proposing that `matplotlib` would always depend on `matplotlib-nogui`.
As a side note, `matplotlib-nogui` is a misnomer given `tk` is around. Maybe it should be called `matplotlib-tk` instead?
Two packages with one essentially being a subset of the other would just add more space consumed (`$CONDA/pkgs/...`). Also, if it actually is a subset, splitting should not be too difficult. There may be settings regarding available backends stored somewhere. And if that is the case, the order in which the packages are installed matters if they overwrite one another.
Regarding package names, I see the rationale that `tk` would always be there, but package authors are users too, and `matplotlib` vs `matplotlib-tk` makes the uninitiated choose the former. Whereas `matplotlib-nogui` is an easy choice for any tool that doesn't offer a gui (even if the gui will work with `tk`). `matplotlib-minimal` sounds like it's missing things.
So `matplotlib-core` maybe, with `matplotlib` installing both `matplotlib-gui-backends` and `matplotlib-core`?
On Mon, Sep 11, 2017 at 5:48 PM, Elmar Pruesse notifications@github.com wrote:
Regarding package names, I see the rationale that tk would always be there,
If we can strip out qt, etc, and it's worth doing, then why not strip out the tk back-end too?
makes the uninitiated choose the former. Whereas matplotlib-nogui is an easy choice for any tool that doesn't offer a gui (even if the gui will work with tk). matplotlib-minimal sounds like it's missing things.
agreed.
So matplotlib-core maybe, with matplotlib installing both matplotlib-gui-backends and matplotlib-core?
matplotlib-core would be OK, too, but that almost sounds incomplete.
-CHB
@epruesse: yes, install size really does matter -- exactly why I don't want f-ing QT installed in most of my environments, even though I never use it!
But how big is the actual back-end, vs the dependencies -- both qt and wx are pretty darn large (and pyGTK?), but I'm not sure how large the back-ends in MPL are. I'll go look...
Now I'm confused, I thought MPL installed qt by default, but just now I tried it, and got:
The following NEW packages will be INSTALLED:
cycler: 0.10.0-py36_0 conda-forge
freetype: 2.7-1 conda-forge
libpng: 1.6.28-1 conda-forge
matplotlib: 2.0.2-py36_2 conda-forge
mkl: 2017.0.3-0
numpy: 1.13.1-py36_0
python-dateutil: 2.6.1-py36_0 conda-forge
pytz: 2017.2-py36_0 conda-forge
tornado: 4.5.2-py36_0 conda-forge
and, well, we'll need all those (except maybe tornado -- what is that used for? I know it's used by jupyter, and thus probably the mpl inline, but by itself?? the web backend?)
anyway, that's a lot -- is removing the back-end code going to help in any meaningful way? I just created a new environment, with just matplotlib (and all its deps) on OS-X:
558MB !!
The backends dir is 2.5MB --kinda big, but 0.4% of the whole install, and probably half of that is required anyway.
So what problem are we trying to solve??
I add ipython, and it's 603 MB -- so another 45 MB for ipython
and still no qt -- maybe that was fixed -- thanks!
Jupyter, on the other hand, brings in qt and pyqt (and the qtconsole)--anyone know why?!?
qt is by far the biggest part of that (ICU pretty big, too)
now up to 1.23GB -- so another 600MB! a lot of that QT.
So we really need to not install QT by default with jupyter!
-CHB
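The size figures quoted above (a 2.5MB backends directory inside a 558MB environment) can be reproduced with a few lines of Python. This is a rough sketch: `tree_size` and `share_percent` are hypothetical helper names, and the walk counts every hardlinked file at full size, so it reports apparent size rather than extra disk usage.

```python
import os


def tree_size(path):
    """Sum the sizes in bytes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so their targets are not counted.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def share_percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Numbers from the comment: backends dir vs. the whole environment, in MB.
print(round(share_percent(2.5, 558), 1))
```

Pointing `tree_size` at an environment prefix and at its `matplotlib/backends` directory would give the two inputs for `share_percent`.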
So `matplotlib-core` maybe, with `matplotlib` installing both `matplotlib-gui-backends` and `matplotlib-core`?
👍
@ChrisBarker-NOAA, are you using hardlinks here? These costs should be only once.
I am not saying I'm opposed to splitting things, but it needs to be done with razor sharp precision and attention to detail.
@ChrisBarker-NOAA, are you using hardlinks here? These costs should be only once.
Sure -- that's the beauty of conda environments :-)
But I was responding to another post that size matters :-) -- and not only absolute size, but number of links.
I am not saying I'm opposed to splitting things, but it needs to be done with razor sharp precision and attention to detail.
I'm not either, but I'm still quite confused as to what we are gaining -- it looks like pretty trivial gains to me. And non-trivial work.
But once figured out, probably not a big deal to maintain, so if someone wants to do it -- why not?
Hardlinks don't work on all filesystems. Symlinks don't work with all tools. So if you've got a system conda installation or in some docker cases or if you use things like snakemake which place the environments with the data you're down to copying.
People, please never ever use `always_include_files` to allow you to create packages with overlapping files.
As soon as you install both then remove one, the overlapping file is removed.
In fact, `conda-build` errors out if you try to create split packages with overlapping files for this reason (and others).
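The failure mode described above can be illustrated with a toy model. This is not conda's implementation: the `Prefix` class, its methods, and the file lists are invented for this sketch, which only assumes that removing a package deletes every file in that package's manifest.

```python
class Prefix:
    """Toy environment: a set of files on disk plus per-package manifests."""

    def __init__(self):
        self.files = set()
        self.manifests = {}

    def install(self, package, files):
        # Record what the package ships and write it into the prefix.
        self.manifests[package] = set(files)
        self.files |= set(files)

    def remove(self, package):
        # Naive removal: delete everything the package listed, without
        # checking whether another installed package ships the same file.
        self.files -= self.manifests.pop(package)

    def broken(self):
        """Map each installed package to the files it is now missing."""
        result = {}
        for package, manifest in self.manifests.items():
            missing = manifest - self.files
            if missing:
                result[package] = sorted(missing)
        return result


env = Prefix()
env.install("matplotlib-nogui", ["matplotlib/__init__.py",
                                 "matplotlib/backends/backend_agg.py"])
env.install("matplotlib", ["matplotlib/__init__.py",
                           "matplotlib/backends/backend_qt5agg.py"])
env.remove("matplotlib")
print(env.broken())
```

After removing `matplotlib`, the shared `matplotlib/__init__.py` is gone even though `matplotlib-nogui` is still installed and still needs it.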
@mingwandroid Yes, that one is dangerous. There is a reason why `dpkg` has a database to know which files belong to which package and refuses to install over existing files from another package, and why Debian has that awkward `alternatives` symlinking mechanism to deal with cases where packages need to occupy overlapping filesystem namespace.
Any update on this? Seems we are ready for optional deps as this has been merged https://github.com/conda/conda/pull/4982
If you simply want to reduce size and installation time, it helps a lot to avoid the MKL packages (@epruesse mentioned these above) by installing the `nomkl` package/feature along with matplotlib.
On my Linux system, an environment created with `conda create -n mpl matplotlib` takes up 1.5 GB, but with `conda create -n mpl matplotlib nomkl`, it occupies 690 MB. I don’t know how to easily check download sizes, but I assume they would also be cut roughly in half. It’s possible this works only on Linux.
Maybe we can build with all backends and use `run_constrained` to ensure correct version constraints are matched if those optional backends are installed. Thoughts?
In #157 a setup without PyQt was proposed. Could it be worth submitting that PR as a new feedstock for `matplotlib-core` as a starting point? This would address this issue for people wanting to use another Qt backend. Then update this feedstock to depend on `matplotlib-core` (and eventually incorporate other ideas proposed above)?
I’m actually suggesting to not create a `matplotlib-core` package, just to drop the pyqt dependency from the normal `matplotlib` package. The Tk backend would always be available for anyone who wants to do interactive plotting. (As argued above, Tk is a dependency of Python anyway, so it wouldn’t make sense to split it out.)
Yes, `pip install matplotlib` does not pull PyQt, but still, I have a feeling that having a `matplotlib-core` package would be more consistent with the default conda channel, and other packages in the conda-forge ecosystem that went that way (e.g. https://github.com/conda-forge/dask-feedstock/issues/22#issuecomment-318903133 ).
Having `matplotlib` not pull PyQt will have the practical effect of changing the backend from PyQt5 to Tk for most conda-forge users, and I'm not sure that's desirable.
That's why I'm suggesting build in the presence of all backends and ensure their requirements are matched should users install them, but don’t require any of them. Thus users wanting PyQt5 can explicitly request it.
Part of the issue with `matplotlib-core` is many packages require `matplotlib`. So breaking out `matplotlib` doesn’t actually solve the issue if these downstream packages are needed.
I noticed that the presence of the `pyqt` module at build time doesn’t actually make a difference to the resulting package. Matplotlib always builds the backend (or rather: installs the necessary `.py` files). I assume this is the case also for the other interactive backends. This possibly makes things a bit simpler.
I tested this by removing `pyqt` from the build requirements: The only difference (in non-`.so` files) was that the default backend in the `matplotlibrc` file was TkAgg in one case and Qt5Agg in the other.
It probably still makes sense to have the backend(s) in the build environment so matplotlib can check their versions.
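A comparison like the one described above can be scripted. The following is a sketch rather than the procedure actually used: `file_hashes` and `differing_files` are hypothetical helpers, and the two roots would be the unpacked contents of the package built with and without `pyqt` in the build requirements.

```python
import hashlib
import os


def file_hashes(root, skip_suffixes=(".so",)):
    """Map each relative file path below `root` to its SHA-256 digest."""
    hashes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            # Compiled extensions are skipped, mirroring the "non-.so" check.
            if name.endswith(skip_suffixes):
                continue
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            hashes[os.path.relpath(full, root)] = digest
    return hashes


def differing_files(root_a, root_b, skip_suffixes=(".so",)):
    """List relative paths that are missing on one side or differ in content."""
    a = file_hashes(root_a, skip_suffixes)
    b = file_hashes(root_b, skip_suffixes)
    return sorted(rel for rel in set(a) | set(b) if a.get(rel) != b.get(rel))
```

Called as `differing_files("pkg_with_pyqt", "pkg_without_pyqt")` (directory names invented here), this returns the paths whose contents differ; per the comment above, that would be only the `matplotlibrc` file.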
Here’s what the consequences of merging #157 would be:
`pyqt`, `pyside`, `pyside2`, `wxpython` etc. in addition to `matplotlib` and the corresponding backend will be usable.
Downsides are:
`matplotlib.use("Qt5Agg")` to their application.
I’d like to focus on the Tk and Qt5Agg backends at the moment since that is what the current recipe supports.
I noticed that the presence of the pyqt module at build time doesn’t actually make a difference to the resulting package.
Nice, I was wondering about that.
The only difference (in non-.so files)
That was pretty much my question in https://github.com/conda-forge/matplotlib-feedstock/issues/155 , are we sure that the backend available at build time doesn't impact the generated .so files? Otherwise why would this feedstock include PyQt at build time (apart from the matplotlibrc generation)? I mean it seems to work in practice, but is this the right thing to do?
was that the default backend in the matplotlibrc file was TkAgg in one case and Qt5Agg in the other.
Wouldn't that mean that even if PyQt5 is installed, the default backend will still be Tk, so users would need to modify their code to use PyQt5 backend?
Scripts relying on matplotlib’s default interactive backend being Qt5Agg break. I don’t know how this is even possible. The only scenario I can come up with is a PyQt application that embeds matplotlib widgets, but I guess even then it would use the Qt5Agg backend explicitly.
A rough estimate could be the ratio of the results of the "import+matplotlib.backends+qt" Github search query to the "matplotlib" one. Probably overestimated but that could be of the order of 33k / 2.4M ~ 1.3% -- possibly low enough to ignore..
The only difference (in non-.so files)
That was pretty much my question in #155, are we sure that the backend available at build time doesn't impact the generated .so files?
All `.so` files were different between the two builds, even those not related to backends. I’m assuming that a timestamp is embedded in the files. However, there’s no `.so` file related to the Qt backends anyway, everything appears to be in pure Python. And I did verify that it works to install PyQt/PySide/PySide2 afterwards, so I think we’re good.
If I interpret the setup routines in matplotlib 2.2.2 correctly, then the only extension for which an `.so` is compiled is GTKAgg, and that one is deprecated and was removed recently.
Wouldn't that mean that even if PyQt5 is installed, the default backend will still be Tk, so users would need to modify their code to use PyQt5 backend?
Yes, this is correct. I noticed that, too. Note that this problem exists no matter how the package is split: Even a `matplotlib-core` package would need to come with a `matplotlibrc` configuration file that sets a backend that is not Qt. A `matplotlib-qt` that has `matplotlib-core` as a dependency would not be able to override it.
The `python3-matplotlib` package in Ubuntu 18.04 sets TkAgg as default backend, by the way.
Probably overestimated but that could be of the order of 33k / 2.4M ~ 1.3% -- possibly low enough to ignore.
Thanks for looking into these statistics! I have to admit I’m biased as I only ever use matplotlib non-interactively or within Jupyter notebooks, but I would also say this is acceptable.
Thanks for investigating this @marcelm !
Wouldn't that mean that even if PyQt5 is installed, the default backend will still be Tk
Yes, this is correct. I noticed that, too. Note that this problem exists no matter how the package is split.
This is problematic IMO: maybe patching setup.py
in
to use Qt5 by default, then if it's not found let matplotlib fall back to Qt4, PySide, Tk, etc. using its backend detection mechanism could be a solution?
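The fallback order suggested above could look roughly like the sketch below. This is not matplotlib's own detection code: `pick_backend`, `module_installed` and the candidate table are hypothetical, and the only assumptions are the import names of the bindings and the backend names `Qt5Agg`, `Qt4Agg`, `TkAgg` and `Agg`.

```python
from importlib.util import find_spec

# (importable module, matplotlib backend name), in order of preference.
CANDIDATES = [
    ("PyQt5", "Qt5Agg"),
    ("PyQt4", "Qt4Agg"),
    ("PySide", "Qt4Agg"),
    ("tkinter", "TkAgg"),
]


def module_installed(name):
    """Return True if the top-level module `name` could be imported."""
    return find_spec(name) is not None


def pick_backend(candidates=CANDIDATES, installed=module_installed, default="Agg"):
    """Return the first backend whose GUI binding is installed."""
    for module, backend in candidates:
        if installed(module):
            return backend
    # No GUI toolkit found: fall back to the non-interactive backend.
    return default


print(pick_backend())
```

With a helper like this, an environment with PyQt5 installed would get `Qt5Agg`, and a headless one without any toolkit would end up on `Agg`, without `pyqt` being a hard dependency.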
The `matplotlibrc` typically lives in the user's home directory (though it can live in other places as well). So users should be able to easily override the default backend this way. Not to mention, user code can already run `matplotlib.use` to change this.
ref: https://matplotlib.org/users/customizing.html#the-matplotlibrc-file
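As a minimal sketch of the override described above: a `matplotlibrc` is a plain text file with `key: value` lines, so pinning the backend takes a single entry. The helper name `write_matplotlibrc` is made up, and the target directory is left to the caller, since where matplotlib looks for the file (working directory, `MATPLOTLIBRC`, the per-user config directory) depends on platform and version.

```python
import os


def write_matplotlibrc(directory, backend):
    """Write a one-line matplotlibrc that selects `backend`; return its path."""
    path = os.path.join(directory, "matplotlibrc")
    with open(path, "w") as fh:
        fh.write("backend: {}\n".format(backend))
    return path


# Alternative inside a script, before importing pyplot:
#   import matplotlib
#   matplotlib.use("Qt5Agg")
```

Either route leaves the packaged default untouched, which is why a Tk default in the package does not lock anyone out of Qt.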
both qt and wx are pretty darn large
I can talk about wx. The reason why those binaries are so large in Conda Forge, is because by default the build is a debug build. While building it, I disable the debug flag. See https://github.com/AnacondaRecipes/wxpython-feedstock/blob/4.0.2/recipe/0003-Don-t-enable-debug-info-for-all-builds.patch . The binaries on defaults are half the size compared to the ones in conda-forge and the tarball size is less than one-third.
Opened an issue: https://github.com/conda-forge/wxpython-feedstock/issues/21
It would be great to split this recipe into several meta-packages. That way, matplotlib doesn't have to depend on tk / qt etc. in the same way that jupyter doesn't pull in all of the dependencies needed for the notebook.