Closed. @dsherry closed this issue 4 years ago.
Note @jeremyliweishih we should decide if this sort of thing is in-scope for the pipeline project or if we want to punt it
I think it should be its own one-off issue and not part of the requirements for the pipeline project, since there isn't much tie-in with the rest of the project. We can always add the pipeline project as a blocker for a long-term fix and land a simple patch first (include all third-party deps for now).
That sounds good to me. In that case I think we should consider this as blocked on the pipeline project, because the solution could change depending on what we change about pipelines
@kmax12:
My instinct here is that we should complete the phase 1 pipeline project @jeremyliweishih is working on, then circle back to this issue and build a sustainable way to make third-party dependencies optional.
And until then, we may continue to merge things like #247 (which add new third-party deps to requirements.txt but also add rudimentary error handling as a fallback), with the understanding we'll circle back and make them optional when we address this issue.
Do you have an opinion on that plan?
@kmax12 said in meeting today: `pip install evalml[complete]`. And evalml's `import_or_raise` is helpful to use, as Angela did in the catboost PR.

Adding additional notes from comments made in #247:
@dsherry mentioned that `cb_error_msg` (the error message raised when catboost isn't installed) and other similar messages could be an attribute of `Estimator`. For example, the error message could be stored as a class attribute.

I think it could even be worth having this at the `ComponentBase` level, so that any component requiring a third-party library could use this framework.
I'll try to knock this out this week. Next step is to list out reqs and desired behavior, and decide on an API.
@rwedge and I discussed this yesterday.
Background on configuring installation via setuptools

Setuptools supports "extras", which we could use to do something like `pip install evalml[complete]`, configured to install extra packages on top of `pip install evalml`.

Note it appears this mechanism supports installing additional packages, and potentially updating the versions of previously installed packages, but not uninstalling packages. That's fine.
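Wired into `setup.py`, an extras target looks roughly like this (package lists and versions are illustrative, not evalml's actual metadata):

```python
# setup.py sketch: `pip install evalml` installs only install_requires;
# `pip install evalml[complete]` additionally installs the "complete" extras.
from setuptools import find_packages, setup

setup(
    name="evalml",
    packages=find_packages(),
    install_requires=["scikit-learn"],        # always installed
    extras_require={
        "complete": ["xgboost", "catboost"],  # only with evalml[complete]
    },
)
```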
Installation options

Here are some ways we could use pip extras to rearrange our required dependencies so third-party libs can optionally be excluded:

1. `evalml`: only install sklearn. `evalml[complete]`: also include third-party (xgboost/catboost).
2. `evalml`: do nothing / error. `evalml[minimal]`: only install sklearn. `evalml[complete]`: also include third-party.
3. `evalml` includes third-party but `evalml[minimal]` does not; I'm unsure whether setuptools can support this, since extras add packages rather than exclude them.

Option 1 feels the most appealing. I'm unaware of any good options for handling this stuff beyond sticking with our current setuptools setup.
Code support options

Not necessarily mutually exclusive:

1. `import_or_raise` at fit-time; have automl skip on failure.

Option 1 feels best to me.
Questions for consideration

What's the default `pip install` target? We'll need to have test coverage for both.

Proposal
- Make third-party deps optional: `pip install evalml` won't install them but `pip install evalml[complete]` will.
- Components use `import_or_raise` (already done).
- Automl skips pipelines whose `import_or_raise` fails.

Advantages: allows users to install evalml without third-party libraries. Disadvantages: the default install isn't as powerful; we'd have to support two install versions; it could be ugly setting up automl to skip failed pipelines.
Long-term, it would be neat if we can find a solution which raises an error or warning earlier than fit-time.
@kmax12 do you have an opinion on this issue?
I did a review of all the dependencies in `requirements.txt`. I included the size of each library's `lib` dir in the virtualenv on my mac, to gauge importance.
Packages which are required by evalml and would be quite difficult to make optional:
Packages which could potentially be removed:
Conclusion: we could reduce installer size by a lot if we make more things optional.
Does this affect our plan for handling third-party pipelines? We could move towards supporting multiple extras:

- `pip install evalml`: minimal only, no third-party blueprints, no plotting, no distributed support
- `pip install evalml[thirdparty]`: include third-party blueprints
- `pip install evalml[plotting]`: minimal, plus plotting
- `pip install evalml[distributed]`: minimal, plus distributed support
- `pip install evalml[plotting,distributed,thirdparty]` or `pip install evalml[complete]`: include everything

This would work fine. It feels overly complicated, though. And it raises concerns about testing: we'd at least need an integration test for each to ensure the library works with that subset of deps.
An alternative would be to only support two targets: the minimal default and `complete`, which includes everything else. Users could then install specific dependencies by hand if they wanted a subset of `complete`.
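If we ever did keep per-feature extras alongside `complete`, one way to avoid the lists drifting apart is to compute `complete` as the union of the others. A sketch with hypothetical package lists:

```python
# Hypothetical extras_require: "complete" is derived from the per-feature
# extras, so it can never fall out of sync with them.
extras_require = {
    "thirdparty": ["xgboost", "catboost"],
    "plotting": ["matplotlib"],
    "distributed": ["dask"],
}
extras_require["complete"] = sorted(
    {dep for deps in extras_require.values() for dep in deps}
)
```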
Solution discussed with @kmax12:

- `pip install --no-deps evalml`, then install specific deps manually.

An option to consider for the long term: build two `setup.py` packages, `evalml-base` (minimal) and `evalml` (everything).
We currently have at least one third-party library dependency (xgboost) and another on the way (catboost #247). This issue tracks figuring out how we present that to users.
Questions:
My current thought is that the code @angela97lin is adding in the catboost PR is a good start, i.e. we have each estimator or component throw an error when the underlying library is missing. But I think we can do more than that too. We could catch that error in the automl, skip the pipeline in question, perhaps print a warning, and continue the search. We could also ask users to install third-party deps by default, by including them in `requirements.txt`, which should cover the majority of cases.

Adding related comments made in #247 to keep things in one place:
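The skip-and-warn behavior described above could look roughly like this (hypothetical names; evalml's actual automl loop will differ):

```python
import warnings


def search(pipelines, X, y):
    """Fit each pipeline, skipping any whose third-party dependency is missing."""
    fitted = []
    for pipeline in pipelines:
        try:
            pipeline.fit(X, y)
        except ImportError as err:
            # Missing third-party lib: warn and continue the search.
            warnings.warn(f"Skipping {type(pipeline).__name__}: {err}")
            continue
        fitted.append(pipeline)
    return fitted
```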