Error handling for code with third-party dependencies

dsherry commented 4 years ago

We currently have at least one third-party library dependency (xgboost) and another on the way (catboost #247). This issue tracks figuring out how we present that to users.

Questions:

What happens when a user tries to run a pipeline with a third-party dependency which isn't installed?
How can we allow users to opt in or out from using pipelines with third-party dependencies during automl?

My current thought is that the code @angela97lin is adding in the catboost PR is a good start, i.e. we have each estimator or component throw an error when the underlying library is missing. But I think we can do more than that too. We could catch that error in the automl, skip the pipeline in question, perhaps print a warning, and continue the search. We could also ask users to install third-party deps by default, by including them in requirements.txt, which should cover the majority of cases.
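A minimal sketch of the catch-skip-warn-continue idea, assuming each pipeline raises ImportError from fit when its library is missing. The search function and its return shape are hypothetical and are not evalml's actual automl API:

```python
import warnings


def search(pipelines, X, y):
    """Fit each candidate pipeline, skipping any whose dependency is missing.

    Hypothetical stand-in for the automl search loop.
    """
    fitted, skipped = [], []
    for pipeline in pipelines:
        try:
            pipeline.fit(X, y)
        except ImportError as err:
            # Warn and move on instead of aborting the whole search.
            warnings.warn(f"Skipping {type(pipeline).__name__}: {err}")
            skipped.append(pipeline)
            continue
        fitted.append(pipeline)
    return fitted, skipped
```

Catching only ImportError keeps genuine fitting bugs visible; anything else still propagates.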

Adding related comments made in #247 to keep things in one place:

Do we include pipelines with third-party dependencies in the automl search? (I think the fact that we're adding them to the codebase implies "yes" unless we discover they're not useful for some reason)
What should the behavior of third-party pipelines and components be for users who don't have the third-party dependency installed? There are several options: error, warning, silent failure/skip, or do a library check before the automl search starts and don't include that pipeline to begin with (this last one is my favorite)
How do we add test coverage of our answers to the above?
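On the test coverage question, one possible approach is to simulate a missing library inside a unit test rather than maintaining a separate environment. The sketch below uses the stdlib json module as a stand-in for xgboost so it runs anywhere; library_available is a hypothetical helper, not something in the codebase:

```python
import importlib
import sys
from unittest import mock


def library_available(name):
    """Return True if `name` can be imported, False otherwise (hypothetical helper)."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def test_missing_library_is_detected():
    # A None entry in sys.modules makes the import machinery raise ImportError,
    # which simulates an uninstalled package. patch.dict restores the original
    # entry when the block exits.
    with mock.patch.dict(sys.modules, {"json": None}):
        assert not library_available("json")
    assert library_available("json")
```

This only covers the import check itself. An integration test against a real install without the third-party libraries would still be needed to catch stray top-level imports.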

dsherry commented 4 years ago

Note @jeremyliweishih we should decide if this sort of thing is in-scope for the pipeline project or if we want to punt it

jeremyliweishih commented 4 years ago

I think it should be its own one-off issue and not part of the requirements for the pipeline project, as there isn't much tie-in with anything else in the project. We can always add the pipeline project as a blocker for a long-term fix and include a simple patch (include all third-party deps so far) first.

dsherry commented 4 years ago

That sounds good to me. In that case I think we should consider this as blocked on the pipeline project, because the solution could change depending on what we change about pipelines

dsherry commented 4 years ago

@kmax12 :

My instinct here is that we should complete the phase 1 pipeline project @jeremyliweishih is working on, then circle back to this issue and build a sustainable way to make third-party dependencies optional.

And until then, we may continue to merge things like #247 (which add new third-party deps to requirements.txt but also add rudimentary error handling as a fallback), with the understanding we'll circle back and make them optional when we address this issue.

Do you have an opinion on that plan?

dsherry commented 4 years ago

@kmax12 said in meeting today:

He's on board with this plan. Separating third-party deps from the pipeline project
Two categories of third-party deps: 1) libs for modeling pipelines, 2) feature-specific libraries like S3
Look at featuretools: pip install eval[complete]. And evalml
It's good to have a bare-bones installation, which doesn't require everything
There's some deps in there which should be removed altogether, like plotting libs
Next step for this feature is a design doc on how we'd get this done
For now, import_or_raise is helpful to use, as Angela did in catboost PR
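For readers who haven't seen the catboost PR, a helper like import_or_raise could look roughly like the sketch below. The signature and default message are assumptions based on the discussion in this thread, and the version in the codebase may differ:

```python
import importlib


def import_or_raise(library, error_msg=None):
    """Import `library` and return the module, or raise ImportError with a helpful message.

    Sketch only; the signature and default message are assumptions.
    """
    try:
        return importlib.import_module(library)
    except ImportError:
        if error_msg is None:
            error_msg = f"{library} is not installed. Please install using pip install {library}."
        # Suppress the chained traceback so the user sees only the install hint.
        raise ImportError(error_msg) from None
```

An estimator would call this lazily, for example at the top of fit, so that importing the package itself never requires the third-party library.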

angela97lin commented 4 years ago

Adding additional notes from comments made in #247 :

@dsherry mentioned cb_error_msg (error message raised when catboost not installed) and other similar messages could be an attribute of Estimator. For example, we could do the following:

Define a libs variable in Estimator which is an empty list by default
Define class method Estimator.libs_err_msg(lib), which by default takes a format string "{lib} is not installed. Please install using pip install {lib}." and applies the lib argument to it
Have Estimator.__init__ do something like for lib in self.libs: import_or_raise(lib, self.libs_err_msg(lib))
Have each class define libs to be a list with one or more third-party library names
And optionally, each class could define an override of libs_err_msg(lib) if needed

I think it could be even worth having at the ComponentBase level so that any component requiring a third-party library could use this framework.
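The steps listed above, placed at the ComponentBase level, can be sketched as follows. Class and attribute names follow the comment above; the bodies, including the local import_or_raise, are assumptions for illustration and not the actual implementation:

```python
import importlib


def import_or_raise(library, error_msg):
    """Stand-in for the helper from the catboost PR (sketch only)."""
    try:
        return importlib.import_module(library)
    except ImportError:
        raise ImportError(error_msg) from None


class ComponentBase:
    # Third-party libraries this component needs; empty by default.
    libs = []

    @classmethod
    def libs_err_msg(cls, lib):
        # Subclasses may override this to give library-specific instructions.
        return f"{lib} is not installed. Please install using pip install {lib}."

    def __init__(self):
        # Fail early, at construction time, if any declared library is missing.
        for lib in self.libs:
            import_or_raise(lib, self.libs_err_msg(lib))


class CatBoostEstimator(ComponentBase):
    # Hypothetical subclass: declaring the dependency is all a component has to do.
    libs = ["catboost"]
```

Because libs is a class attribute, automl could also inspect it without instantiating the component.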

dsherry commented 4 years ago

I'll try to knock this out this week. Next step is to list out reqs and desired behavior, and decide on an API.

dsherry commented 4 years ago

@rwedge and I discussed this yesterday.

Background on configuring installation via setuptools

Setuptools supports "extras" which we could use to do something like pip install evalml[complete] which can be configured to install extra packages on top of pip install evalml.

Note it appears this mechanism supports installing additional packages, and potentially updating the versions of previously installed packages, but not uninstalling packages. That's fine.

Installation options

Here are some ways we could use pip extras to rearrange our required dependencies to optionally exclude third-party libs:

1. evalml: only install sklearn. evalml[complete]: also include third-party (xgboost/catboost)
2. evalml: do nothing / error. evalml[minimal]: only install sklearn. evalml[complete]: also include third-party.
3. There may be a way to rig things up such that using evalml includes third-party but evalml[minimal] does not, but unsure.

Option 1 feels the most appealing. I'm unaware of any good options for handling this stuff beyond sticking with our current setuptools setup.
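As a rough illustration of option 1, the relevant setup.py pieces might look like the sketch below. install_requires and extras_require are standard setuptools keywords; the requirement lists are illustrative and unpinned, and setup_kwargs is a hypothetical helper used only to keep the sketch importable:

```python
# Installed by a plain `pip install evalml` (illustrative, not the full list).
CORE_REQUIREMENTS = ["scikit-learn"]

# Installed in addition by `pip install evalml[complete]`.
EXTRAS_REQUIRE = {
    "complete": ["xgboost", "catboost"],
}


def setup_kwargs():
    """Collect the keyword arguments that setup.py would pass to setuptools."""
    return {
        "name": "evalml",
        "install_requires": CORE_REQUIREMENTS,
        "extras_require": EXTRAS_REQUIRE,
    }


# In setup.py this would be used as:
#     from setuptools import setup
#     setup(**setup_kwargs())
```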

Code support options

Not necessarily mutually exclusive:

1. Run import_or_raise at fit-time; have automl skip on failure
2. Have each pipeline list third-party deps as metadata; run import_or_raise at init-time; have automl exclude pipelines whose imports fail
3. Introspection: write some code which scans through the packages used by each pipeline class at init-time
4. Registry: define a singleton class which could provide a central listing of pipelines and could encapsulate some of this functionality. We've discussed this in the design doc/notes for #345

Option 1 feels best to me.
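For comparison, the metadata-based alternative (each pipeline lists its third-party deps, and automl excludes the ones it can't satisfy) might be sketched like this. Note the sketch swaps import_or_raise for importlib.util.find_spec, which checks for a top-level package without importing it; required_libs, missing_libs and available_pipelines are hypothetical names:

```python
import importlib.util


class PipelineBase:
    # Hypothetical metadata: top-level third-party packages this pipeline needs.
    required_libs = []


def missing_libs(pipeline_class):
    """Return the declared libraries that are not importable in this environment."""
    return [
        lib
        for lib in pipeline_class.required_libs
        if importlib.util.find_spec(lib) is None
    ]


def available_pipelines(pipeline_classes):
    """Keep only the pipeline classes whose declared libraries are all present."""
    return [cls for cls in pipeline_classes if not missing_libs(cls)]
```

The upside is that the user finds out before the search starts which pipelines were dropped. The cost is that every pipeline has to keep its metadata accurate.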

Questions for consideration

If we use setuptools extras, are we comfortable supporting more than one pip install target? We'll need to have test coverage for both.
What about preprocessing components which use third-party libraries? How does our strategy need to change, if at all?
When do we want checks to happen, at init-time or fit-time?

Proposal

Move third-party dependencies to extras, meaning pip install evalml won't install them but pip install evalml[complete] will.
Have all third-party pipelines (xgboost/catboost) run import_or_raise (already done)
Ensure automl skips third-party pipelines if import_or_raise fails.
Testing: update current tests to install complete version. Add an integration test which installs minimal version and checks key functionality. Run that on checkins to master.
Update documentation.

Advantages: Allows users to install evalml without third-party libraries.
Disadvantages: Default install isn't as powerful. We'd have to support two install versions. Could be ugly setting up automl to skip failed pipelines.

Long-term, it would be neat if we can find a solution which raises an error or warning earlier than fit-time.

@kmax12 do you have an opinion on this issue?

dsherry commented 4 years ago

I did a review of all the dependencies in requirements.txt. I included the size of each library's lib dir in the virtualenv on my mac, to gauge importance.

Packages which are required by evalml and would be quite difficult to make optional:

numpy (86MB): required
pandas (47MB): required
scipy (126MB): required
scikit-learn (29MB): required
scikit-optimize[plots] (572KB): required for automl tuners
category_encoders (776KB): required for one-hot encoder
cloudpickle (132KB): required for saving and loading pipelines

Packages which could potentially be removed:

joblib (1.9MB): unsure, no direct refs. I'll try removing and update.
plotly (51MB): used to generate pipeline plots, in documentation. It's possible we could update to be disabled by default.
ipywidgets (824KB): used to generate pipeline plots, in documentation. It's possible we could update to be disabled by default, in conjunction with plotly
dask[complete] (50MB): currently only referenced in a utility used in demo code. When we add distributed support, it will become required for that. We could use import_or_raise there for now though
colorama (72KB): used in logging for color definitions. We could probably remove it, but maybe not worth it since it’s under 100KB.
tqdm (316KB): powers the console output in automl search. We may be able to update it to simply be disabled by default. But it's a small package.

Conclusion: we could reduce installer size by a lot if we make more things optional.
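Size figures like the ones above could be approximated with a small script along these lines. This is a sketch of one way to measure, not how the listed numbers were produced: it sums file sizes under the package's own directory, so results will vary by platform and version, and package_size_bytes is a hypothetical helper:

```python
import importlib.util
from pathlib import Path


def package_size_bytes(name):
    """Return the on-disk size of an installed top-level package, or None if unavailable."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.submodule_search_locations:
        # Regular package: sum every file under its directory tree.
        roots = [Path(p) for p in spec.submodule_search_locations]
        return sum(
            f.stat().st_size
            for root in roots
            for f in root.rglob("*")
            if f.is_file()
        )
    # Single-file module: report the size of that file if it exists on disk.
    if spec.origin and Path(spec.origin).is_file():
        return Path(spec.origin).stat().st_size
    return None
```

For example, package_size_bytes("numpy") / 1e6 gives a rough size in MB when numpy is installed.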

Does this affect our plan for handling third-party pipelines? We could move towards supporting multiple extras:

pip install evalml: minimal only, no third-party blueprints, no plotting, no distributed support
pip install evalml[thirdparty]: include third-party blueprints
pip install evalml[plotting]: minimal, plus plotting
pip install evalml[distributed]: minimal, plus distributed support
pip install evalml[plotting,distributed,thirdparty] or pip install evalml[complete]: include everything

This would work fine. It feels overly complicated though. And it raises concerns about testing: we'd at least need an integration test for each to ensure the library works with the subset of deps.

An alternative would be to only support two targets, the minimal default and complete which includes everything else. Users could then install specific dependencies by hand if they wanted a subset of complete.

dsherry commented 4 years ago

Solution discussed with @kmax12 :

To avoid deps: pip install --no-dependencies evalml then install specific deps manually
We can even make a Makefile command for that
Update documentation to describe how to do this
Our code doesn't explode if a package is missing
By default: include xgboost/catboost. Include plotting. Exclude dask/distributed (update code to not use that)

An option to consider for the long term: build two setup.py packages, evalml-base (minimal) and evalml (everything).

alteryx / evalml

Error handling for code with third-party dependencies #315