alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

Error handling for code with third-party dependencies #315

Closed: dsherry closed this issue 4 years ago

dsherry commented 4 years ago

We currently have at least one third-party library dependency (xgboost) and another on the way (catboost #247). This issue tracks figuring out how we present that to users.

Questions:

My current thought is that the code @angela97lin is adding in the catboost PR is a good start, i.e. we have each estimator or component throw an error when the underlying library is missing. But I think we can do more than that too. We could catch that error in automl, skip the pipeline in question, perhaps print a warning, and continue the search. We could also have users install third-party deps by default by including them in requirements.txt, which should cover the majority of cases.

Adding related comments made in #247 to keep things in one place:

dsherry commented 4 years ago

Note @jeremyliweishih: we should decide whether this sort of thing is in scope for the pipeline project or whether we want to punt on it.

jeremyliweishih commented 4 years ago

I think it should be its own one-off issue and not part of the requirements for the pipeline project, as it doesn't tie in much with anything else in the project. We can always mark the pipeline project as a blocker for a long-term fix and ship a simple patch (include all third-party deps so far) first.

dsherry commented 4 years ago

That sounds good to me. In that case I think we should treat this as blocked on the pipeline project, because the solution could change depending on what we change about pipelines.

dsherry commented 4 years ago

@kmax12 :

My instinct here is that we should complete the phase 1 pipeline project @jeremyliweishih is working on, then circle back to this issue and build a sustainable way to make third-party dependencies optional.

And until then, we may continue to merge things like #247 (which add new third-party deps to requirements.txt but also add rudimentary error handling as a fallback), with the understanding we'll circle back and make them optional when we address this issue.

Do you have an opinion on that plan?

dsherry commented 4 years ago

@kmax12 said in meeting today:

angela97lin commented 4 years ago

Adding additional notes from comments made in #247 :

@dsherry mentioned cb_error_msg (error message raised when catboost not installed) and other similar messages could be an attribute of Estimator. For example, we could do the following:

I think it could even be worth having this at the ComponentBase level, so that any component requiring a third-party library could use this framework.
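As a rough illustration of the ComponentBase-level idea, here is a minimal sketch. Everything in it is hypothetical: the attribute names `third_party_module` and `missing_dependency_msg`, the helper `_import_dependency`, and the stand-in `ComponentBase` and `CatBoostClassifier` classes are invented for the example and are not EvalML's actual classes or the snippet from #247.

```python
import importlib


class ComponentBase:
    """Hypothetical stand-in for a component base class (not EvalML's real one)."""

    # Name of the third-party module this component needs, or None if it needs none.
    third_party_module = None
    # Message to show when that module cannot be imported.
    missing_dependency_msg = None

    def _import_dependency(self):
        """Return the third-party module, or raise ImportError with the component's message."""
        if self.third_party_module is None:
            return None
        try:
            return importlib.import_module(self.third_party_module)
        except ImportError as err:
            msg = self.missing_dependency_msg or (
                "Missing optional dependency '{}'.".format(self.third_party_module)
            )
            raise ImportError(msg) from err


class CatBoostClassifier(ComponentBase):
    """Hypothetical estimator that declares its dependency and error message as attributes."""

    third_party_module = "catboost"
    missing_dependency_msg = (
        "catboost is not installed. Please install using `pip install catboost`."
    )

    def fit(self, X, y):
        catboost = self._import_dependency()  # raises ImportError if catboost is absent
        self._model = catboost.CatBoostClassifier().fit(X, y)
        return self
```

With this shape, each new third-party component only declares two class attributes, and the import logic lives in one place.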

dsherry commented 4 years ago

I'll try to knock this out this week. Next step is to list out reqs and desired behavior, and decide on an API.

dsherry commented 4 years ago

@rwedge and I discussed this yesterday.

Background on configuring installation via setuptools

Setuptools supports "extras", which we could use to offer something like pip install evalml[complete], configured to install extra packages on top of those installed by pip install evalml.

Note it appears this mechanism supports installing additional packages, and potentially updating the versions of previously installed packages, but not uninstalling packages. That's fine.

Installation options

Here are some ways we could use pip extras to rearrange our required dependencies to optionally exclude third-party libs:

  1. evalml: only install sklearn. evalml[complete]: also include third-party (xgboost/catboost)
  2. evalml: do nothing / error. evalml[minimal]: only install sklearn. evalml[complete]: also include third-party.
  3. There may be a way to rig things up such that using evalml includes third-party but evalml[minimal] does not, but unsure.

Option 1 feels the most appealing. I'm unaware of any good options for handling this stuff beyond sticking with our current setuptools setup.
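For reference, option 1 could be sketched roughly as below. This is an assumption-laden fragment, not the project's actual setup.py: the constant names, the `main` wrapper, and the unpinned package lists are illustrative. Only `install_requires` and `extras_require` are real setuptools keywords.

```python
# Hypothetical setup.py fragment for option 1 (names and package lists are illustrative).

# Installed by a plain `pip install evalml`.
CORE_REQUIRES = ["scikit-learn"]

# Installed additionally by `pip install "evalml[complete]"`.
THIRD_PARTY_REQUIRES = ["xgboost", "catboost"]

SETUP_KWARGS = dict(
    name="evalml",
    install_requires=CORE_REQUIRES,
    extras_require={"complete": THIRD_PARTY_REQUIRES},
)


def main():
    """Invoke setuptools; a real setup.py would call this at module level."""
    from setuptools import setup

    setup(**SETUP_KWARGS)
```

Users who want only a subset of the extras could still install individual packages by hand, e.g. pip install evalml xgboost.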

Code support options

Not necessarily mutually exclusive:

  1. Run import_or_raise at fit-time; have automl skip on failure
  2. Have each pipeline list third-party deps as metadata; run import_or_raise at init-time; have automl exclude pipelines whose imports fail
  3. Introspection: write some code which scans through the packages used by each pipeline class at init-time
  4. Registry: define a singleton class which could provide a central listing of pipelines and could encapsulate some of this functionality. We've discussed this in the design doc/notes for #345

Option 1 feels best to me.
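A minimal sketch of option 1, under stated assumptions: the signature of `import_or_raise` shown here is a guess for illustration (this thread only gives the name), and `search` is a hypothetical stand-in for the automl loop, not the real implementation.

```python
import importlib
import warnings


def import_or_raise(library, error_msg=None):
    """Import `library` by name; raise ImportError with a helpful message if it is missing.

    Assumed signature, for illustration only.
    """
    try:
        return importlib.import_module(library)
    except ImportError as err:
        msg = error_msg or "Missing optional dependency '{}'.".format(library)
        raise ImportError(msg) from err


def search(pipelines, X, y):
    """Fit each pipeline in turn, warning about and skipping any whose import fails."""
    fitted, skipped = [], []
    for pipeline in pipelines:
        try:
            # Each pipeline is expected to call import_or_raise inside fit().
            pipeline.fit(X, y)
        except ImportError as err:
            warnings.warn("Skipping {}: {}".format(type(pipeline).__name__, err))
            skipped.append(pipeline)
            continue
        fitted.append(pipeline)
    return fitted, skipped
```

Catching only ImportError keeps genuine fit failures visible; whether those should also be skipped is a separate decision.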

Questions for consideration

Proposal

Advantages: Allows users to install evalml without third-party libraries.

Disadvantages: Default install isn't as powerful. We'd have to support two install versions. Could be ugly setting up automl to skip failed pipelines.

Long-term, it would be neat if we can find a solution which raises an error or warning earlier than fit-time.
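One possible direction for an earlier check, sketched under assumptions: each pipeline class would declare its third-party modules in a hypothetical `third_party_deps` attribute, and automl would filter on that before the search starts. `importlib.util.find_spec` is a real standard-library call that locates a top-level module without importing it; the function names and the attribute are invented here.

```python
import importlib.util


def missing_dependencies(required_modules):
    """Return the top-level module names that cannot be found, without importing them."""
    return [
        name for name in required_modules
        if importlib.util.find_spec(name) is None
    ]


def usable_pipelines(pipeline_classes):
    """Split pipeline classes into usable ones and a map of excluded name -> missing deps.

    `third_party_deps` is a hypothetical class attribute; classes without it
    are treated as having no third-party requirements.
    """
    usable, excluded = [], {}
    for cls in pipeline_classes:
        missing = missing_dependencies(getattr(cls, "third_party_deps", ()))
        if missing:
            excluded[cls.__name__] = missing
        else:
            usable.append(cls)
    return usable, excluded
```

This only confirms that a module can be located, not that it imports cleanly, so a fit-time fallback would likely still be needed.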

@kmax12 do you have an opinion on this issue?

dsherry commented 4 years ago

I reviewed all the dependencies in requirements.txt. I included the size of each library's lib dir in the virtualenv on my Mac, as a rough gauge of how much each one matters for install size.

Packages which are required by evalml and would be quite difficult to make optional:

Packages which could potentially be removed:

Conclusion: we could reduce installer size by a lot if we make more things optional.

Does this affect our plan for handling third-party pipelines? We could move towards supporting multiple extras:

This would work fine. It feels overly complicated though. And it raises concerns about testing: we'd at least need an integration test for each to ensure the library works with the subset of deps.

An alternative would be to only support two targets, the minimal default and complete which includes everything else. Users could then install specific dependencies by hand if they wanted a subset of complete.

dsherry commented 4 years ago

Solution discussed with @kmax12 :

An option to consider for the long term: build two packages, each with its own setup.py: evalml-base (minimal) and evalml (everything).