alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

Add __str__ for Auto Search objects #481

Closed kmax12 closed 4 years ago

kmax12 commented 4 years ago

I came across this while using EvalML to solve a problem.

It would be useful to be able to get a repr of the automl object after it's been defined. Basically, printing it out would give you information about the search that is about to happen.

This can be useful because if a user doesn't specify some parameters, we set defaults. By printing the object, they can then see all those defaults, such as the search budget, the pipelines that will be searched, the data checks that will run, etc.

I started to look into this and one thing I noticed is that if a user doesn't define certain parameters our logic for setting the default doesn't occur until the user calls search. For example, handling max_time and max_pipelines. We may want to move the logic into the init method, so the user can see what would happen without having to actually run search.
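
A minimal sketch of what resolving the defaults in the constructor could look like. The class name SearchSketch and the fallback of 5 pipelines are assumptions made up for illustration; they are not taken from EvalML's code.

```python
class SearchSketch:
    """Hypothetical stand-in for an automl search object (not EvalML's API)."""

    # Assumed fallback budget, chosen only for this example.
    DEFAULT_MAX_PIPELINES = 5

    def __init__(self, max_time=None, max_pipelines=None):
        # Resolve the search budget at construction time instead of inside
        # search(), so the chosen defaults are visible before any search runs.
        if max_time is None and max_pipelines is None:
            max_pipelines = self.DEFAULT_MAX_PIPELINES
        self.max_time = max_time
        self.max_pipelines = max_pipelines


automl = SearchSketch()
print(automl.max_pipelines)  # the default is already set, no search needed
```

With this layout, anything that prints the object can report the budget that will actually be used, even if search has never been called.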

dsherry commented 4 years ago

Yeah, cool, would be great. Would it print different stuff after the search has been run?

Yes, we should improve how AutoBase parameters are set. #319 tracks this, although we need to decide what we want to try there, and it's in the icebox.

I'd also expect #272 to partially fix that, by moving more stuff out of AutoBase and into the scope of the Tuner abstraction.

kmax12 commented 4 years ago

ya, i'd want it to give as much info about the state of the search as possible (without overloading the user)

i think this is also something we could progressively enhance if we wanted to just start with something simple.

Yes, we should improve how AutoBase parameters are set. #319 tracks this, although we need to decide what we want to try there, and it's in the icebox.

I wasn't necessarily complaining about how the params are set. i think it's good that i don't have to define everything, but it'd be nice to still let me know what defaults were selected without me having to reference the documentation

dsherry commented 4 years ago

@christopherbunn mentioned in #563:

So after discussing a bit with @jeremyliweishih we agreed that the info displayed is good but it would be best suited for __str__ rather than __repr__. I will make these changes and also include a way to easily access the string through an AutoBase.describe() method.

@christopherbunn could you please post a short pseudocode snippet here which describes what you're planning? That would be helpful for me/others to understand. Thanks.

christopherbunn commented 4 years ago

Sure. Here's what I'm thinking @dsherry :

class AutoBase:
    def __str__(self):
        # Placeholder: build the string representation of the search here
        str_rep = ""
        return str_rep

    def describe(self):
        # Same output as print(automl)
        print(self)

Admittedly describe() seems redundant, given that users could just call print(automl) and get the same result, but it's in line with how we describe components. Alternatively, we could call it describe_search() so that it is in line with describe_pipeline().

dsherry commented 4 years ago

@christopherbunn ah got it. Can you include example usage, like an example of what happens when you call print(automl), repr(automl), automl.describe() etc.? Let's focus the conversation there first and make sure we all agree on what behavior we want.

christopherbunn commented 4 years ago

@dsherry: print(automl) and automl.describe() should print out the below info (or something similar):

image

Based on what the Python docs say, my understanding of repr is that it should either return a string that, when passed to eval(), returns an object instantiated with the same values, or show the class of the object and its address in angle brackets. Currently, a repr of AutoBase returns the latter (as seen here):

image

I definitely considered implementing __repr__ so that it creates the string (e.g. AutoClassificationSearch(objective='f1', multiclass=False, max_pipelines=5, <etc. for all input args>)), but I found that you would essentially have to serialize many of the parameters (for example: the pipelines in possible_pipelines and the pipeline results, the tuner with the parameters, the plots, etc.) in order to make an equivalent string. It seemed out of scope for this particular issue, especially since we just want to give the user an idea of what the search parameters are.
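
For reference, the two repr conventions described above can be seen with plain Python classes. This is a generic illustration; Plain and Point are made-up names and have nothing to do with EvalML.

```python
class Plain:
    """Defines no __repr__, so Python falls back to object.__repr__."""


class Point:
    """Defines a __repr__ whose output can be passed back to eval()."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # !r applies repr() to each value so strings would keep their quotes
        return f"Point(x={self.x!r}, y={self.y!r})"

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)


# Default form: class name and memory address inside angle brackets
print(repr(Plain()))

# eval()-able form: rebuilding from the string gives an equal object
p = Point(1, 2)
print(repr(p))
print(eval(repr(p)) == p)
```

The second form only works when every constructor argument has its own eval()-able repr, which is the serialization problem mentioned above for pipelines, tuners and plots.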

dsherry commented 4 years ago

@christopherbunn thanks, that's a great starting point. One thing I'm hearing is that perhaps we shouldn't implement AutoBase.__repr__, and just rely on the default value there, right? We could also have it do the same thing as __str__. It doesn't feel super important to me, as long as we have a good __str__.

For AutoBase.__str__: let's draw a distinction between configuration parameters and search results. Let's print out parameters first, and then search results next. That way, when we print an automl obj which hasn't been run yet, we'll get the same first lines we would with one which has been run.

Here's an example of what I mean. Pre-search:

> automl = AutoClassificationSearch('Log Loss', multiclass=False, max_pipelines=5, ...)
> print(automl)
Auto Classification Search
Parameters
    objective: Log Loss
    problem type: binary
    max pipelines: 5
    max_time: None
    ...

Post-search:

> automl.search(x, y)
> print(automl) # first part is the same as before
Auto Classification Search
Parameters
    objective: Log Loss
    problem type: binary
    max pipelines: 5
    max_time: None
    ...
Search results
<include the output of "str(self.rankings)">

What do you think of that? Do we need any post-search state beyond the rankings leaderboard?

For implementation I suggest making one template string for the parameters part and one for the learned state, substituting args into each using the str format function, and then adding the two chunks together and returning the result. That way it's easy to conditionally omit the learned state when no search has run.
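
A rough sketch of the two-template approach suggested above. AutoSearchSketch, its attribute names and the string used in place of real rankings are all assumptions for illustration, not EvalML's actual AutoBase.

```python
PARAMS_TEMPLATE = (
    "{name}\n"
    "Parameters\n"
    "    objective: {objective}\n"
    "    problem type: {problem_type}\n"
    "    max pipelines: {max_pipelines}\n"
    "    max_time: {max_time}\n"
)
RESULTS_TEMPLATE = "Search results\n{rankings}\n"


class AutoSearchSketch:
    """Hypothetical stand-in for AutoBase; attribute names are assumed."""

    def __init__(self, objective, problem_type, max_pipelines=None, max_time=None):
        self.objective = objective
        self.problem_type = problem_type
        self.max_pipelines = max_pipelines
        self.max_time = max_time
        self.rankings = None  # would be filled in by a real search

    def __str__(self):
        # The parameters chunk is always present
        text = PARAMS_TEMPLATE.format(
            name="Auto Classification Search",
            objective=self.objective,
            problem_type=self.problem_type,
            max_pipelines=self.max_pipelines,
            max_time=self.max_time,
        )
        # The learned-state chunk is appended only after a search has run
        if self.rankings is not None:
            text += RESULTS_TEMPLATE.format(rankings=str(self.rankings))
        return text


automl = AutoSearchSketch("Log Loss", "binary", max_pipelines=5)
pre = str(automl)
automl.rankings = "<rankings table>"  # stand-in for real search results
post = str(automl)
print(post)
```

Because the results chunk is only appended, the post-search output always starts with exactly the same lines as the pre-search output.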