[REFACTOR] Too many variables in `search` and `fit_pipeline`.

ravinkohli commented 2 years ago

Why do we need that many variables now? Most information is in the BaseDataset, isn't it? Let's try reducing variable numbers as much as possible, because it will be painful when you create tests and it will increase line numbers and doc-string in many places.

Clean code also says methods with 5 variables should be considered to be re-designed. Basically, we use an object to wrap those variables in such cases.

Originally posted by @nabenabe0928 in https://github.com/automl/Auto-PyTorch/pull/348#r760952438

eddiebergman commented 2 years ago

Just a point on the variables, I don't think clean code really considers default variables. I think it's okay to generally have a highly parametrized function if most of the parameters are default and never really touched.

The other exception to the clean code principle might be for front facing API's. Sometimes it's good to let a user customize as much as they want as long as it's reliable. It can be annoying when you have to learn how to create an object just to pass it as a package of parameters to another function. Especially in the partially targeted audience of novice machine learners and scientists with low-code experience.

It does however increase strain on testing. However, these front facing API's should definitely have sane defaults and not require a user to know every parameter to get sensible outputs.

Sorry for being opinionated but I wouldn't follow clean code as a bible on what to do.

nabenabe0928 commented 2 years ago

The other exception to the clean code principle might be for front facing API's

I agree with this part.

However, I do not agree with:

it's okay to generally have a highly parametrized function if most of the parameters are default

I do not know your standard of "highly parametrized function", but you see quite often methods with several tens (I can compromise if it is around 6--8) of arguments in AutoPytorch (probably AutoSKLearn as well?). I do not think this is preferable because the same documentation will be repeated everywhere and it increases lines. Just to make sure what I meant, object, of course, means class in general, but in this context, I mostly mean NamedTuple or dataclass. Also typically, such parameters are used everywhere quite often and most of them overlap, so I think there should not be too many NamedTuple or dataclass. Another advantage is to be able to make them immutable.

One drawback of this is to force users to import this class if it is placed at the user interface. In such cases, we need more discussion as you mentioned. (I prefer to have params NamedTuple or Dataclass, but as far as I know, there are some people who do not prefer this.)

eddiebergman commented 2 years ago

The Immutability is nice to prevent accidental issues. Internally, I actually don't mind how much parameterization there is as long as the documentation is clear and I only have to specify 3/4 non-optional params at most, with the other having sensible default functionality.

I think what works is clarity, even if it means a longer func definition. However if it's the same parameters usually packed together and passed around then maybe it makes sense. I think it's a case by case basis though.

And yes Auto-sklearn has quite a few of these internally but it works for now and will be touched later. We have some more pressing issues to deal with first.

ravinkohli commented 2 years ago

hey, I also agree with having a dataclass for the search and fit_pipeline parameters. Moreover, as I imagine the implementation, we can group the search parameters like the run_time_limit_secs, memory_limit, eval_metric all_supported_metrics, disable_file_output as these are all parameters that are passed to the TAE and keep the other parameters as it is. But I think more discussion is needed on this. Also, I think that we currently have a plan of TODOs which are more important than this refactor. So, I suggest we leave this issue until we discuss how to proceed.

automl / Auto-PyTorch

[REFACTOR] Too many variables in `search` and `fit_pipeline`. #353