Benjamin-Lee / deep-rules

Ten Quick Tips for Deep Learning in Biology
https://benjamin-lee.github.io/deep-rules/

Use a simple linear and non-linear model as a baseline for measuring progress. #41

Open rasbt opened 5 years ago

rasbt commented 5 years ago

Have you checked the list of proposed rules to see if the rule has already been proposed?

Use a simple, linear model, e.g., (multinomial) logistic regression or (multiple) linear regression, as a performance baseline. This may be expanded to also include off-the-shelf, easy-to-use ensemble methods like random forests that are relatively robust and don't require much tuning to work well out of the box.

There are many situations where an ML/DL expert can say, without much doubt, that throwing DL at the problem (given the size of the dataset and the nature of the task) doesn't make sense. However, let's face it: DL is popular, and people want to and will use it even when it is not the best choice for the situation. I think a good recommendation is to start with a linear model as a baseline and compare the DL efforts against it.

But on the more positive side, using a simple model as a baseline is also useful in situations where using DL does make sense.
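
As an illustration, such a baseline only takes a few lines of scikit-learn. The snippet below is a rough sketch of the idea, using the built-in digits dataset as a stand-in for a real biological dataset:

```python
# A minimal baseline sketch; the dataset is only a stand-in for illustration.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Simple linear baseline: multinomial logistic regression
linear_baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
linear_baseline.fit(X_train, y_train)
print("logistic regression accuracy:", linear_baseline.score(X_test, y_test))

# Off-the-shelf non-linear baseline: random forest with default settings
forest_baseline = RandomForestClassifier(random_state=0)
forest_baseline.fit(X_train, y_train)
print("random forest accuracy:", forest_baseline.score(X_test, y_test))
```

Any DL model developed afterwards can then be compared against these reference numbers.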

Any citations for the rule? (peer-reviewed literature preferred but not required)

jmschrei commented 5 years ago

Seems similar to #10 and #11, though a little more specific.

rasbt commented 5 years ago

I see. From the titles, it wasn't clear that baseline models refer to "traditional" machine learning/statistical models. I would go a step further, though, and exclude SVMs etc. from the baseline models and really focus only on the "simplest" models; otherwise, every study would turn into a large-scale comparison among all possible models.

evancofer commented 5 years ago

I would actually suggest trying out most of the "traditional methods" that are already implemented in scikit-learn or similar libraries. If you've tried out one of the scikit-learn models, it's not a big task to make the code slightly more abstract and try out several (or all) of the applicable scikit-learn models. Establishing these baselines is generally not time consuming.
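
For example, a small loop over a handful of off-the-shelf scikit-learn models with default settings could look roughly like this (the estimator choices and dataset are only illustrative):

```python
# Rough sketch: cross-validated scores for several default scikit-learn models.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_digits(return_X_y=True)

baselines = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in baselines.items():
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```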

rasbt commented 5 years ago

Sure, I agree. If that's feasible, I'd say the more the merrier :). However, this reminds me that I just ran nested CV for algorithm comparison on 4 algorithms with a very small parameter grid on a small (10%) MNIST subset for a class example, and it already took up to an hour (https://github.com/rasbt/stat479-machine-learning-fs18/blob/master/11_eval-algo/11_eval-algo_code.ipynb) :P
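
For reference, a minimal nested-CV setup along those lines might look like the sketch below; this is not the notebook's actual code, and the candidate models and parameter grids are just placeholders:

```python
# Nested cross-validation sketch: inner loop tunes hyperparameters,
# outer loop estimates generalization performance per algorithm.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# Candidate models with small, illustrative hyperparameter grids.
candidates = {
    "logistic regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "random forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [100, 500]},
    ),
}

for name, (estimator, grid) in candidates.items():
    inner_cv = GridSearchCV(estimator, grid, cv=2)
    outer_scores = cross_val_score(inner_cv, X, y, cv=5)
    print(f"{name}: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```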

I like your suggestion, though; maybe we just want to add that this should be done with default parameters or so, otherwise it might turn into a large-scale benchmarking endeavor, and algorithm comparison is tricky (regarding the typical multiple-hypothesis-testing issues).

evancofer commented 5 years ago

An hour is sort of inconsequential, since using DL (esp after hyperparameter optimization) for those same tasks will take days. Maybe we should suggest that if the user isn't willing to look at many other standard algorithms, they shouldn't consider DL approaches.

jmschrei commented 5 years ago

I think that it's important to try out a few good baselines, such as a well tuned linear model and perhaps gradient boosting, but I really don't think it's necessary to go much further than that. Certainly it may be worth investigating the real benefit of a deep network if the performance improvement is minor, but I don't agree that a person needs to run most of scikit-learn before using a deep model.

rasbt commented 5 years ago

> An hour is sort of inconsequential, since using DL (esp after hyperparameter optimization) for those same tasks will take days. Maybe we should suggest that if the user isn't willing to look at many other standard algorithms, they shouldn't consider DL approaches.

Generally, I agree. But say there are 50 models or so in sklearn? Then with a realistic hyperparameter grid, you would end up with, I dunno, 300-500 hours just on this tiny 28×28 MNIST dataset. Given that typical datasets for DL are much larger (think of medical imaging) and that algorithms don't scale linearly with the number of features, you can easily end up with 10k+ hours just for benchmarking all of these. Also, the goal is not only to find the best algorithm but to get a feeling for the problem before using DL. If you get 90% accuracy on a balanced dataset with traditional ML, it's still not a reason to give up on DL.

I.e., you can get 93% accuracy with multinomial logistic regression on MNIST, but that doesn't mean it's the desired end goal. Having this as a reference value (assuming MNIST were a new dataset and we didn't know much about it) is very valuable when investigating other methods. Also, I wouldn't say that, e.g., a kernel SVM with 95% on a dataset is more desirable than a ConvNet with the same performance on the same dataset; depending on the size of the dataset, the ConvNet may actually be much cheaper to evaluate and be preferred for engineering reasons. In other words, while traditional methods are often preferable, they shouldn't always be preferred simply as a reaction to DL being overhyped.

> I think that it's important to try out a few good baselines, such as a well tuned linear model and perhaps gradient boosting,

Yeah, essentially picking representative ones.

fmaguire commented 5 years ago

@rasbt Do you still want to discuss linear/non-linear models in Tip 2?