dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

LogisticRegression spends a lot of time in exp and log1p #146

Open · mrocklin opened this issue 6 years ago

mrocklin commented 6 years ago

Something like 200s, compared to the 13s spent in .dot / matvec. This is from looking at the profiler after running through the dask-glm example notebook.

Is it normal to spend so much time in scalar functions? Are there approximations that make sense to do?

This is low priority from my perspective (there is plenty more to do) but I thought I'd record it.
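
For context, the scalar functions in question come from the logistic log-likelihood that is evaluated at every iteration. A rough sketch of that computation (a paraphrase, not the exact dask-glm source) shows where exp/log1p sit relative to the matvec:

```python
import numpy as np

def loglike(Xbeta, y):
    """Sketch of the logistic negative log-likelihood reduction.

    Xbeta is the result of the X.dot(beta) matvec (the BLAS call in the
    profile); np.exp and np.log1p are the element-wise transcendental
    functions this issue is about.
    """
    eXbeta = np.exp(Xbeta)
    return np.sum(np.log1p(eXbeta)) - y.dot(Xbeta)
```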

mrocklin commented 6 years ago

Is this typical? cc @ogrisel @stsievert

stsievert commented 6 years ago

I don't think I'd expect this imbalance, but it definitely depends on the number of times each function is called in one training iteration. It looks like we should expect an equal number of calls to exp, log and dot (at least after a brief look at the source).

The motivation for SGD is the case where the number of examples is much larger than the number of features per example. That would be one way to reduce the time spent in exp/log/dot, but I don't think it's directly related to this issue (though it seems relevant). I can look into this.

I've done some brief timing with datasets of the same size as the taxicab dataset:

[screenshot: timing results, 2018-03-07]
mrocklin commented 6 years ago

To be clear it's not the number of calls, it's the time spent in each function. I'm using the Profile tab of the distributed scheduler's dashboard to measure this.

stsievert commented 6 years ago

Thanks for the profiling info. I was concerned about the case where the per-call timings are equal but exp/log1p are called more often.

I can't reproduce this while profiling the basic_api.ipynb notebook. The timings for dot/exp/log1p are approximately the same in their calls from loglike (12.25, 16.47, and 26.21 seconds respectively) on my system (early 2015 MacBook Pro, 8 GB RAM) with one worker. These timings don't change significantly if I move to 4 workers.

ogrisel commented 6 years ago

Fitting a linear classification model like logistic regression with 1e8 samples and only 19 input features is unusual. A linear model is likely to be underfitting in that regime.

You can probably get a similar validation accuracy by fitting on a random subsample of 1000 samples or so (gut feeling, not checked).
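
A minimal sketch of that subsampling idea, with a synthetic dask DataFrame standing in for the real taxicab data (all names and sizes below are illustrative):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from sklearn.linear_model import LogisticRegression

# Stand-in for the real data: a dask DataFrame with 19 features.
pdf = pd.DataFrame(np.random.randn(100_000, 19),
                   columns=[f"x{i}" for i in range(19)])
pdf["label"] = (pdf["x0"] + 0.5 * np.random.randn(100_000) > 0).astype(int)
df = dd.from_pandas(pdf, npartitions=8)

# Fit on ~1000 randomly sampled rows instead of the full dataset.
sample = df.sample(frac=1000 / len(df)).compute()
model = LogisticRegression(solver="lbfgs")
model.fit(sample.drop(columns="label"), sample["label"])
```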

In practice linear models are more useful in high dimensions. For instance, you could expand your 19 input features into 1771 non-linear features with polynomial interactions using sklearn.preprocessing.PolynomialFeatures(degree=3).
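
A quick sketch of that expansion with scikit-learn's PolynomialFeatures (the array sizes here are placeholders):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.randn(1_000, 19)          # 19 original input features

# All monomials and interaction terms up to degree 3.
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
print(X_poly.shape[1], "expanded features")
```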

Another alternative non-linear feature expansion is the Nyström method (sklearn.kernel_approximation.Nystroem). Using n_components in the range 1000-5000 sounds reasonable to me.
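
And a hedged sketch of the Nyström alternative, using scikit-learn's Nystroem transformer with an n_components value from the suggested range:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem

X = np.random.randn(10_000, 19)

# Approximate an RBF kernel feature map with a few thousand components.
nystroem = Nystroem(kernel="rbf", n_components=2_000, random_state=0)
X_nys = nystroem.fit_transform(X)
print(X_nys.shape)                      # (10000, 2000)
```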

Then fitting a linear model on the (1e8, 1771) or (1e8, 5000) shaped transformed data makes more sense (the resulting non-linear model is more expressive, so both the training and validation error should go down), and the BLAS matrix-vector product is likely to dominate the transcendental scalar functions.

Note: before applying the non-linear feature expansion (Nyström or polynomial), it's a good idea to normalize the marginal distributions of the original features (StandardScaler, RobustScaler, QuantileTransformer).
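
Putting these pieces together, a sketch of the suggested ordering (scale first, then expand, then fit the linear model); the particular scaler, expansion, and solver choices are illustrative, and PolynomialFeatures could be swapped in for Nystroem:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression

X = np.random.randn(10_000, 19)
y = (np.sin(X[:, 0]) + 0.1 * np.random.randn(10_000) > 0).astype(int)

pipe = make_pipeline(
    StandardScaler(),                              # normalize marginals first
    Nystroem(n_components=1_000, random_state=0),  # non-linear expansion
    LogisticRegression(solver="lbfgs", max_iter=1_000),
)
pipe.fit(X, y)
```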

stsievert commented 6 years ago

> The motivation for SGD is the case where the number of examples is much larger than the number of features per example. ... I can look into this.

My observation expressed more clearly: SGD would be useful here; this is a perfect use case for SGD. SGD's convergence time doesn't depend on the number of examples n, only on the dimension d (the number of features). For the taxicab dataset, n, d = 1.2e9, 19, so SGD (or a variant) could give a huge speedup.
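
A minimal sketch of that idea with scikit-learn's SGDClassifier, streaming over chunks so the full n never has to be in memory at once (the chunk generator and hyperparameters below are placeholders):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

d = 19
# loss="log_loss" gives logistic regression (spelled "log" in older
# scikit-learn releases contemporary with this issue).
clf = SGDClassifier(loss="log_loss", alpha=1e-6)

def chunks(n_chunks=100, chunk_size=10_000):
    """Stand-in generator yielding blocks of the real dataset."""
    for _ in range(n_chunks):
        X = np.random.randn(chunk_size, d)
        y = (X[:, 0] > 0).astype(int)
        yield X, y

for X_chunk, y_chunk in chunks():
    # One SGD pass per chunk; the cost of each update scales with d,
    # not with the total number of examples n.
    clf.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))
```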

> the resulting non-linear model is more expressive, so both the training and validation error should go down

One downside of these feature expansions is that model interpretation becomes more difficult, which matters when determining how strongly any one feature influences the prediction.

ogrisel commented 6 years ago

I agree that SGD and its variants are best in the regime where the number of samples is large.

However, I am pretty sure that using a linear model with 19 parameters to fit 1.2e9 samples will underfit.

ogrisel commented 6 years ago

Interpreting the parameters of a severely underfitting model is probably misleading ;)

Also, interpreting polynomial features is fine (if the degree is 2 or 3). You can threshold the small weights of a first linear model trained on the full polynomial feature expansion, keep the top 100 or so, refit on only those, and interpret them.
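
A hedged sketch of that threshold-and-refit idea (synthetic data, degree 2, and a top-100 cutoff as placeholders):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

X = np.random.randn(5_000, 19)
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # a degree-2 interaction signal

# First fit on the full polynomial expansion.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
full = LogisticRegression(max_iter=1_000).fit(X_poly, y)

# Keep the ~100 columns with the largest absolute weights and refit.
top = np.argsort(np.abs(full.coef_.ravel()))[-100:]
small = LogisticRegression(max_iter=1_000).fit(X_poly[:, top], y)

# The surviving columns map back to interpretable feature names
# (get_feature_names_out requires scikit-learn >= 1.0).
names = poly.get_feature_names_out()[top]
print(sorted(zip(np.abs(small.coef_.ravel()), names), reverse=True)[:5])
```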