Open mrocklin opened 6 years ago
Is this typical? cc @ogrisel @stsievert
I don't think I'd expect this imbalance, but it definitely depends on the number of times each function is called in one training iteration. It looks like we should expect an equal number of calls to `exp`, `log`, and `dot` (at least after a brief look at the source).
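For reference, the logistic log-likelihood has roughly this shape (a minimal numpy sketch of the objective, not dask-glm's actual source), which is why one evaluation should trigger one `dot`, one `exp`, and one `log1p` over the data:

```python
import numpy as np

def loglike(X, beta, y):
    """Negative log-likelihood of logistic regression (sketch)."""
    Xbeta = X.dot(beta)                  # one matvec per evaluation
    eXbeta = np.exp(Xbeta)               # one exp over n samples
    return np.sum(np.log1p(eXbeta)) - np.dot(y, Xbeta)  # one log1p
```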
The motivation for SGD is when the number of examples is much larger than the features for each example. That would be one way to reduce the time for exp/log/dot, but I don't think it's directly related to this issue (though it seems relevant). I can look into this.
I've done some brief timing with datasets of the same size as the taxicab:
To be clear it's not the number of calls, it's the time spent in each function. I'm using the Profile tab of the distributed scheduler's dashboard to measure this.
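On a single machine the same effect can be sanity-checked with a microbenchmark (illustrative only; the array sizes are made up and absolute numbers will differ from the dashboard profile):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
w = rng.standard_normal(1_000_000)

# Same number of calls, but very different per-call cost:
t_exp = timeit.timeit(lambda: np.exp(x), number=20)
t_dot = timeit.timeit(lambda: x.dot(w), number=20)
print(f"exp: {t_exp:.3f}s  dot: {t_dot:.3f}s")
```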
Thanks for the profiling info. I was concerned about the timings being equal but `exp`/`log1p` being called more often.
I can't reproduce while profiling the `basic_api.ipynb` notebook. The timings for `dot`/`exp`/`log1p` are approximately the same in their calls from `loglike` (12.25, 16.47 and 26.21 seconds respectively) on my system (early 2015 MacBook Pro, 8 GB RAM) with one worker. These timings don't change significantly if I move to 4 workers.
Fitting a linear classification model like logistic regression with 1e8 samples and only 19 input features is unusual. A linear model is likely to be underfitting in that regime.
You can probably get a similar validation accuracy by fitting on a random subsample of 1000 samples or so (gut feeling, not checked).
In practice linear models are more useful in high dimensions. For instance you could expand your 19 input features into 1771 non-linear features with polynomial interactions using `sklearn.preprocessing.PolynomialFeatures(degree=3)`.
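As a quick sanity check on the expanded dimensionality (the exact count depends on the number of base columns and on options such as `include_bias` and `interaction_only`; a full degree-3 expansion of 19 features including the bias column has C(19+3, 3) monomials):

```python
from math import comb

n_features, degree = 19, 3
# Number of monomials of total degree <= 3 in 19 variables,
# bias term included:
n_expanded = comb(n_features + degree, degree)
print(n_expanded)  # 1540
```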
Another alternative non-linear feature expansion is the Nyström method. Using `n_components` in the range [1000 - 5000] sounds reasonable to me.
Then fitting a linear model on the (1e8, 1771) or (1e8, 5000) shaped transformed data makes more sense (the resulting non-linear model is more expressive, both the training and validation error should go down) and the BLAS matrix-vector product is likely to dominate the transcendental scalar function.
Note: before applying the non-linear feature expansion (nystroem or polynomial), it's a good idea to normalize the marginal distribution of the original features (StandardScaler, RobustScaler, QuantileTransformer).
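The Nyström trick itself is small enough to sketch in numpy (a toy illustration of the feature map behind `sklearn.kernel_approximation.Nystroem`; the RBF kernel and the landmark choice here are assumptions for the demo):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed pairwise
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystroem_features(X, Z, gamma=0.5):
    # Feature map phi(X) = k(X, Z) @ K_zz^{-1/2}, so that
    # phi(X) @ phi(X).T approximates the full kernel matrix.
    K_zz = rbf(Z, Z, gamma)
    vals, vecs = np.linalg.eigh(K_zz)        # symmetric inverse sqrt
    vals = np.clip(vals, 1e-12, None)
    K_inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T
    return rbf(X, Z, gamma) @ K_inv_sqrt

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 19))
phi = nystroem_features(X, X[:50])   # 50 landmarks -> 50-dim features
```

A linear model fit on `phi` then behaves like a kernel method at a fraction of the cost.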
> The motivation for SGD is when the number of examples is much larger than the features for each example. ... I can look into this.
My observation, expressed more clearly: this is a natural use case for SGD. SGD's convergence time doesn't depend on the number of examples n, only on the feature dimension d. For the taxicab dataset, `n, d = 1.2e9, 19`, so SGD (or a variant) could give a huge speed-up.
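A toy illustration of that claim on synthetic data (the learning-rate schedule and sizes are made up; the point is that the loss plateaus after far fewer updates than there are samples):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 19
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = (X @ w_true + 0.1 * rng.standard_normal(n) > 0).astype(float)

def logistic_loss(w):
    z = X @ w
    # mean of log(1 + e^z) - y*z, written in a numerically stable form
    return np.mean(np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0) - y * z)

w = np.zeros(d)
for i in range(20_000):                   # far fewer steps than n
    j = rng.integers(n)                   # one random sample per step
    p = 1.0 / (1.0 + np.exp(-X[j] @ w))   # sigmoid
    w -= 0.5 / np.sqrt(i + 1) * (p - y[j]) * X[j]
```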
> the resulting non-linear model is more expressive, both the training and validation error should go down
One downside to using this is that model interpretation becomes more difficult. This matters in determining how strongly any one feature influences the prediction.
I agree that SGD and its variants are best in the large-sample regime.
However I am pretty sure that using a linear model with 19 parameters to fit 1.2e9 samples will underfit:
- either the target function depends non-linearly on the 19 input descriptors, and using non-linearly expanded features or a non-linear model (GBRT, neural network...) will give better predictions,
- or the target function is simple, and fitting a linear model on a random subset of 1e5 samples will be enough and much faster (and will avoid wasting computation for nothing). Basically SGD will converge way before the first pass over the dataset :)
Interpreting the parameters of a severely underfitting model is probably misleading ;)
Also interpreting polynomial features is fine (if the degree is 2 or 3). You can threshold the small weights of a first linear model trained on the full polynomial feature expansion to keep the top 100 or so and refit only on those and interpret them.
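A hypothetical sketch of that threshold-and-refit recipe on synthetic data (plain least squares stands in for the linear model; the sizes and the top-10 cutoff are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5_000, 200, 10            # only k of the d features matter
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:k] = rng.uniform(1.0, 2.0, size=k)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# 1) fit on the full feature expansion
w_full, *_ = np.linalg.lstsq(X, y, rcond=None)
# 2) keep only the features with the largest absolute weights
top = np.argsort(np.abs(w_full))[-k:]
# 3) refit on the reduced design and interpret those weights
w_refit, *_ = np.linalg.lstsq(X[:, top], y, rcond=None)
```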
Something like 200s in comparison to the 13s spent in `.dot` / matvec. This is from looking at the profiler after running through the dask-glm example notebook. Is it normal to spend so much time in scalar functions? Are there approximations that make sense to do?
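One micro-level option along those lines: the `log1p(exp(·))` pair in the logistic objective can be fused into a single numerically stable call (standard numpy, not a dask-glm-specific API):

```python
import numpy as np

t = np.linspace(-30.0, 30.0, 7)

naive  = np.log1p(np.exp(t))      # overflows once t is large (e^t -> inf)
stable = np.logaddexp(0.0, t)     # log(1 + e^t) in one stable call

print(np.logaddexp(0.0, 800.0))   # stays finite where the naive form is inf
```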
This is low priority from my perspective (there is plenty more to do) but I thought I'd record it.