dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Case Study: Criteo dataset #295

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

The Criteo dataset is a 1TB dump of features around advertisements and whether or not someone clicked on the ad. It has both dense and categorical/sparse data. I believe that the data is freely available on Azure.

There are some things that we might want to do with this dataset that are representative of other problems:

  1. Logistic regression on large sparse data. This could use existing algorithms like L-BFGS or ADMM, or it could use the more recent Incremental SGD work. It would be useful to compare the effectiveness of the algorithms above.
  2. We could also add hyperparameter optimization.
  3. Gradient boosted trees, presumably with the dask-xgboost connection. This raises a couple of questions. Can XGBoost support categorical data or scipy.sparse arrays? Or perhaps we have to provide a column of integers?

As always, it might be a good start to just download a little bit of the Criteo dataset (I think that each day of data is available separately) and work with sklearn directly to establish a baseline.

This came out of conversation with @ogrisel

stsievert commented 6 years ago

The dataset is available on Azure (source). This dataset has 13 integer features and 26 categorical features (the categorical features are hashed for anonymization) and is stored as 24 individual files (one for each day) (source).

I would rather use some SGD implementation than L-BFGS or ADMM because IIRC those algorithms use the entire dataset in every optimization step.
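For reference, a minimal sketch (not from this thread) of what the incremental route could look like, using dask-ml's Incremental wrapper around scikit-learn's SGDClassifier; X and y stand in for dask collections built from the Criteo files and the hyperparameters are placeholders:

from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

# Incremental calls the estimator's partial_fit on one block at a time, so
# only a single chunk of the 1TB dataset needs to be in memory at once.
sgd = SGDClassifier(loss='log', penalty='l2', tol=1e-3)
clf = Incremental(sgd, scoring='neg_log_loss')

# classes must be passed up front because partial_fit needs them before it
# has seen all of the data.
# clf.fit(X, y, classes=[0, 1])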

mrocklin commented 6 years ago

I suspect that for initial use we'll want to play with a single day. I played with this once and it looks like I kept around a conversion script to parquet:

import dask.dataframe as dd
from dask.distributed import Client
client = Client()

# The raw day files are tab-separated: the click label, 13 numeric features,
# then 26 hashed categorical features.
categories = ['category_%d' % i for i in range(26)]
columns = ['click'] + ['numeric_%d' % i for i in range(13)] + categories

df = dd.read_csv('day_0', sep='\t', names=columns, header=None)

# object_encoding and fixed_text are fastparquet-specific options: store the
# hashed categorical values as fixed-width (8-character) byte strings.
encoding = {c: 'bytes' for c in categories}
fixed = {c: 8 for c in categories}
df.to_parquet('day-0-bytes.parquet', object_encoding=encoding,
              fixed_text=fixed, compression='SNAPPY')
mrocklin commented 6 years ago

Although even a single day was a bit of a pain to work with on a single machine if memory serves. It'll be fun to eventually scale up when we're ready.

Presumably we'll start with whatever seems easiest to get decent results from (as you say, SGD), but we'll eventually want to try a few things. I imagine that some of the preprocessing steps will be the same.

I also recall that when I tried things I got really poor scores. Someone told me this was because clicks were rare and I wasn't weighting them properly.

I suspect that this will force a few topics:

  1. Preprocessing
  2. Conversion to sparse matrices (https://github.com/dask/dask/pull/3738 might help)
  3. Some incremental learning with SGD as you suggest
  4. Maybe early stopping?
  5. Maybe hyper-parameter optimization?

This is also the part of the process (building a pipeline and model) about which I know very little. I'm looking forward to seeing what you end up doing. My guess is that it will be easier to iterate with a small amount of data, probably by just picking off a small bit of the parquet data with pandas and using scikit-learn natively.
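A rough sketch of that small-data starting point, assuming the parquet file and column names from the conversion script above (it sticks to the numeric columns just to get a first baseline number):

import dask.dataframe as dd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Pull a single partition of day 0 into pandas and work in memory.
sample = dd.read_parquet('day-0-bytes.parquet').get_partition(0).compute()

numeric = [c for c in sample.columns if c.startswith('numeric_')]
X = sample[numeric].fillna(0)
y = sample['click']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))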

stsievert commented 6 years ago

And they're on Kaggle too: https://www.kaggle.com/c/criteo-display-ad-challenge. This will allow good evaluation of our model, with a predefined scoring function (cross-entropy loss). Top scores on Kaggle are around 0.44 (lower is better).

probably by just picking off a small bit of the parquet data with pandas and using scikit-learn natively.

That's my plan for today. I'll try to develop a preprocessing pipeline and a reasonable model and test it on a subset of the data.

stsievert commented 6 years ago

Here’s a gist of a simple pipeline I put together: https://gist.github.com/stsievert/30702575de95328f199ab1d7e50795ef

Notes:

TomAugspurger commented 6 years ago

Thanks for this. I'm playing around with it locally now.

Thinking about scaling this to larger datasets:

  1. dd.get_dummies doesn't support sparse. Opened https://github.com/pandas-dev/pandas/issues/21993 for that, will fix it later today (have to work around a pandas bug)
  2. Did you look at dask_ml.preprocessing.Categorizer and dask_ml.preprocessing.DummyEncoder? With data this sparse, I think it'll be important to use (pandas) categoricals so that the transformer doing the one-hot / dummy encoding has a consistent shape on the train and test sets.
  3. I wonder if any of http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html are easily scalable to larger datasets (cc. @glemaitre if he has any quick thoughts here)
  4. You might want to have your LogisticProbs.score return the negative log loss. I think as currently written, if you tried to do hyperparameter optimization it would go in the wrong direction (a rough sketch follows below).

I'll look into the sparse dask get_dummies now, and check back later.
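As a rough illustration of point 4 above (a hypothetical stand-in for the estimator in the gist, not the actual code there):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

class LogisticProbs(LogisticRegression):
    def score(self, X, y, sample_weight=None):
        # Return the *negative* log loss so that higher is better, which is
        # what hyperparameter searches assume when they maximize score().
        return -log_loss(y, self.predict_proba(X), sample_weight=sample_weight)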

TomAugspurger commented 6 years ago

I do some preprocessing to avoid outliers. I should probably make it less complex.

We have QuantileTransformer. Does that suffice? Hmm, perhaps not, as that only works with dask arrays of known shape, which we may not have at that point...

glemaitre commented 6 years ago

I wonder if any of http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html are easily scalable to larger datasets (cc. @glemaitre if he has any quick thoughts here)

If you have a nearest neighbors algorithm which is distributed, then it should be possible to make a distributed SMOTE. The under-sampling methods also rely on NN, so it would most probably be possible to scale them as well.

TomAugspurger commented 6 years ago

Thanks. We don't have a nearest neighbors algorithm yet, but it's a TODO (also blocking HDBSCAN).


@stsievert regarding pd.get_dummies, you may want to just avoid sparse=True. Turns out it's slow (https://github.com/pandas-dev/pandas/pull/21997), and I'm not entirely sure how much memory it's actually saving... profiling now.

mrocklin commented 6 years ago

@TomAugspurger is there another approach that you would suggest? Constructing a scipy.sparse matrix manually? Or are there Scikit-Learn transformers to help with this?


TomAugspurger commented 6 years ago

For the case study, I recommend we go down all paths simultaneously, to find things that are broken :)

  1. Try with dd.get_dummies(sparse=False) and just pay the memory cost. You'll convert to dense anyway, since we have a mixture of sparse and dense (though you do a bit of filtering before that point, so this will have higher memory usage, I think).
  2. Implement dd.get_dummies(sparse=True) (coming soon) and see where things go
  3. Try OrdinalEncoder on the categorical features (should work today with scikit-learn & dask-ml), though this likely won't perform as well with a linear model
  4. Use scikit-learn dev & the new OneHotEncoder (dask version coming later today)
mrocklin commented 6 years ago

If memory serves, there are hundreds of thousands of categories. Dense may be tricky?


TomAugspurger commented 6 years ago

Ah, there may be many categories not observed in the sample. In that case, we'll go down the sparse side of things.

jakirkham commented 6 years ago

Regarding nearest neighbors, there is some brute force stuff in dask-distance, which might be useful. It probably could benefit from tree approaches as well if people are interested in that problem.

stsievert commented 6 years ago

I do some preprocessing to avoid outliers. I should probably make it less complex.

We have QuantileTransformer. Does that suffice?

I'm using sklearn's QuantileTransformer. We are passing fit_intercept=True to LogisticRegression, so I think I'm okay with it.

a nearest neighbors algorithm which is distributed

There's some work with UMAP (cc @lmcinnes) to integrate Dask with approximate nearest neighbors algorithms: https://github.com/lmcinnes/umap/issues/62, which discusses modifying https://github.com/lmcinnes/pynndescent.

Did you look at dask_ml....

My goal so far has been to get a working pipeline with small data. Some bugs were in the way and I wanted to make sure I had a working pandas/sklearn implementation first.

TomAugspurger commented 6 years ago

Here's an update to @stsievert's notebook that uses a different pre-processing scheme. I haven't played with the actual estimator at all yet (and don't have short-term plans to).

https://gist.github.com/bdb99a9d1e226cc2130016b9d9d1bad4

That uses a few new features.

For high-cardinality categorical columns, we just integer code the values. For lower-cardinality categorical columns, we one-hot encode them. For numeric columns, we just fill the missing values.
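A rough sketch of that scheme with dask-ml transformers (the column names are placeholders and this is not how the notebook actually wires it up):

from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, DummyEncoder, OrdinalEncoder

high_card = ['category_0', 'category_9']   # placeholder high-cardinality columns
low_card = ['category_5', 'category_8']    # placeholder low-cardinality columns

pipe = make_pipeline(
    Categorizer(columns=high_card + low_card),  # learn the category dtypes on fit
    OrdinalEncoder(columns=high_card),          # integer-code the big columns
    DummyEncoder(columns=low_card),             # one-hot encode the small columns
)
# Numeric columns are handled separately, e.g. df[numeric] = df[numeric].fillna(0).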

This spawned a few todos:

  1. Implement log_loss / neg_log_loss (may just do that this afternoon)
  2. https://github.com/dask/dask/issues/3090 would be useful instead of the values_extractor. This would also help in general with implementing support for dask dataframe in some estimators that only handle dask arrays (e.g. QuantileTransformer).
  3. There's a noticeable delay between calling pipe.fit(X, y) and things showing up on the dashboard. I think we're submitting a very large number of tasks (4600 calls to getitem alone). See what's going on here.
  4. Our pre-processing transformers should gracefully handle missing values like scikit-learn is starting to with 0.20.

Once all the dependent PRs are in, I'll probably try it out on a cluster to see how things look.

mrocklin commented 6 years ago

Nice notebook!

I'm curious, what does the score mean in this case? Given the imbalance that @stsievert noticed (3% click rate) I wonder if a score of 0.95 is good or not :)

Are there ways that we should be handling things given that the output is so imbalanced? For example when we're sampling we might choose a more balanced set. I imagine that there are also fancy weighting things one can do, though I don't have the experience to know what these are.
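One simple knob along those lines (a sketch, not something used in the notebooks here) is scikit-learn's class_weight option, which reweights samples inversely to class frequency so the ~3% of clicks aren't drowned out:

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log', class_weight='balanced')

LogisticRegression accepts the same argument.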

TomAugspurger commented 6 years ago

The benchmark accuracy (guessing all 0s) is 0.97, so 0.95 is impressively bad :)

I'll implement a log-loss score, and then we can start tuning the hyperparameters of the pipeline.

After that we can investigate strategies for the imbalanced dataset (https://github.com/dask/dask-ml/issues/317)

TomAugspurger commented 5 years ago

For posterity, I blogged about the feature preprocessing part of this on https://tomaugspurger.github.io/sklearn-dask-tabular.html. I plan to work on the hyperparameter optimization part next (not sure when). https://gist.github.com/TomAugspurger/4a058f00b32fc049ab5f2860d03fd579 has the pieces.

mrocklin commented 5 years ago

@TomAugspurger I enjoyed playing with your criteo example notebook. Thanks for pushing that up.

I was starting with a different variant of the dataset that I had stored, where the text columns were still stored as text rather than categoricals. Rather than globally converting, I decided to stick with this to see what came out of it. I replaced the custom HashingEncoder and the OneHotEncoder with just a HashingVectorizer. Things ran, although the result predicts no clicks on the training set itself with the default hyperparameters.

mrocklin commented 5 years ago

OK, here is a gist with my latest attempt: https://gist.github.com/233810e6813e7fd5b5a40f08bde02758

This depends on a bunch of small PRs recently submitted. I added notes on future work at the bottom of the notebook; I'll reproduce them here.

Future Work

ogrisel commented 5 years ago

An alternative to setting class_weights would be to rebalance the training set by undersampling the majority class so that the positive class is represented around 30% of the time, for instance. If we do several passes over the training data, each pass could scan all the positives but a different subset of the negatives.

To compute the performance metrics on the held-out test split (e.g. the last month), such as a precision-recall curve, one should use the full test set without any resampling.

Note that scikit-learn does not have the API to do resampling in a pipeline yet, so I would advise doing this resampling loop manually for now.
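A minimal sketch of that manual loop; preprocess is a hypothetical helper standing in for whatever pipeline turns a pandas chunk into a numeric feature matrix, and the sampling fraction is only approximate:

import dask.dataframe as dd
from sklearn.linear_model import SGDClassifier

df = dd.read_parquet('day-0-bytes.parquet')
pos, neg = df[df.click == 1], df[df.click == 0]

clf = SGDClassifier(loss='log')
for _ in range(5):
    # All positives plus a fresh ~7% of the negatives, so that positives make
    # up roughly 30% of each pass (clicks are about 3% of the raw data).
    chunk = dd.concat([pos, neg.sample(frac=0.07)]).compute()
    X, y = preprocess(chunk), chunk['click'].values  # preprocess: hypothetical helper
    clf.partial_fit(X, y, classes=[0, 1])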

ogrisel commented 5 years ago

You might also want to tweak the value of the regularizer of the SGDClassifier, that is, decrease the default value of alpha.
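For example, on an in-memory sample a small grid search over alpha could look like this (X_sample and y_sample are placeholders; SGDClassifier's default alpha is 1e-4):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    SGDClassifier(loss='log'),
    {'alpha': np.logspace(-7, -3, 5)},  # include values well below the default
    scoring='neg_log_loss',
)
# search.fit(X_sample, y_sample)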

Sandy4321 commented 4 years ago

Has somebody tried to run this locally? For example, like https://github.com/rambler-digital-solutions/criteo-1tb-benchmark