ChenglongChen / tensorflow-DeepFM

Tensorflow implementation of DeepFM for CTR prediction.
MIT License

memory error for big data #5

Closed yufengwhy closed 6 years ago

yufengwhy commented 6 years ago

I tried your code on an ad dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data). It has 40+ million samples with 30+ thousand features in total, which is 100X larger than the one used in this repository.

Then I got a memory error. Do you have any suggestions? Thank you.

ChenglongChen commented 6 years ago

In this implementation, I load all the data at once and hold it in memory. For a large dataset, you probably want to use streaming, i.e., load a batch of data, embed it, and then train on it. For the embedding step, you have to get the id of each value (i.e., key-value pairs in the vocabulary), and you might have to first scan the whole dataset to construct the vocab. Alternatively, you can try the hashing trick as mentioned in the fastText paper. This avoids the pre-scan over the whole dataset and lets you start training on the fly.
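
Something like this minimal sketch (not code from this repo; the bucket size, batch size, column names, and the `fit_on_batch` call are illustrative assumptions), streaming batches from the Avazu CSV and hashing `field=value` pairs straight to indices so no vocabulary pre-scan is needed:

```python
import hashlib

import pandas as pd

HASH_BUCKET_SIZE = 2 ** 20   # illustrative; pick based on available memory
BATCH_SIZE = 4096            # illustrative


def hash_index(field, value, num_buckets=HASH_BUCKET_SIZE):
    # Map "field=value" to a stable integer id in [0, num_buckets)
    # without scanning the whole dataset first to build a vocabulary.
    key = "{}={}".format(field, value).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % num_buckets


def batch_stream(csv_path, feature_cols, label_col="click", batch_size=BATCH_SIZE):
    # Read the file chunk by chunk so the full 40M rows never sit in memory.
    for chunk in pd.read_csv(csv_path, chunksize=batch_size):
        Xi = [[hash_index(c, row[c]) for c in feature_cols]
              for _, row in chunk.iterrows()]
        Xv = [[1.0] * len(feature_cols) for _ in range(len(chunk))]  # one-hot style values
        y = chunk[label_col].values
        yield Xi, Xv, y


# Illustrative usage (column names are placeholders for the Avazu schema):
# for Xi_batch, Xv_batch, y_batch in batch_stream("train.csv",
#                                                 ["site_id", "app_id", "device_id"]):
#     dfm.fit_on_batch(Xi_batch, Xv_batch, y_batch)
```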

yufengwhy commented 6 years ago

Thank you so much for your answer. I will try.

yufengwhy commented 6 years ago

@ChenglongChen In the fastText paper: "we use a hashing function that maps n-grams to integers in 1 to K."

Suppose there are 10 thousand features and K is much bigger (say 1000 thousand); then the embedding layer we initialize is 1000 thousand × dimension, which will cost much more memory, right?

ChenglongChen commented 6 years ago

No. K is normally chosen to be much smaller than the original number of unique features. For example, in text classification, say there are 1000 thousand unique ngrams in your corpus. You choose K to be 10 thousand; then every one of those 1000 thousand ngrams is mapped to an index in the range 1~K via the hashing trick. As you can see, there will be hash collisions, i.e., different ngrams might be mapped to the same index. For more info, see the hashing trick on Wikipedia, or HashingVectorizer in scikit-learn.
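
For example, a quick sketch with scikit-learn's HashingVectorizer (the `n_features` value is arbitrary, just to show that the hashed dimensionality stays fixed no matter how many distinct ngrams the corpus contains):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Every ngram is mapped to a column via hash(ngram) % n_features, so the
# output width is fixed at n_features; distinct ngrams may collide on a column.
vec = HashingVectorizer(n_features=2 ** 14, ngram_range=(1, 2))
X = vec.transform(["deep fm for ctr prediction",
                   "hashing keeps the feature space bounded"])
print(X.shape)  # (2, 16384)
```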

yufengwhy commented 6 years ago

@ChenglongChen Thanks for your answer. I just read all of this, but I am still confused. If K is normally chosen to be much smaller, then hash collisions may be severe. Isn't that a problem, since we must map a different index to each ngram?

ChenglongChen commented 6 years ago

Sorry for the confusion. "K is normally chosen to be much smaller than the original number of unique features" refers to dealing with the "out of memory" issue, where you cannot hold the entire embedding matrix due to a large vocabulary size (a.k.a. K). In general, the hashing trick is used to generate feature indices on the fly, which you can use in any machine learning model, not just embeddings. So, depending on your machine and application, you can choose a large K.

As for "we must map a different index to each ngram", it depends. Hash collisions can act as a kind of regularization, which might help the model generalize.

yufengwhy commented 6 years ago

I see, thank you again~