amazon-archives / amazon-dsstne

Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon-developed library for building Deep Learning (DL) machine learning (ML) models
Apache License 2.0

only about 1000 items out of 500,000 items in prediction result #147

Open lightsailpro opened 6 years ago

lightsailpro commented 6 years ago

I followed the MovieLens sample with the exact same config.json file (see below). My dataset has about 5 million users with about 100 million records on about 500,000 items/titles. I then predict the top 20 items per user, but the prediction results contain only about 1,000 distinct titles/items. I expected many more distinct items, since the dataset has about 500,000 of them. Could anyone help?

{ "Version" : 0.7, "Name" : "AE", "Kind" : "FeedForward",
"SparsenessPenalty" : { "p" : 0.5, "beta" : 2.0 },

"ShuffleIndices" : false,

"Denoising" : {
    "p" : 0.2
},

"ScaledMarginalCrossEntropy" : {
    "oneTarget" : 1.0,
    "zeroTarget" : 0.0,
    "oneScale" : 1.0,
    "zeroScale" : 1.0
},
"Layers" : [
    { "Name" : "Input", "Kind" : "Input", "N" : "auto", "DataSet" : "gl_input", "Sparse" : true },
    { "Name" : "Hidden", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 128, "Activation" : "Sigmoid", "Sparse" : true },
    { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected", "DataSet" : "gl_output", "N" : "auto", "Activation" : "Sigmoid", "Sparse" : true }
],

"ErrorFunction" : "ScaledMarginalCrossEntropy"

}

scottlegrand commented 6 years ago

So I'd have to see your data to be sure, but how are your items distributed in your training set?

If they follow a power law (Zipf distribution), your outputs might be heavily biased towards those 1,000 items, and the network is just learning that subset. I'm not saying that's the case, but it is the first thing that comes to mind here.
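
One quick way to check is to count item frequencies in the DSSTNE input file and see how much of the data the head items cover. A minimal Python sketch, assuming the tab-separated format shown later in this thread (userid, then colon-separated itemid,timestamp pairs):

    from collections import Counter

    counts = Counter()
    with open("dsstne_n_year.txt") as f:               # input file name from later in the thread
        for line in f:
            items = line.rstrip("\n").partition("\t")[2]   # drop the user id column
            for pair in items.split(":"):
                if pair:
                    counts[pair.split(",")[0]] += 1        # item id is the part before the comma

    total = sum(counts.values())
    head = sum(c for _, c in counts.most_common(1000))
    print("distinct items:", len(counts))
    print("top-1000 items cover %.1f%% of all interactions" % (100.0 * head / total))

If that coverage number is very high, top-20 predictions collapsing onto roughly 1,000 items would not be surprising.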

scottlegrand commented 6 years ago

Also, try this network out; it gives the best performance on MovieLens I've seen...

{ "Version" : 0.8, "Name" : "AIV NNC", "Kind" : "FeedForward",

"ShuffleIndices" : false,

"ScaledMarginalCrossEntropy" : {
    "oneTarget" : 1.0,
    "zeroTarget" : 0.0,
    "oneScale" : 1.0,
    "zeroScale" : 1.0
},
"Layers" : [
    { "Name" : "Input", "Kind" : "Input", "N" : "auto", "DataSet" : "gl_input", "Sparse" : true }, 
    { "Name" : "Hidden1", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 1536, "Activation" : "Relu", "Sparse" : false, "pDropout" : 0.37, "WeightInit" : { "Scheme" : "Gaussian", "Scale" : 0.01 } },
    { "Name" : "Hidden2", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 1536, "Activation" : "Relu", "Sparse" : false, "pDropout" : 0.37, "WeightInit" : { "Scheme" : "Gaussian", "Scale" : 0.01 } },  
    { "Name" : "Hidden3", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 1536, "Activation" : "Relu", "Sparse" : false, "pDropout" : 0.37, "WeightInit" : { "Scheme" : "Gaussian", "Scale" : 0.01 } },  
    { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected",  "DataSet" : "gl_output", "N" : "auto", "Activation" : "Sigmoid", "Sparse" : true , "WeightInit" : { "Scheme" : "Gaussian", "Scale" : 0.01, "Bias" : -10.2 }}
],

"ErrorFunction" : "ScaledMarginalCrossEntropy"

}

lightsailpro commented 6 years ago

I tried the above "AIV NNC" config with my dataset. I got "cudaMalloc GpuBuffer::Allocate failed out of memory" even with batch size 8 on a K40c GPU with 12 GB of GPU RAM. But with the default MovieLens sample "AE" config, I was able to run with batch size 896. Do I need to adjust network parameters to handle a large dataset? I would assume DSSTNE can handle datasets larger than mine. Any suggestions would be appreciated.

NNNetwork::NNNetwork: 1 input layer
NNNetwork::NNNetwork: 1 output layer
train: ../engine/GpuTypes.h:463: void GpuBuffer<T>::Allocate() [with T = float]: Assertion `0' failed.
NNWeight::NNWeight: Allocating 5092147200 bytes (1536, 828800) for fully connected weights between layers Input and Hidden1
NNWeight::NNWeight: Allocating 9437184 bytes (1536, 1536) for fully connected weights between layers Hidden1 and Hidden2
NNWeight::NNWeight: Allocating 9437184 bytes (1536, 1536) for fully connected weights between layers Hidden2 and Hidden3
NNWeight::NNWeight: Allocating 5092147200 bytes (828800, 1536) for fully connected weights between layers Hidden3 and Output
cudaMalloc GpuBuffer::Allocate failed out of memory
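
For reference, the two 5,092,147,200-byte allocations in that log are the dense weight matrices between the 828,800-wide input/output layers and the 1,536-unit hidden layers. A rough back-of-the-envelope, using the sizes from the log:

    vocab = 828800                             # input/output width ("N" : "auto" resolved from the dataset)
    hidden = 1536
    bytes_per_float = 4

    w_in = vocab * hidden * bytes_per_float    # Input -> Hidden1: 5,092,147,200 bytes
    w_out = hidden * vocab * bytes_per_float   # Hidden3 -> Output: 5,092,147,200 bytes
    print((w_in + w_out) / 1e9, "GB")          # ~10.2 GB for the two big matrices alone

Those two matrices alone take about 10.2 GB before gradients and activations, which would explain why a 12 GB K40c runs out of memory regardless of batch size.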

slegrandA9 commented 6 years ago

So if you were on a Pascal or later GPU, you could stream the dataset from system memory, but you are not, so...

How many GPUs do you have in this machine? If you have more than one, try putting "mpirun -np 2" at the front of your DSSTNE command and it will spread the model and data across 2 GPUs.

Alternatively, if you have just 1 GPU, try changing 1536 in the network config to 768.

lightsailpro commented 6 years ago

I only have one NVIDIA K40c GPU. I changed N from 1536 to 256 and the batch size to 512. It's running now. Let's see if the new model can predict more items.

scottlegrand commented 6 years ago

Lower your learning rate.

On Nov 20, 2017 4:23 AM, "lightsailpro" notifications@github.com wrote:

The new model is bad. The average error becomes larger and larger with each epoch.

NNNetwork::Train: Detected network divergence, attempting recovery.

lightsailpro commented 6 years ago

Where do I specify the learning rate for training, and what value should I start with? I am trying to work around the following divergence issue. Thanks in advance.

NNNetwork::Train: Minibatch@3852288, average error 179.188812, (176.930328 training, 2.258480 regularization), alpha 0.025000
NNNetwork::Train: Minibatch@3852800, average error 696.496094, (694.237549 training, 2.258516 regularization), alpha 0.025000
NNNetwork::Train: Detected network divergence, attempting recovery.

scottlegrand commented 6 years ago

This is the "alpha" parameter to the Train command, looking at the source code (Train.cpp):

// Hyper parameters
float alpha = stof(getOptionalArgValue(argc, argv, "-alpha", "0.025f"));
float lambda = stof(getOptionalArgValue(argc, argv, "-lambda", "0.0001f"));
float mu = stof(getOptionalArgValue(argc, argv, "-mu", "0.5f"));

lambda is your regularization coefficient, and mu is the momentum term for RMSProp, AdaGrad, AdaDelta, and Nesterov Momentum...

Scott

scottlegrand commented 6 years ago

PS: looking at my own code, try 0.01 to start...

lightsailpro commented 6 years ago

Thanks, Scott. I am running the training with alpha=0.01 now. I checked the distinct predicted item count from the last run (after 1 epoch, diverged); it is surprisingly still 852, which is identical to the distinct item count of the (converged) result using the default MovieLens config file. I am puzzled that two different networks generated exactly the same count, and I wonder if I did something wrong in the process. Here are the commands I am using.

dsstne_n_year.txt sample rows (one row per user: sample_userid, then itemid,timestamp pairs separated by ":"):

1-1008    8752,1499631783:1473,1499632570
2-459     1170,1480168269
..........

/home/ml/amazon-dsstne/src/amazon/dsstne/bin/generateNetCDF -d gl_input -i dsstne_n_year.txt -o gl_input.nc -f features_input -s samples_input -c
/home/ml/amazon-dsstne/src/amazon/dsstne/bin/generateNetCDF -d gl_output -i dsstne_n_year.txt -o gl_output.nc -f features_output -s samples_input -c
/home/ml/amazon-dsstne/src/amazon/dsstne/bin/train -c config.json -i gl_input.nc -o gl_output.nc -n gl.nc -b 512 -e 1
/home/ml/amazon-dsstne/src/amazon/dsstne/bin/predict -b 512 -d gl -i features_input -o features_output -k 24 -n gl.nc -f dsstne_n_year.txt -s recs -r dsstne_n_year.txt

lightsailpro commented 6 years ago

The issue was accidentally closed. Reopening it.

lightsailpro commented 6 years ago

Hi, Scott: With alpha=0.01, the job finished successfully, but I still only get 852 distinct items (out of half a million items) in the prediction results for all of the millions of users, and those items are basically the most popular ones. Is the DSSTNE neural network model only able to learn the most popular items, rather than co-occurring items? To me, the recommendation result is not very helpful, since I can generate the most popular items rather easily without a GPU working for many hours. I still feel that I may have missed something here. Any help would be appreciated.

scottlegrand commented 6 years ago

So are you filtering out items the viewer has already purchased/viewed?

Second, instead of running this as an autoencoder over the viewing history, you could try to predict the most recent (in training) or next (in test) item. This is what was done here (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf). It requires a different sort of network, but where DSSTNE shines is in its ability to model the full input and output layers of large datasets and to run them with sparse data like recommendations data.
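
One way to set that up with the existing tooling would be to split each user's history into an input file (everything but the most recent item) and an output file (just the most recent item) before running generateNetCDF. A minimal sketch, assuming the same tab-separated format and that timestamps order the items; the output file names are just placeholders:

    # Turn the autoencoder-style history file into a "predict the most recent item" dataset.
    with open("dsstne_n_year.txt") as src, \
         open("next_item_input.txt", "w") as inp, \
         open("next_item_output.txt", "w") as out:
        for line in src:
            user, _, items = line.rstrip("\n").partition("\t")
            pairs = [p for p in items.split(":") if p]
            if len(pairs) < 2:
                continue                                       # need history plus a target
            pairs.sort(key=lambda p: int(p.split(",")[1]))     # order by timestamp
            inp.write(user + "\t" + ":".join(pairs[:-1]) + "\n")   # history as input
            out.write(user + "\t" + pairs[-1] + "\n")               # most recent item as target

The two files would then go through generateNetCDF as separate input and output datasets, the same way gl_input and gl_output are built above.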

Scott

lightsailpro commented 6 years ago

Hi, Scott: From my understanding, with the same "-r" and "-f" parameter values in "predict -b 512 -d gl -i features_input -o features_output -k 24 -n gl.nc -f dsstne_n_year.txt -s recs -r dsstne_n_year.txt", I am actually filtering out the items users have already purchased/viewed. To be honest, I am still a little confused by those command line parameters, so please correct me if I am not using the correct syntax. I feel that my dataset is as sparse as the DSSTNE model can handle, and I do not want to give up and try something else too early. Also, I am still new to neural networks; creating a brand new network definition from scratch or from a paper is a challenge for me. If you know of any predefined, DSSTNE-ready network config.json that may fit my dataset, please let me know. Currently I mainly use a co-occurrence model, but I want to try the neural network approach, and DSSTNE is the only framework that covers the whole process from data preparation to prediction generation. Thanks for your suggestions. Ideally, if you could publish a model zoo like other frameworks (e.g. TensorFlow, Caffe) do, that would be great.

For the click-sequence prediction approach, I previously tried a Java-based SPMF model (http://www.philippe-fournier-viger.com/spmf/), but it is memory-based and could not handle my dataset size, and I do not want to down-sample my data. I have been looking for a neural network version, but no luck so far. I read the paper you mentioned, but as I said, it is daunting for me to create the actual model from scratch and plug in the data. It would be great to have a sample click-sequence model implemented with the DSSTNE framework.

scottlegrand commented 6 years ago

Is this data highly confidential? I would love a sample of maybe 5% of your training set if you can provide it in a way that sanitizes out any identifiable information besides numbers. It's hard for me to guess what is actually going on right now, but I'm happy to build something to parse this data for you, train on it, and generate recommendations. In that case, I would indeed use this as the seed of a model zoo for people. It gets 53% precision@1 on MovieLens, significantly better than collaborative filtering approaches.

lightsailpro commented 6 years ago

Hi, Scott, thanks for the follow-up. Sharing the data will be difficult. What I will do is try the MovieLens dataset again using my script and process, and check the prediction item coverage. precision@1 is the most important KPI, but in my case I also need good item coverage in the predictions. This will isolate whether there is anything wrong with my script.

rgeorgej commented 6 years ago

The configuration for ScaledMarginalCrossEntropy has to be adjusted when you have a lot of popular items. With the configuration you are using, you are penalizing ones and zeros the same way. Have you tried setting a higher scale for when a one target is missed? That way zeros are not penalized as heavily:

"ScaledMarginalCrossEntropy" : { "oneTarget" : 1.0, "zeroTarget" : 0.0, "oneScale" : 100.0, "zeroScale" : 1.0 },

lightsailpro commented 6 years ago

Hi, rgeorgej: I followed your suggestion and changed oneScale to 100. With the exact same dataset, I still only got 852 distinct items in the prediction result set (predicting the top 24 items for each user). See below for the config.json file.

{ "Version" : 0.8, "Name" : "AIV NNC", "Kind" : "FeedForward",

"ShuffleIndices" : false,

"ScaledMarginalCrossEntropy" : {
    "oneTarget" : 1.0,
    "zeroTarget" : 0.0,
    "oneScale" : 100.0,
    "zeroScale" : 1.0
},
"Layers" : [
    { "Name" : "Input", "Kind" : "Input", "N" : "auto", "DataSet" : "gl_input", "Sparse" : true },
    { "Name" : "Hidden1", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 256, "Activation" : "Relu", "Sparse" : false, "pDropout" : 0.37, "WeightInit" : { "Scheme" : "Gaussian", "Scale" : 0.01 } },
    { "Name" : "Hidden2", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 256, "Activation" : "Relu", "Sparse" : false, "pDropout" : 0.37, "WeightInit" : { "Scheme" : "Gaussian", "Scale" : 0.01 } },
    { "Name" : "Hidden3", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 256, "Activation" : "Relu", "Sparse" : false, "pDropout" : 0.37, "WeightInit" : { "Scheme" : "Gaussian", "Scale" : 0.01 } },
    { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected", "DataSet" : "gl_output", "N" : "auto", "Activation" : "Sigmoid", "Sparse" : true, "WeightInit" : { "Scheme" : "Gaussian", "Scale" : 0.01, "Bias" : -10.2 } }
],

"ErrorFunction" : "ScaledMarginalCrossEntropy"

}

scottlegrand commented 6 years ago

Another method to try is bagging your data before feeding it to DSSTNE: sample your training set in inverse proportion to how popular its targets are. This should make the rarer stuff effectively less rare. Also, don't use oneScale at 100 on the 3-layer dropout network I gave you; it works better for the network with a single hidden layer.
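
As for the bagging idea, here is a minimal sketch of that kind of resampling, again assuming the tab-separated input format; the exact weighting (driven by each row's rarest item and softened with a square root) is only illustrative:

    import random
    from collections import Counter

    random.seed(0)
    with open("dsstne_n_year.txt") as f:
        rows = [line.rstrip("\n") for line in f]

    # Count how often each item appears across the whole training set.
    counts = Counter()
    for row in rows:
        for pair in row.partition("\t")[2].split(":"):
            if pair:
                counts[pair.split(",")[0]] += 1
    min_count = min(counts.values())

    # Keep each row with a probability that shrinks as its items get more popular.
    with open("dsstne_n_year_bagged.txt", "w") as out:
        for row in rows:
            items = [p.split(",")[0] for p in row.partition("\t")[2].split(":") if p]
            if not items:
                continue
            rarest = min(counts[i] for i in items)
            if random.random() < (min_count / rarest) ** 0.5:
                out.write(row + "\n")

The resampled file would then replace dsstne_n_year.txt in the generateNetCDF step.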

bkj commented 6 years ago

@scottlegrand Can you explain a little more about what you mean by "53% precision@1 on MovieLens"? What was the prediction task there, exactly?

EDIT: I'm guessing it's related to slides 47 and 48 here