amazon-archives / amazon-dsstne

Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon-developed library for building Deep Learning (DL) machine learning (ML) models
Apache License 2.0

Classification versus Recommendation #47

Open skilgall opened 8 years ago

skilgall commented 8 years ago

I followed your tutorial and wanted to apply dsstne to a different project. It seems that the only output of the predict method is one that generates recommendations with the trained net and I want classification output.

I tried training a feedforward network with the output layer being classification data to all my training instances, but the output it generated doesn't seem right. I am hoping there is a predict method call for this situation.

  1. Is there a way to produce classification output from a trained network?
  2. Is there a max number of features for a training instance? (I tried 20,000 initially but got a 'std::bad_alloc' error. 10,000 produced no error)
romerocesar commented 8 years ago

Can you share the input and output of your call to predict?

Cesar


skilgall commented 8 years ago

Input file looks like this:

Document1 1:1:0:0:0...
Document2 1:0:1:0:0...

Output file looks like this:

Document1 1
Document2 0

I use the generateNetCDF method the same way as the tutorial, only using the output file in the output call. I've tried two different versions of the predict call:

predict -b 1024 -d gl -i features_input -o features_output -k 10 -n gl.nc -s recs -r input_file -f input_file

predict -b 1024 -d gl -i features_input -o features_output -n gl.nc -s recs -r input_file -f input_file -l Output

Either way I get an output in the recs file that looks like this for all documents:

Document1 1,0.000:0,0.000:
Document2 1,0.000:0,0.000:
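For readers following along, the recs lines above pair each candidate label with a score, separated by commas, with pairs delimited by colons. A minimal sketch of a parser for that format (the `parse_recs` helper is hypothetical, not part of DSSTNE):

```python
# Hypothetical helper: parse one DSSTNE recs line of the form
# "Document1<TAB>label,score:label,score:" into (doc_id, [(label, score), ...]).
def parse_recs(line):
    doc, _, rest = line.partition("\t")
    pairs = [p for p in rest.split(":") if p]  # drop the trailing empty field
    return doc, [(label, float(score))
                 for label, score in (p.split(",") for p in pairs)]

doc, scores = parse_recs("Document1\t1,0.000:0,0.000:")
print(doc, scores)  # Document1 [('1', 0.0), ('0', 0.0)]
```

All-zero scores across every document, as reported here, suggest the network output is not being read back correctly rather than a merely poorly trained model.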

skilgall commented 8 years ago

Do you have any suggestions to get non zero output from the net?

Is there anything wrong with my call to predict?

Thanks in advance

rgeorgej commented 8 years ago

I think there is an issue with the way you have been training, since DSSTNE is optimized for sparse kernels. You only need to pass the indexes of the labels that are 1; you don't need to pass the labels that are zero. From the example you have given, it is assumed that label 1 and label 0 are active all the time. Instead, can you pass only the indexes which are 1?
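The conversion rgeorgej describes, dense 0/1 vectors to index-only sparse lines, can be sketched as follows (the `to_sparse_indexes` helper and 1-based indexing are assumptions for illustration, matching the examples later in the thread):

```python
# Hypothetical helper: keep only the (1-based) indexes of features that
# are active, which is what DSSTNE's sparse format expects.
def to_sparse_indexes(dense):
    return [i for i, v in enumerate(dense, start=1) if v == 1]

# Document1 1:1:0:0:0  ->  features 1 and 2 are active
print(to_sparse_indexes([1, 1, 0, 0, 0]))  # [1, 2]
# Document2 1:0:1:0:0  ->  features 1 and 3 are active
print(to_sparse_indexes([1, 0, 1, 0, 0]))  # [1, 3]
```

Listing only active indexes is also what keeps memory proportional to the number of nonzeros, which is relevant to the std::bad_alloc seen at 20,000 dense features.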

skilgall commented 8 years ago

I took your advice and, following the example input data I used in my first post, converted it to:

Document1 1,1:2,1:...
Document2 1,1:3,1:...

I've tried multiple output formats; this is the only one that returns output other than 1 or 0:

Document1 1,1 <- Class Value 1
Document2 2,1 <- Class Value 2

(I've also tried these formats with no success:

Document1 1,1:2,0
Document2 1,0:2,1

and

Document1 1,0
Document2 1,1)

This gives output 1,0.788:2,0.992: for every single instance. (The other config formats give 1,1.000:2,1.000 for all instances, or 1,1.0000 for every instance.) The number of epochs affects the output in the wrong way: error goes up when epochs increase from 10 to 100.

Can you see anything wrong with my methods? I don't understand how the output should be formatted in a classification example and especially that I am getting the same prediction for each instance.

rgeorgej commented 8 years ago

Can you try using the following as input:

Document1 1:2
Document2 1:3

And Output as

Document1 1
Document2 2

Ensure that the separator between the document ID and the features is a tab. Also, can you send us the command you tried and attach your sample document?
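The tab requirement is easy to get wrong when files are edited by hand. A minimal sketch of a writer that guarantees it (the `format_line` helper is hypothetical, not part of DSSTNE):

```python
# Hypothetical helper: emit one DSSTNE input line with a real TAB
# between the document id and the colon-separated feature indexes.
def format_line(doc_id, indexes):
    return doc_id + "\t" + ":".join(str(i) for i in indexes)

line = format_line("Document1", [1, 2])
print(repr(line))      # repr makes the tab visible: 'Document1\t1:2'
assert "\t" in line    # a run of spaces will NOT be accepted as the separator
```

Generating the files programmatically like this avoids editors that silently convert tabs to spaces.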

skilgall commented 8 years ago

I tried exactly those input and output file formats and received this for every instance:

1,0.000:2,0.000:

These are the commands that I have been using:

generateNetCDF -d gl_input -i inputfile -o gl_input.nc -f features_input -s samples_input -c
generateNetCDF -d gl_output -i outputfile -o gl_output.nc -f features_output -s samples_input -c

train -c config.json -i gl_input.nc -o gl_output.nc -n gl.nc -b 256 -e 10

predict -b 1024 -d gl -i features_input -o features_output -k 10 -n gl.nc -s recs -r inputfile -f inputfile

Attached are my input and output files Archive.zip

scottlegrand commented 8 years ago

You have an interesting case here. You have 1000 input features, of which an average of 256 are on for a given datapoint. I am guessing the sparse kernels here will not behave efficiently, but I do believe I can detect this situation and still keep storage efficient. I am working on a simple program to build your data set correctly.

scottlegrand commented 8 years ago

So since I don't have write access to the github repo, I'm attaching a short program to create DSSTNE-compatible data.

Observations:

  1. DSSTNE was designed for data density of 1% or less. Your input example is just under 26%. What this means is that you may have run into a performance black hole for DSSTNE. I can fix that; the MovieLens 20M example (which is 0.4% dense) accidentally works around the problem by exploiting sparse storage efficiency but not the sparse training code.
  2. I am going to build a CSV importer for DSSTNE. I'm going to base it on the CSV file used in the TensorFlow example "wide and deep" https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html. Is there any other format for which you'd like import capability?
  3. Any code changes I make will most likely end up in my fork of the code at https://github.com/scottlegrand/amazon-dsstne-1 because the Amazon DSSTNE guys are too busy to process pull requests ATM.
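The density figure in point 1 follows directly from the thread's numbers (1000 features, an average of 256 active per datapoint):

```python
# Sanity-check the density estimate quoted above.
features, active_per_row = 1000, 256
density = active_per_row / features
print(f"{density:.1%}")  # 25.6%, i.e. just under 26%, far above the ~1% target
```

At 25x the intended density, the sparse kernels spend most of their time on bookkeeping that a dense kernel would skip, which is the "performance black hole" described above.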
scottlegrand commented 8 years ago

dparse.cpp and a slightly modified config_1000.json attached here. To build dparse, type:

g++ -o dparse dparse.cpp -lnetcdf_c++4 -lnetcdf -lm -std=c++0x -L

config_1000.json has been changed to use "input" and "output" as the dataset names.

issue47.zip