apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Lazy sparse param initialization for distributed training #10835

Open casscw opened 6 years ago

casscw commented 6 years ago

Description

In the wide & deep model, a categorical feature 'sex' has three values: 0, 1, and 2. It is fed into a SparseEmbedding layer, as in the code below.

import mxnet as mx

# 'embed' holds the categorical ids (a symbol defined elsewhere in the model)
embed_weight = mx.symbol.Variable('single_2_embed_weight', stype='row_sparse')
out = mx.symbol.contrib.SparseEmbedding(data=embed, weight=embed_weight, input_dim=100000, output_dim=32)

Environment info (Required)

Linux, mxnet-1.0.1, GPU, Python 2.7

Detail

Inspecting the model's params shows that 'single_2_embed_weight' has values in every row, i.e. it is not sparse:

# data holds the model's params [checkpoint-0000.params]
>>> data = mx.nd.load('model/checkpoint-0000.params')
>>> data['arg:single_2_embed_weight']

<RowSparseNDArray 100000x32 @cpu(0)>
>>> data['arg:single_2_embed_weight'].indices

[    0     1     2 ... 99997 99998 99999]
<NDArray 100000 @cpu(0)>
>>> data['arg:single_2_embed_weight'].data   

[[-0.00308852  0.00507648 -0.00339155 ...  0.00893246  0.00952273
   0.00661108]
 [ 0.00706272  0.00775839 -0.00312851 ... -0.00537404  0.0012673
  -0.00746952]
 [ 0.00495503  0.00214614  0.00021955 ...  0.00096997  0.00675734
  -0.00131243]
 ...
 [-0.00625316 -0.00615439  0.00471845 ... -0.00657647 -0.00150909
  -0.00539171]
 [-0.00908917 -0.00464959 -0.00893743 ... -0.00607294 -0.00969465
  -0.00399319]
 [-0.00455761  0.00734029 -0.00513125 ...  0.00258441  0.0042576
  -0.00019451]]
<NDArray 100000x32 @cpu(0)>

Why, when the categorical input has only three values (or very few values), is its embedding weight dense (values in every row) instead of sparse (only a few rows having values)?
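For example, what I expected is a weight like the following, where only the three seen rows are materialized (a sketch with made-up values):

import mxnet as mx

# expected layout: only the 3 seen category rows are actually stored
expected = mx.nd.sparse.row_sparse_array(
    (mx.nd.random.normal(shape=(3, 32)), mx.nd.array([0, 1, 2])),
    shape=(100000, 32))
print(expected.indices)  # [0 1 2] -- 3 stored rows instead of 100000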

Demo

https://gist.github.com/casscw/2e7a436704ead8804261f8b13e84f1a1

moveforever commented 6 years ago

@eric-haibin-lin

moveforever commented 6 years ago

@ZiyueHuang

eric-haibin-lin commented 6 years ago

Lazy initialization for each row in the row_sparse weight parameter is an optimization we haven't done yet. You're right - ideally a row would be initialized only when its category is first seen. Currently all rows are filled in one shot during initialization. What sparse embedding provides is the ability to retrieve only the parameters for the categories in the current mini-batch (instead of loading the full model). For example, you might keep the full model on CPU and, for each mini-batch, load only the rows for the seen categories onto GPU and perform the forward and backward passes.
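A minimal sketch of that pattern (the key name and shapes here are made up):

import mxnet as mx

# the full row_sparse weight lives in the kvstore (e.g. on CPU / parameter server)
kv = mx.kv.create('local')
weight = mx.nd.sparse.zeros('row_sparse', shape=(100000, 32))
kv.init('embed_weight', weight)

# per mini-batch: pull only the rows for the categories actually seen
row_ids = mx.nd.array([0, 1, 2], dtype='int64')
out = mx.nd.sparse.zeros('row_sparse', shape=(100000, 32))
kv.row_sparse_pull('embed_weight', out=out, row_ids=row_ids)
# 'out' materializes only rows 0, 1, 2; the other rows are never copied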

Lazy initialization is definitely worth investigating, though. What's your use case / application?

casscw commented 6 years ago

Thanks for the explanation. Just wide & deep for a recommendation system.

moveforever commented 6 years ago

In my application, a categorical feature's input dim may be ten million, while the feature actually has only two million distinct values. If lazy initialization were adopted, it would save a lot of memory.
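To put rough numbers on it (assuming, say, a 32-dimensional float32 embedding): ten million rows occupy 10,000,000 × 32 × 4 B ≈ 1.28 GB, while the two million rows actually used would need only about 256 MB.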

xuchen-plus commented 6 years ago

Is there any development plan for this issue? Lazy initialization for large row_sparse ndarrays is crucial for use cases like sparse embedding over billions of features (with a mini-batch only seeing a few thousand). @eric-haibin-lin

eric-haibin-lin commented 6 years ago

I'm not aware of anyone with spare time to work on this. If you'd like to contribute, I'm happy to discuss the design and implementation.

pengzhao-intel commented 6 years ago

mark to come back soon

pinaraws commented 5 years ago

@mxnet-label-bot add[Distributed]