apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Support for Partitioned Variables (used in large sparse models) #17591

Open QimingZheng opened 4 years ago

QimingZheng commented 4 years ago

Description

Support partitioned variables when training models with large embedding layers (e.g., recommendation systems).

Motivation

Models used in recommendation tasks, e.g. Factorization Machines [1] or DeepFM [2], are usually very large: billions of features (including user IDs and product IDs) are used.

In the setting of distributed training, if each worker holds a local copy of the embedding parameter, it can easily exceed the CPU memory of a single server. It is more appropriate to shard the embedding variable across multiple servers and let the parameter servers manage each partition, which is exactly what TF does today when training large sparse models [3].
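The idea can be illustrated with a small framework-agnostic sketch: the table is split row-wise across servers, and a lookup fetches each row from the shard that owns it. The names and the "div"-style partitioning below are purely illustrative, not an existing MXNet API.

import numpy as np

num_servers = 4        # number of parameter-server shards (illustrative)
vocab_size = 10000     # total number of embedding rows
dim = 128              # embedding width

# Each server owns a contiguous slice of rows ("div"-style partitioning).
rows_per_shard = (vocab_size + num_servers - 1) // num_servers
shards = [np.zeros((rows_per_shard, dim)) for _ in range(num_servers)]

def lookup(ids):
    # Fetch each requested row from the shard that owns it.
    out = []
    for i in ids:
        shard_id, local_row = i // rows_per_shard, i % rows_per_shard
        out.append(shards[shard_id][local_row])
    return np.stack(out)

vecs = lookup([3, 4097, 9999])   # shape (3, 128); rows come from shards 0, 1 and 3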

This motivation is also discussed in section 4.2 of the TF OSDI paper [4]. For example, TF manages a large embedding layer as follows:

params = tf.get_variable("embedding", shape=(10000, 128), dtype=tf.float32,
                         partitioner=tf.min_max_variable_partitioner(
                             max_partitions=num_ps_replicas,
                             axis=0))

tf.nn.embedding_lookup(params, ids, max_norm=None, name=None)

# params: a list of tensors all of the same shape except for the first dimension,
# representing sharded embedding tensors, or a PartitionedVariable

So in TF there is only one copy of the embedding layer globally. In MXNet, by contrast, each worker holds one copy and the parameter servers hold another, so with N workers there are N+1 copies in total. This leads to large memory consumption and makes it infeasible to train models larger than the CPU memory of a single server.

As far as I know, MXNet has no equivalent concept of partitioned variables. Is it expected to be implemented in the near future?
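For comparison, below is a minimal sketch of the closest mechanism I could find in MXNet's current sparse support: a row_sparse weight pulled through the KVStore. It lets a worker pull only the rows needed for the current batch, but the weight is still a single logical variable rather than a set of partitions placed on different servers. The exact calls are illustrative and may need adjustment.

import mxnet as mx

vocab_size, dim = 10000, 128

kv = mx.kv.create('local')   # would be 'dist_sync' / 'dist_async' in distributed mode
weight = mx.nd.zeros((vocab_size, dim), stype='row_sparse')
kv.init('embed_weight', weight)

# Pull only the rows needed for the current batch instead of the full table.
batch_ids = mx.nd.array([3, 42, 9001], dtype='int64')
kv.row_sparse_pull('embed_weight', out=weight, row_ids=batch_ids)

# Look up the pulled rows; sparse_grad=True keeps the gradient row_sparse as well.
data = mx.sym.Variable('data')
w = mx.sym.Variable('embed_weight', stype='row_sparse')
emb = mx.sym.Embedding(data=data, weight=w, input_dim=vocab_size,
                       output_dim=dim, sparse_grad=True)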

References

  1. Rendle, Steffen. "Factorization machines." 2010 IEEE International Conference on Data Mining. IEEE, 2010. https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf.
  2. Guo, Huifeng, et al. "DeepFM: a factorization-machine based neural network for CTR prediction." arXiv preprint arXiv:1703.04247 (2017). https://arxiv.org/abs/1703.04247.
  3. Embedding and Partitioned Variable in TF 2.0. https://github.com/tensorflow/community/blob/master/rfcs/20190116-embedding-partitioned-variable.md.
  4. Martín Abadi, et al. "TensorFlow: A System for Large-Scale Machine Learning." 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16). https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
QimingZheng commented 4 years ago

@mxnet-label-bot add [Sparse]

QimingZheng commented 4 years ago

@mxnet-label-bot add [Distributed]

QimingZheng commented 4 years ago

@eric-haibin-lin Could you help to take a look?

eric-haibin-lin commented 4 years ago

@QimingZheng are you looking into this for research purposes or for deploying models in real-world use cases? Currently TF has the best support for this kind of model.

Is your target model size larger than what a single machine can hold?

QimingZheng commented 4 years ago

Hi @eric-haibin-lin, for both research purposes and production requirements.

My target model size is hundreds of gigabytes (it cannot be handled by one server).