QimingZheng opened this issue 4 years ago
@mxnet-label-bot add [Sparse]
@mxnet-label-bot add [Distributed]
@eric-haibin-lin Could you help to take a look?
@QimingZheng are you looking into this for research purposes or for deploying models in real-world use cases? Currently TF has the best support for this kind of model.
Is your target model size larger than what a single machine can hold?
Hi @eric-haibin-lin, it's for both research purposes and production requirements.
My target model size will be hundreds of gigabytes (too large to be held by one server).
Description
Support partitioned variables when training models with large embedding layers (e.g. recommendation systems).
Motivation
Models used in recommendation tasks [e.g. Factorization Machines [1] or DeepFM [2]] are usually very large: billions of features (including user IDs and product IDs) are used.
In the setting of distributed training, if each worker holds a local copy of the embedding parameter, it can easily exceed the CPU memory of one server. It is more appropriate to shard the embedding variable across multiple servers and manage each partition with the parameter servers, which is exactly what TF does today when training large sparse models [3].
This motivation is also discussed in section 4.2 of the TF OSDI paper [4]. For example, TF manages a large embedding layer roughly as follows:
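Something along these lines with the TF 1.x partitioned-variable API (the table size and shard count below are illustrative, not from the paper):

```python
import tensorflow as tf  # TF 1.x API

# Shard the embedding table into 8 pieces along the vocabulary (row) axis.
# In a distributed setup (e.g. with tf.train.replica_device_setter), each
# shard is placed on a different parameter server, so no single machine
# ever holds the full table.
embedding = tf.get_variable(
    "embedding",
    shape=[10_000_000, 64],
    partitioner=tf.fixed_size_partitioner(num_shards=8, axis=0))

ids = tf.placeholder(tf.int64, shape=[None])
# embedding_lookup is partition-aware: it routes each id to the shard that
# owns the corresponding rows and gathers only those rows.
vectors = tf.nn.embedding_lookup(embedding, ids)
```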
So in TF there is only one copy of the embedding layer globally. In MXNet, by contrast, each worker holds one copy and the PS maintains another, so there are N+1 copies in total (for N workers). This causes large memory consumption and makes training models larger than the CPU memory of a single server infeasible.
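For contrast, a minimal sketch of the MXNet pattern described above, using the kvstore API (names and sizes are illustrative; 'local' stands in for 'dist_sync', which requires the distributed launcher):

```python
import mxnet as mx

vocab_size, dim = 1_000_000, 64  # illustrative; real tables are far larger

# Full embedding weight allocated on this worker (one copy per worker).
weight = mx.nd.zeros((vocab_size, dim))

# 'dist_sync' would be used in real distributed training; 'local' keeps this runnable.
kv = mx.kv.create('local')
kv.init('embed_weight', weight)  # the parameter server holds another full copy

# Each step: push gradients, then pull the full weight back to the worker.
grad = mx.nd.zeros((vocab_size, dim))
kv.push('embed_weight', grad)
kv.pull('embed_weight', out=weight)
```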
As far as I know, MXNet has no equivalent concept of partitioned variables. Is it expected to be implemented in the near future?
References