Closed seiriosPlus closed 3 years ago
Hi! We've received your issue; please be patient while we arrange for engineers to answer it as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also consult the official API documentation, FAQ, historical issues, and the AI community for an answer. Have a nice day!
We don't have a BF16 optimizer or training interface. But you can refer to the static FP16 training interface: https://github.com/PaddlePaddle/models/blob/release/2.0-beta/PaddleNLP/benchmark/bert/run_pretrain_single.py#L241
amp_list = paddle.fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
    custom_white_list=['layer_norm', 'softmax', 'gelu'])
optimizer = paddle.fluid.contrib.mixed_precision.decorate(
    optimizer,
    amp_list,
    init_loss_scaling=args.scale_loss,
    use_dynamic_loss_scaling=True)
Besides, maybe you should provide an enable_mkldnn interface instead of FLAGS_use_mkldnn (see #27935).
Luotao means that we had better not use global variables anymore.
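To illustrate the point, an explicit per-config switch (like the suggested enable_mkldnn) is scoped and discoverable, while a process-wide flag such as FLAGS_use_mkldnn is not. A standalone sketch of the contrast (the Config class below is a toy stand-in, not Paddle's actual inference config):

```python
import os

# Global-flag style (discouraged in the discussion above): any code anywhere
# can read or change this, and its effect is process-wide.
os.environ["FLAGS_use_mkldnn"] = "1"

# Explicit-interface style, as suggested: the switch lives on a config object
# and only affects things built from that config.
class Config:
    """Toy stand-in for an inference/training config object."""
    def __init__(self):
        self._use_mkldnn = False

    def enable_mkldnn(self):
        self._use_mkldnn = True

    def mkldnn_enabled(self) -> bool:
        return self._use_mkldnn

cfg = Config()
cfg.enable_mkldnn()
print(cfg.mkldnn_enabled())
```

The per-object form also makes it possible to run MKL-DNN-enabled and plain configurations side by side in one process, which a global flag cannot express.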
Hi @luotao1, what is the compiling option to run the recommender models? I get this error when I run train.py:
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/base/fleet_base.py", line 1192, in minimize
    self._runtime_handle = RuntimeFactory()._create_runtime(context)
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/base/runtime_factory.py", line 32, in _create_runtime
    ps_runtime = TheOnePSRuntime()
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/runtime/the_one_ps.py", line 383, in __init__
    self._worker = fluid.core.DistFleetWrapper()
AttributeError: module 'paddle.fluid.core_avx' has no attribute 'DistFleetWrapper'
My option is
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON
Update
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON
@lidanqing-intel You should use -DWITH_DISTRIBUTE=ON.
Besides, as discussed with @MrChengmo, the above models are published in https://github.com/PaddlePaddle/PaddleRec/tree/master/models
For how to run rank/dnn, please see https://github.com/PaddlePaddle/Perf/tree/master/CtrDnn
Regarding our strategy for enabling BF16 training: we focus on word2vec, with the goal of reducing memory consumption. We want to achieve that by enabling BF16 training for the most memory-consuming ops, such as lookup_table. Apart from the ops, we want the optimizer to work purely in BF16 as well. So, ideally, to reduce memory usage we will have pure BF16 training without the need to keep master parameters in FP32.
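For context, bfloat16 keeps float32's sign bit and 8-bit exponent and truncates the mantissa to 7 bits, so each value takes 2 bytes instead of 4. A minimal, self-contained sketch of that truncation (pure Python, for illustration only):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits
    (sign, 8-bit exponent, 7-bit mantissa); no rounding is applied."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

loss = 3.401  # e.g. the loss value printed in the run below
b = f32_to_bf16_bits(loss)
print(f"bf16 bits: {b:#06x}, round-trip value: {bf16_bits_to_f32(b)}")
```

Storing parameters as 16-bit values like this is where the memory saving comes from; hardware BF16 kernels perform the equivalent conversion natively, and the shared exponent width with float32 is why no loss scaling is needed, unlike FP16.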
Hi @luotao1, with the newest develop branch I cannot save models anymore. Could you please give some suggestions?
Epoch 0 Var LOSS mean_0.tmp_0 - place: CPUPlace
- shape: [1]
- layout: NCHW
- dtype: float
- data: [3.401]
2021-02-19 06:07:15,918 - INFO - Epoch: 0, using time 29.075167655944824 second, ips 35434.53342011419 word/sec.
Traceback (most recent call last):
  File "../train.py", line 245, in <module>
    benchmark_main.run()
  File "../train.py", line 65, in run
    self.run_worker()
  File "../train.py", line 125, in run_worker
    self.infer_target_var)
  File "/home/li/miniconda3/envs/myenv_python3.6/lib/python3.6/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 544, in save_inference_model
    self._runtime_handle._save_inference_model(
AttributeError: 'NoneType' object has no attribute '_save_inference_model'
Reproduction steps:
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON
make -j 12
cd 2.0benchmark/ps/static/word2vec
python -u ../train.py -c benchmark.yaml
There is a compatibility problem between single-machine training and PS distributed training. train.py is designed for distributed training, so you can use the following command to run word2vec:
cd 2.0benchmark/ps/static/word2vec
fleetrun --worker_num=1 --server_num=1 ../train.py -c benchmark.yaml
We recommend using PaddleRec to run the model. You can refer to the following links:
Hi @luotao1, could you please provide a log of a fully trained word2vec model for reference? If there is no CPU log, then a GPU log is fine. Please attach it under this issue.
Currently we are enabling BF16 grad ops.
@lidanqing-intel Please see 日志数据 (log data).
The log of fleetrun --worker_num=4 --server_num=4 ../train.py -c benchmark.yaml is "Word2Vec DataLoader 4机" (four machines). We don't have a single-machine log.
Why did we choose this log? You can see the following parameters in benchmark.yaml
@luotao1 I have a question related to the fleetrun command. I checked that when I install paddlepaddle by pip, it works fine. Unfortunately, when I built Paddle from source with the mentioned command
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON
it shows me fleetrun: command not found. Should anything be done to make the fleetrun command available?
@wozna Please check your Python install path; fleetrun is installed as a Python entry point:
https://github.com/PaddlePaddle/Paddle/blob/ffbf71359a260031f4202dd4e6bab7efebaa90da/python/setup.py.in#L542-L544
Or you can specify your own Python, like /usr/local/python -m fleetrun ...
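One quick way to check where console scripts such as fleetrun land for the current interpreter (a diagnostic sketch; paths are environment-specific):

```python
import os
import sys

# pip console scripts are normally installed next to the interpreter,
# e.g. <env>/bin/ on Linux. Check that directory for fleetrun.
bindir = os.path.dirname(sys.executable)
fleetrun_path = os.path.join(bindir, "fleetrun")

print("python bin dir:", bindir)
if os.path.exists(fleetrun_path):
    print("fleetrun found at", fleetrun_path)
else:
    print("fleetrun not found; check your PATH or the install prefix")
```

If the script is present but the shell cannot find it, adding that bin directory to PATH is usually enough.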
@luotao1 Some updates. We are going to enable BF16 training with word2vec. The first milestone is to have BF16 training enabled with master weights, using AMP (automatic mixed precision) for the lookup_table, elementwise_add, and reshape ops. This will not reduce memory consumption, but it lets us check that our BF16 functionality works correctly. After this we will go for BF16 training of word2vec without the use of FP32 master weights, i.e. data will be initialized as BF16. This requires creating initializers for BF16 data and some other changes.
@luotao1 Please note PR https://github.com/PaddlePaddle/Paddle/pull/31093: it adds initial support of BF16 to AMP (automatic mixed precision).
@luotao1, @MrChengmo We are able to run word2vec training via fleetrun, but after training finishes we can see in server_log.0 that a SIGTERM signal was sent to the process. My question is: is this behaviour expected? The log is below:
+=======================================================================================+
| PaddleRec Benchmark Envs Value |
+---------------------------------------------------------------------------------------+
| hyper_parameters.neg_num 5 |
| hyper_parameters.optimizer.decay_rate 0.999 |
| hyper_parameters.optimizer.decay_steps 100000 |
|hyper_parameters.optimizer.learning_rate 1.0 |
| hyper_parameters.sparse_feature_dim 300 |
| hyper_parameters.sparse_feature_number 354051 |
| hyper_parameters.window_size 5 |
| hyper_parameters.with_shuffle_batch False |
| static_benchmark.batch_size 100 |
| static_benchmark.dataset_debug False |
| static_benchmark.epochs 2 |
| static_benchmark.example_count_method word |
| static_benchmark.geo_step 400 |
| static_benchmark.model_path .//static_model.py |
| static_benchmark.pipe_command python .//static_reader.py |
| static_benchmark.print_period 1000 |
| static_benchmark.reader_path .//static_reader.py |
| static_benchmark.reader_type QueueDataset |
| static_benchmark.save_model_path .//model |
| static_benchmark.split_file_list False |
| static_benchmark.sync_mode async |
| static_benchmark.test_data_path .//test_data |
| static_benchmark.thread_num 1 |
| static_benchmark.train_data_path .//train_data |
| static_benchmark.use_cuda 0 |
| static_benchmark.word_count_dict_path .//dict/word_count_dict.txt |
| static_benchmark.word_id_dict_path .//dict/word_id_dict.txt |
| workspace ./ |
| yaml_path benchmark.yaml |
+=======================================================================================+
2021-03-18 09:27:14,627 - INFO - cpu_num: 1
2021-03-18 09:27:14,627 - INFO - -- Role: PSERVER --
sync_mode: async
decay_steps: 100000
Epoch 0: ExponentialDecay set learning rate to 1.0.
/home/jczaja/Paddle/build-relwithdebinfo/python/paddle/distributed/fleet/base/fleet_base.py:632: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
"It is recommended to use DistributedStrategy "
/home/jczaja/Paddle/build-relwithdebinfo/python/paddle/fluid/incubate/fleet/parameter_server/ir/public.py:1201: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
% lr_decay_steps)
2021-03-18 09:27:14,659 - WARNING - ExponentialDecay is set, staircase = True, global learning rate decay step is [ 100000 ], Change decay steps as follow:
strategy = paddle.distributed.fleet.DistributedStrategy()
strategy.a_sync = True
strategy.a_sync_configs= { 'lr_decay_steps' : YOUR_DECAY_STEP }
2021-03-18 09:27:14,659 - INFO - Run Server Begin
server:
server_param {
downpour_server_param {
service_param {server_class: "BrpcPsServer" client_class: "BrpcPsClient" service_class: "BrpcPsService" start_server_port: 0 server_thread_num: 12
}
downpour_table_param {table_id: 0 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 300
}
common {name: "sgd" table_name: "emb" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 300 dims: 1 initializers: "uniform_random&0&-0.0016666667070239782&0.0016666667070239782" initializers: "fill_constant&1.0"
}
}
downpour_table_param {table_id: 1 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 1
}
common {name: "sgd" table_name: "emb_b" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 1 dims: 1 initializers: "fill_constant&0.0" initializers: "fill_constant&1.0"
}
}
downpour_table_param {table_id: 2 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 300
}
common {name: "sgd" table_name: "emb_w" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 300 dims: 1 initializers: "fill_constant&0.0" initializers: "fill_constant&1.0"
}
}
downpour_table_param {table_id: 3 table_class: "GlobalStepTable" shard_num: 256 type: PS_OTHER_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 0 embedx_dim: 0
}
tensor {feed_var_name: "@LR_DECAY_COUNTER@" fetch_var_name: "tmp_3" startup_program_id: 0 main_program_id: 1 tensor_table_class: "GlobalStepTable"
}
common {name: "" table_name: "@LR_DECAY_COUNTER@" trainer_num: 1 sync: false
}
}
downpour_table_param {table_id: 4 table_class: "BarrierTable" shard_num: 256 type: PS_OTHER_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 0 embedx_dim: 0
}
common {name: "" table_name: "barrier_table" trainer_num: 1 sync: false
}
}
}
}
I0318 09:27:14.673756 167466 service.cc:50] Init With Gflags:
I0318 09:27:17.066890 167466 server.cpp:1037] Server[paddle::distributed::BrpcPsService] is serving on port=52681.
I0318 09:27:17.067734 167466 server.cpp:1040] Check out http://broncos-clx01.jf.intel.com:52681 in web browser.
W0318 09:27:17.070726 167466 env.h:179] ps-host :127.0.0.1:52681, rank:0 already register, ignore register
W0318 09:32:21.994891 167586 socket.cpp:1739] Fail to keep-write into fd=12 SocketId=565@127.0.0.1:55290@52681: Broken pipe [32]
W0318 09:32:21.994884 167595 input_messenger.cpp:222] Fail to read from fd=12 SocketId=565@127.0.0.1:55290@52681: Connection reset by peer [104]
W0318 09:33:30.230756 167549 input_messenger.cpp:222] Fail to read from fd=10 SocketId=454@127.0.0.1:55288@52681: Connection reset by peer [104]
W0318 09:33:30.230809 167622 socket.cpp:1739] Fail to keep-write into fd=10 SocketId=454@127.0.0.1:55288@52681: Broken pipe [32]
W0318 09:39:26.518344 167587 input_messenger.cpp:222] Fail to read from fd=11 SocketId=1017@127.0.0.1:56008@52681: Connection reset by peer [104]
W0318 09:39:26.518379 167573 socket.cpp:1739] Fail to keep-write into fd=11 SocketId=1017@127.0.0.1:56008@52681: Broken pipe [32]
W0318 09:40:34.684377 167586 socket.cpp:1739] Fail to keep-write into fd=9 SocketId=904@127.0.0.1:56004@52681: Connection reset by peer [104]
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::distributed::FleetWrapper::RunServer(std::string const&, unsigned int)
1 paddle::distributed::BrpcPsServer::start(std::string const&, unsigned int)
2 paddle::framework::SignalHandle(char const*, int)
3 paddle::platform::GetCurrentTraceBackString()
----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1616085691 (unix time) try "date -d @1616085691" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0xada00900028dd4) received by PID 167466 (TID 0x7f9cb47da740) from PID 167380 ***]
I have a question related to initializers. Is it possible to initialize data in the FP16 data type in AMP FP16, or is data always created in FP32?
With recent changes to initializers, SGD, and some operations in forward and backward passes (already in the develop/release 2.1 branches), you can use pure BF16 mode. It allows converting a model's operations, tensors, and parameters to BF16.
Pure mode is part of the AMP concept used in the paddle.static.amp.bf16 module for mixed-precision training. We followed that concept, keeping the changes as close to the AMP API as possible while enabling BF16 usage. Pure mode by default enables all BF16 ops registered in Paddle; for operations not implemented in BF16, it uses the float version of the op with casts inserted where needed.
We focused on enabling the word2vec model with a local run, without fleet. We are able to run BF16 word2vec training for a number of iterations, observe the loss decreasing during training, and see less memory used.
Essentially there are two places for code changes in the model: decoration of the optimizer, and a call to amp_init after tensors are initialized. Example model changes needed to use BF16 pure mode in training:
Next steps:
Please note, ON_INFER and WITH_DISTRIBUTE should not be turned on at the same time when compiling.
Development continues; this will not ship with the current release.
Paddle's distributed parameter-server training currently mainly targets recommendation scenarios with large data volumes and shallow models, typically trained on tens to hundreds of high-performance CPU servers.
Recommendation scenarios generally use embedding layers to represent user features, at scales from tens of millions to hundreds of billions of parameters. This causes drastic memory consumption on the PServer side (possibly tens of TB of memory), and the fetch/update speed of sparse parameters on the PServer side is also one of the bottlenecks of the whole training.
We hope to use BF16 to reduce PServer-side memory usage, and also to use BF16 to speed up the fetch/update of sparse parameters on the PServer side.
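As a rough back-of-the-envelope illustration of the expected saving (the vocabulary size and embedding dimension below are made-up round numbers, not taken from any specific model):

```python
# Hypothetical PServer embedding table: 1e10 rows x 300 dims, chosen to
# illustrate the tens-of-TB scale mentioned above.
rows, dim = 10**10, 300
fp32_bytes = rows * dim * 4   # float32: 4 bytes per value
bf16_bytes = rows * dim * 2   # bfloat16: 2 bytes per value

tib = 1024**4
print(f"fp32: {fp32_bytes / tib:.1f} TiB, bf16: {bf16_bytes / tib:.1f} TiB")
```

Halving the storage also halves the bytes moved per sparse parameter fetch/update, which addresses the second bottleneck mentioned above as well.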