Closed seiriosPlus closed 3 years ago
Hi! We've received your issue; please be patient while we arrange for engineers to answer it as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also consult the official API documentation, FAQ, historical issues, and the AI community for an answer. Have a nice day!
We don't have a BF16 optimizer or training interface. But you can refer to the static FP16 training interface: https://github.com/PaddlePaddle/models/blob/release/2.0-beta/PaddleNLP/benchmark/bert/run_pretrain_single.py#L241
amp_list = paddle.fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
    custom_white_list=['layer_norm', 'softmax', 'gelu'])
optimizer = paddle.fluid.contrib.mixed_precision.decorate(
    optimizer,
    amp_list,
    init_loss_scaling=args.scale_loss,
    use_dynamic_loss_scaling=True)
Besides, maybe you should provide an enable_mkldnn interface instead of FLAGS_use_mkldnn (see #27935).
Luotao means that we had better not use global variables anymore.
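To illustrate the point, an explicit per-config switch (like the suggested enable_mkldnn) is scoped and discoverable, while a process-wide flag such as FLAGS_use_mkldnn is not. A standalone sketch of the contrast (the Config class below is a toy stand-in, not Paddle's actual inference config):

```python
import os

# Global-flag style (discouraged in the discussion above): any code anywhere
# can read or change this, and its effect is process-wide.
os.environ["FLAGS_use_mkldnn"] = "1"

# Explicit-interface style, as suggested: the switch lives on a config object
# and only affects things built from that config.
class Config:
    """Toy stand-in for an inference/training config object."""
    def __init__(self):
        self._use_mkldnn = False

    def enable_mkldnn(self):
        self._use_mkldnn = True

    def mkldnn_enabled(self) -> bool:
        return self._use_mkldnn

cfg = Config()
cfg.enable_mkldnn()
print(cfg.mkldnn_enabled())
```

The per-object form also makes it possible to run MKL-DNN-enabled and plain configurations side by side in one process, which a global flag cannot express.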
Hi @luotao1, what is the compiling option to run the recommender models? I get this error when I run train.py:
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/base/fleet_base.py", line 1192, in minimize
    self._runtime_handle = RuntimeFactory()._create_runtime(context)
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/base/runtime_factory.py", line 32, in _create_runtime
    ps_runtime = TheOnePSRuntime()
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/runtime/the_one_ps.py", line 383, in __init__
    self._worker = fluid.core.DistFleetWrapper()
AttributeError: module 'paddle.fluid.core_avx' has no attribute 'DistFleetWrapper'
My option is
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON
Update
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON
@lidanqing-intel You should use -DWITH_DISTRIBUTE=ON.
Besides, as discussed with @MrChengmo, the above models are published in https://github.com/PaddlePaddle/PaddleRec/tree/master/models
For how to run rank/dnn, please see https://github.com/PaddlePaddle/Perf/tree/master/CtrDnn
Regarding our strategy for enabling BF16 training: we focus on word2vec, with the goal of reducing memory consumption. We want to achieve that by enabling BF16 training for the most memory-consuming ops, such as lookup_table. Apart from the ops, we want the optimizer to work purely in BF16 as well. So, ideally, to reduce memory usage we will have pure BF16 training without the need to keep master parameters in FP32.
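For context, bfloat16 keeps float32's sign bit and 8-bit exponent and truncates the mantissa to 7 bits, so each value takes 2 bytes instead of 4. A minimal, self-contained sketch of that truncation (pure Python, for illustration only):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits
    (sign, 8-bit exponent, 7-bit mantissa); no rounding is applied."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

loss = 3.401  # e.g. the loss value printed in the run below
b = f32_to_bf16_bits(loss)
print(f"bf16 bits: {b:#06x}, round-trip value: {bf16_bits_to_f32(b)}")
```

Storing parameters as 16-bit values like this is where the memory saving comes from; hardware BF16 kernels perform the equivalent conversion natively, and the shared exponent width with float32 is why no loss scaling is needed, unlike FP16.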
Hi @luotao1, with the newest develop branch I cannot save models anymore. Could you please give some suggestions?
Epoch 0 Var LOSS mean_0.tmp_0 - place: CPUPlace
- shape: [1]
- layout: NCHW
- dtype: float
- data: [3.401]
2021-02-19 06:07:15,918 - INFO - Epoch: 0, using time 29.075167655944824 second, ips 35434.53342011419 word/sec.
Traceback (most recent call last):
  File "../train.py", line 245, in <module>
    benchmark_main.run()
  File "../train.py", line 65, in run
    self.run_worker()
  File "../train.py", line 125, in run_worker
    self.infer_target_var)
  File "/home/li/miniconda3/envs/myenv_python3.6/lib/python3.6/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 544, in save_inference_model
    self._runtime_handle._save_inference_model(
AttributeError: 'NoneType' object has no attribute '_save_inference_model'
Reproduction steps:
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON
make -j 12
cd 2.0benchmark/ps/static/word2vec
python -u ../train.py -c benchmark.yaml
There is a compatibility problem between single-machine training and PS distributed training. train.py is designed for distributed training, so you can use the following command to run word2vec:
cd 2.0benchmark/ps/static/word2vec
fleetrun --worker_num=1 --server_num=1 ../train.py -c benchmark.yaml
We recommend using PaddleRec to run the model. You can refer to the following links:
Hi @luotao1, could you please provide a log of a fully trained word2vec model for reference? If there is no CPU log, then a GPU log is fine. Please attach it under this issue.
Currently we are enabling BF16 grad ops.
@lidanqing-intel Please see 日志数据 (log data).
The log of fleetrun --worker_num=4 --server_num=4 ../train.py -c benchmark.yaml is "Word2Vec DataLoader 4机" (four machines). We don't have a single-machine log.
Why did we choose this log? You can see the following parameters in benchmark.yaml
@luotao1 I have a question related to the fleetrun command. I checked that when I install paddlepaddle by pip, it works fine. Unfortunately, when I built Paddle from source with the mentioned command
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON
it shows me fleetrun: command not found. Should anything be done to make the fleetrun command available?
@wozna Please check your Python install path; fleetrun is installed as a Python entry point:
https://github.com/PaddlePaddle/Paddle/blob/ffbf71359a260031f4202dd4e6bab7efebaa90da/python/setup.py.in#L542-L544
Or you can specify your own Python, like /usr/local/python -m fleetrun ...
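One quick way to check where console scripts such as fleetrun land for the current interpreter (a diagnostic sketch; paths are environment-specific):

```python
import os
import sys

# pip console scripts are normally installed next to the interpreter,
# e.g. <env>/bin/ on Linux. Check that directory for fleetrun.
bindir = os.path.dirname(sys.executable)
fleetrun_path = os.path.join(bindir, "fleetrun")

print("python bin dir:", bindir)
if os.path.exists(fleetrun_path):
    print("fleetrun found at", fleetrun_path)
else:
    print("fleetrun not found; check your PATH or the install prefix")
```

If the script is present but the shell cannot find it, adding that bin directory to PATH is usually enough.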
@luotao1 Some updates. We are going to enable BF16 training with word2vec. The first milestone is to have BF16 training enabled with master weights, using AMP (automatic mixed precision) for the lookup_table, elementwise_add, and reshape ops. This will not reduce memory consumption, but it lets us check that our BF16 functionality works correctly. After this we will go for BF16 training of word2vec without the use of FP32 master weights, i.e. data will be initialized as BF16. This requires creating initializers for BF16 data and some other changes.
@luotao1 Please note PR https://github.com/PaddlePaddle/Paddle/pull/31093: it adds initial support of BF16 to AMP (automatic mixed precision).
@luotao1, @MrChengmo We are able to run word2vec training via fleetrun, but after training finishes we can see in server_log.0 that a SIGTERM signal was sent to the process. My question is: is this behaviour expected? The log is below:
+=======================================================================================+
| PaddleRec Benchmark Envs Value |
+---------------------------------------------------------------------------------------+
| hyper_parameters.neg_num 5 |
| hyper_parameters.optimizer.decay_rate 0.999 |
| hyper_parameters.optimizer.decay_steps 100000 |
|hyper_parameters.optimizer.learning_rate 1.0 |
| hyper_parameters.sparse_feature_dim 300 |
| hyper_parameters.sparse_feature_number 354051 |
| hyper_parameters.window_size 5 |
| hyper_parameters.with_shuffle_batch False |
| static_benchmark.batch_size 100 |
| static_benchmark.dataset_debug False |
| static_benchmark.epochs 2 |
| static_benchmark.example_count_method word |
| static_benchmark.geo_step 400 |
| static_benchmark.model_path .//static_model.py |
| static_benchmark.pipe_command python .//static_reader.py |
| static_benchmark.print_period 1000 |
| static_benchmark.reader_path .//static_reader.py |
| static_benchmark.reader_type QueueDataset |
| static_benchmark.save_model_path .//model |
| static_benchmark.split_file_list False |
| static_benchmark.sync_mode async |
| static_benchmark.test_data_path .//test_data |
| static_benchmark.thread_num 1 |
| static_benchmark.train_data_path .//train_data |
| static_benchmark.use_cuda 0 |
| static_benchmark.word_count_dict_path .//dict/word_count_dict.txt |
| static_benchmark.word_id_dict_path .//dict/word_id_dict.txt |
| workspace ./ |
| yaml_path benchmark.yaml |
+=======================================================================================+
2021-03-18 09:27:14,627 - INFO - cpu_num: 1
2021-03-18 09:27:14,627 - INFO - -- Role: PSERVER --
sync_mode: async
decay_steps: 100000
Epoch 0: ExponentialDecay set learning rate to 1.0.
/home/jczaja/Paddle/build-relwithdebinfo/python/paddle/distributed/fleet/base/fleet_base.py:632: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
"It is recommended to use DistributedStrategy "
/home/jczaja/Paddle/build-relwithdebinfo/python/paddle/fluid/incubate/fleet/parameter_server/ir/public.py:1201: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
% lr_decay_steps)
2021-03-18 09:27:14,659 - WARNING - ExponentialDecay is set, staircase = True, global learning rate decay step is [ 100000 ], Change decay steps as follow:
strategy = paddle.distributed.fleet.DistributedStrategy()
strategy.a_sync = True
strategy.a_sync_configs= { 'lr_decay_steps' : YOUR_DECAY_STEP }
2021-03-18 09:27:14,659 - INFO - Run Server Begin
server:
server_param {
downpour_server_param {
service_param {server_class: "BrpcPsServer" client_class: "BrpcPsClient" service_class: "BrpcPsService" start_server_port: 0 server_thread_num: 12
}
downpour_table_param {table_id: 0 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 300
}
common {name: "sgd" table_name: "emb" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 300 dims: 1 initializers: "uniform_random&0&-0.0016666667070239782&0.0016666667070239782" initializers: "fill_constant&1.0"
}
}
downpour_table_param {table_id: 1 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 1
}
common {name: "sgd" table_name: "emb_b" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 1 dims: 1 initializers: "fill_constant&0.0" initializers: "fill_constant&1.0"
}
}
downpour_table_param {table_id: 2 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 300
}
common {name: "sgd" table_name: "emb_w" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 300 dims: 1 initializers: "fill_constant&0.0" initializers: "fill_constant&1.0"
}
}
downpour_table_param {table_id: 3 table_class: "GlobalStepTable" shard_num: 256 type: PS_OTHER_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 0 embedx_dim: 0
}
tensor {feed_var_name: "@LR_DECAY_COUNTER@" fetch_var_name: "tmp_3" startup_program_id: 0 main_program_id: 1 tensor_table_class: "GlobalStepTable"
}
common {name: "" table_name: "@LR_DECAY_COUNTER@" trainer_num: 1 sync: false
}
}
downpour_table_param {table_id: 4 table_class: "BarrierTable" shard_num: 256 type: PS_OTHER_TABLE
accessor {accessor_class: "CommMergeAccessor" fea_dim: 0 embedx_dim: 0
}
common {name: "" table_name: "barrier_table" trainer_num: 1 sync: false
}
}
}
}
I0318 09:27:14.673756 167466 service.cc:50] Init With Gflags:
I0318 09:27:17.066890 167466 server.cpp:1037] Server[paddle::distributed::BrpcPsService] is serving on port=52681.
I0318 09:27:17.067734 167466 server.cpp:1040] Check out http://broncos-clx01.jf.intel.com:52681 in web browser.
W0318 09:27:17.070726 167466 env.h:179] ps-host :127.0.0.1:52681, rank:0 already register, ignore register
W0318 09:32:21.994891 167586 socket.cpp:1739] Fail to keep-write into fd=12 SocketId=565@127.0.0.1:55290@52681: Broken pipe [32]
W0318 09:32:21.994884 167595 input_messenger.cpp:222] Fail to read from fd=12 SocketId=565@127.0.0.1:55290@52681: Connection reset by peer [104]
W0318 09:33:30.230756 167549 input_messenger.cpp:222] Fail to read from fd=10 SocketId=454@127.0.0.1:55288@52681: Connection reset by peer [104]
W0318 09:33:30.230809 167622 socket.cpp:1739] Fail to keep-write into fd=10 SocketId=454@127.0.0.1:55288@52681: Broken pipe [32]
W0318 09:39:26.518344 167587 input_messenger.cpp:222] Fail to read from fd=11 SocketId=1017@127.0.0.1:56008@52681: Connection reset by peer [104]
W0318 09:39:26.518379 167573 socket.cpp:1739] Fail to keep-write into fd=11 SocketId=1017@127.0.0.1:56008@52681: Broken pipe [32]
W0318 09:40:34.684377 167586 socket.cpp:1739] Fail to keep-write into fd=9 SocketId=904@127.0.0.1:56004@52681: Connection reset by peer [104]
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::distributed::FleetWrapper::RunServer(std::string const&, unsigned int)
1 paddle::distributed::BrpcPsServer::start(std::string const&, unsigned int)
2 paddle::framework::SignalHandle(char const*, int)
3 paddle::platform::GetCurrentTraceBackString()
----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1616085691 (unix time) try "date -d @1616085691" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0xada00900028dd4) received by PID 167466 (TID 0x7f9cb47da740) from PID 167380 ***]
I have a question related to initializers. Is it possible to initialize data in the FP16 data type in AMP FP16, or is data always created in FP32?
With recent changes to initializers, SGD, and some operations in forward and backward passes (already in the develop/release 2.1 branches), you can use pure BF16 mode. It allows converting a model's operations, tensors, and parameters to BF16.
Pure mode is part of the AMP concept used in the paddle.static.amp.bf16 module for mixed-precision training. We followed that concept, keeping the changes as close to the AMP API as possible while enabling BF16 usage. Pure mode by default enables all BF16 ops registered in Paddle; for operations not implemented in BF16, it uses the float version of the op with casts inserted where needed.
We focused on enabling the word2vec model with a local run, without fleet. We are able to run BF16 word2vec training for a number of iterations, observe the loss decreasing during training, and see less memory used.
Essentially there are two places for code changes in the model: decoration of the optimizer, and a call to amp_init after tensors are initialized. Example model changes needed to use BF16 pure mode in training:
Next steps:
Please note, ON_INFER and WITH_DISTRIBUTE should not be turned on at the same time when compiling.
Development continues; this will not ship with the current release.
Paddle's distributed parameter-server training currently mainly targets recommendation scenarios with large data volumes and shallow models, typically trained on tens to hundreds of high-performance CPU servers.
Recommendation scenarios generally use embedding layers to represent user features, at scales from tens of millions to hundreds of billions of parameters. This causes drastic memory consumption on the PServer side (possibly tens of TB of memory), and the fetch/update speed of sparse parameters on the PServer side is also one of the bottlenecks of the whole training.
We hope to use BF16 to reduce PServer-side memory usage, and also to use BF16 to speed up the fetch/update of sparse parameters on the PServer side.
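As a rough back-of-the-envelope illustration of the expected saving (the vocabulary size and embedding dimension below are made-up round numbers, not taken from any specific model):

```python
# Hypothetical PServer embedding table: 1e10 rows x 300 dims, chosen to
# illustrate the tens-of-TB scale mentioned above.
rows, dim = 10**10, 300
fp32_bytes = rows * dim * 4   # float32: 4 bytes per value
bf16_bytes = rows * dim * 2   # bfloat16: 2 bytes per value

tib = 1024**4
print(f"fp32: {fp32_bytes / tib:.1f} TiB, bf16: {bf16_bytes / tib:.1f} TiB")
```

Halving the storage also halves the bytes moved per sparse parameter fetch/update, which addresses the second bottleneck mentioned above as well.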