Model training fails with: paddle.fluid.core_noavx has no attribute 'c_broadcast'

Macxy2018 commented 1 year ago

Please ask your question

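Environment: the hardware is an ARMv8 CPU machine (no GPU). Everything runs in a container based on the ubuntu18.04 image, with Python 3.7.13. A virtual environment was created inside the container and the pip dependencies were installed as required. The installed Paddle version is 2.3.

Symptom: the build completed without errors, installation worked, and single-core training works. The error appears when using the distributed training command from PaddleNLP.

Python distributed training command: python -m paddle.distributed.launch --nproc_per_node=2 --backend='gloo' xxxx.py

Error message: paddle.fluid.core_noavx has no attribute 'c_broadcast'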
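To check this symptom outside the training script, one option is to probe the compiled core module directly. The sketch below is a generic helper, not part of Paddle. The module path `paddle.fluid.core` and the attribute name `c_broadcast` are taken from the error message. The helper name `probe_attributes` is made up for illustration, and whether the attribute is expected at that module level is an assumption.

```python
import importlib


def probe_attributes(module_name, attr_names):
    """Report which attributes a module exposes.

    Returns None if the module cannot be imported at all, otherwise a
    dict mapping each attribute name to True or False.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # A broken native build may fail with errors other than ImportError.
        return None
    return {name: hasattr(module, name) for name in attr_names}


if __name__ == "__main__":
    # Names copied from the error message; adjust to the build being checked.
    result = probe_attributes("paddle.fluid.core", ["c_broadcast"])
    if result is None:
        print("paddle.fluid.core could not be imported")
    else:
        for name, present in result.items():
            print(f"{name}: {'present' if present else 'missing'}")
```

Running the same probe against a single-process install and the distributed build would show whether the two wheels differ in what they expose.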

The Paddle build commands used were as follows:

git clone https://github.com/PaddlePaddle/Paddle.git

cd Paddle

git checkout release/2.3

mkdir build && cd build

ulimit -n 4096

export PADDLE_VERSION=2.3.0

cmake .. -DPY_VERSION=3.7.13 -DPYTHON_EXECUTABLE=`which python3` -DWITH_ARM=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_XBYAK=OFF -DPYTHON_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") -DPYTHON_LIBRARY=$(python3 -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DWITH_GLOO=ON

make TARGET=ARMV8 -j$(nproc)

paddle-bot[bot] commented 1 year ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You may also check the official API documentation, the FAQ, past GitHub issues, and the AI community for an answer. Have a nice day!

zoooo0820 commented 1 year ago

Hello, is this the task you are running? Does the error still occur with a newer Paddle version?

Macxy2018 commented 1 year ago

Hello, is this the task you are running? Does the error still occur with a newer Paddle version?

Yes, this is the task: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/information_extraction/text. The new Paddle 2.4 is still compiling, so for now I am using version 2.3 built for CPU on ARMv8. The training launch command is python3 -m paddle.distributed.launch --nproc_per_node=8 --backend='gloo' finetune.py

When building version 2.3 I added -DWITH_DISTRIBUTE=ON. The full cmake command is as follows: cmake .. -DPY_VERSION=3.7.13 -DPYTHON_EXECUTABLE=`which python3` -DWITH_ARM=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_XBYAK=OFF -DPYTHON_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") -DPYTHON_LIBRARY=$(python3 -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DWITH_GLOO=ON
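The command as pasted does not show the WITH_DISTRIBUTE option that the comment mentions, so it may be worth checking which options actually reach cmake. The following is a minimal sketch, not a Paddle tool: the helper name `cmake_defines` is hypothetical, and the command string is an abbreviated copy of the one quoted in this thread (the Python path options are left out for brevity).

```python
import shlex


def cmake_defines(command):
    """Return the -DNAME=VALUE options found in a cmake command line."""
    defines = {}
    for token in shlex.split(command):
        if token.startswith("-D") and "=" in token:
            name, value = token[2:].split("=", 1)
            defines[name] = value
    return defines


# Abbreviated copy of the cmake command quoted in the thread.
command = (
    "cmake .. -DWITH_ARM=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release "
    "-DON_INFER=ON -DWITH_XBYAK=OFF -DWITH_GLOO=ON"
)

defines = cmake_defines(command)
for flag in ("WITH_GLOO", "WITH_DISTRIBUTE"):
    print(flag, "=", defines.get(flag, "<not set on the command line>"))
```

An option that is missing from the command line falls back to whatever default the project's CMake files define, which this sketch does not inspect.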

It now starts running but then fails with the following error: ERROR 2023-02-27 16:07:13,704 launch_utils.py:642] ABORT!!! Out of all 8 trainers, the trainer process with rank=[1, 6, 7] was aborted. Please check its log.

zoooo0820 commented 1 year ago

@Macxy2018 Could you please check the log files of the workers with ids 1, 6, and 7 to see the specific cause of the error?
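Collecting the tail of each aborted worker's log can be scripted. The sketch below assumes the launcher wrote per-worker files named `workerlog.<rank>` into a `log` directory under the working directory. That layout is an assumption to verify against the actual run, and `tail_worker_logs` is a hypothetical helper name.

```python
from pathlib import Path


def tail_worker_logs(log_dir, ranks, num_lines=20):
    """Return the last num_lines lines of each requested worker log.

    Assumes files are named workerlog.<rank> inside log_dir.
    Ranks whose log file does not exist map to None.
    """
    tails = {}
    for rank in ranks:
        path = Path(log_dir) / f"workerlog.{rank}"
        if not path.is_file():
            tails[rank] = None
            continue
        lines = path.read_text(errors="replace").splitlines()
        tails[rank] = lines[-num_lines:]
    return tails


if __name__ == "__main__":
    # Ranks taken from the ABORT message in this thread.
    for rank, lines in tail_worker_logs("log", [1, 6, 7]).items():
        print(f"--- worker {rank} ---")
        print("log file not found" if lines is None else "\n".join(lines))
```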

Macxy2018 commented 1 year ago

Overall there is no error message. The logs for workers 6 and 7 are the same and show no error either. The worker 1 log is below (entries are timestamped from 2023-02-27 10:48:44 to 10:49:00), and after that it just stops:

/venv/paddle/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[INFO] - The default value for the training argument --report_to will change in v5 (from all installed integrations to none). In v5, you will need to use --report_to all to get the same behavior as now. You should start updating your code and make this info disappear :-).
[INFO] - ============================================================
[INFO] - Model Configuration Arguments
[INFO] - paddle commit id :a5875319fe3bdd359895f1f6a11faf21df886f88
[INFO] - export_model_dir :./checkpoint_base_1/model_best
[INFO] - model_name_or_path :uie-base
[INFO] - multilingual :False
[INFO] -
[INFO] - ============================================================
[INFO] - Data Configuration Arguments
[INFO] - paddle commit id :a5875319fe3bdd359895f1f6a11faf21df886f88
[INFO] - dev_path :data/dev.txt
[INFO] - max_seq_length :512
[INFO] - train_path :data/train.txt
[INFO] -
[WARNING] - Process rank: 1, device: cpu, world_size: 8, distributed training: True, 16-bits training: False
[INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-base'.
[INFO] - Already cached /root/.paddlenlp/models/uie-base/ernie_3.0_base_zh_vocab.txt
[INFO] - tokenizer config file saved in /root/.paddlenlp/models/uie-base/tokenizer_config.json
[INFO] - Special tokens file saved in /root/.paddlenlp/models/uie-base/special_tokens_map.json
[INFO] - Model config ErnieConfig {
  "attention_probs_dropout_prob": 0.1,
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
[INFO] - All model checkpoint weights were used when initializing UIE.
[INFO] - All the weights of UIE were initialized from the model checkpoint at uie-base. If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[INFO] - ============================================================
[INFO] - Training Configuration Arguments
[INFO] - paddle commit id :a5875319fe3bdd359895f1f6a11faf21df886f88
[INFO] - _no_sync_in_gradient_accumulation:True
[INFO] - activation_quantize_type :None
[INFO] - adam_beta1 :0.9
[INFO] - adam_beta2 :0.999
[INFO] - adam_epsilon :1e-08
[INFO] - algo_list :None
[INFO] - batch_num_list :None
[INFO] - batch_size_list :None
[INFO] - bf16 :False
[INFO] - bf16_full_eval :False
[INFO] - bias_correction :False
[INFO] - current_device :cpu
[INFO] - dataloader_drop_last :False
[INFO] - dataloader_num_workers :0
[INFO] - device :cpu
[INFO] - disable_tqdm :True
[INFO] - do_compress :False
[INFO] - do_eval :True
[INFO] - do_export :True
[INFO] - do_predict :False
[INFO] - do_train :True
[INFO] - eval_batch_size :8
[INFO] - eval_steps :100
[INFO] - evaluation_strategy :IntervalStrategy.STEPS
[INFO] - flatten_param_grads :False
[INFO] - fp16 :False
[INFO] - fp16_full_eval :False
[INFO] - fp16_opt_level :O1
[INFO] - gradient_accumulation_steps :1
[INFO] - greater_is_better :True
[INFO] - ignore_data_skip :False
[INFO] - input_dtype :int64
[INFO] - input_infer_model_path :None
[INFO] - label_names :['start_positions', 'end_positions']
[INFO] - learning_rate :1e-05
[INFO] - load_best_model_at_end :True
[INFO] - local_process_index :1
[INFO] - local_rank :1
[INFO] - log_level :-1
[INFO] - log_level_replica :-1
[INFO] - log_on_each_node :True
[INFO] - logging_dir :./checkpoint_base_1/model_best/runs/Feb27_10-48-44_b11d0c49d963
[INFO] - logging_first_step :False
[INFO] - logging_steps :10
[INFO] - logging_strategy :IntervalStrategy.STEPS
[INFO] - lr_scheduler_type :SchedulerType.LINEAR
[INFO] - max_grad_norm :1.0
[INFO] - max_steps :-1
[INFO] - metric_for_best_model :eval_f1
[INFO] - minimum_eval_times :None
[INFO] - moving_rate :0.9
[INFO] - no_cuda :False
[INFO] - num_train_epochs :100.0
[INFO] - onnx_format :True
[INFO] - optim :OptimizerNames.ADAMW
[INFO] - output_dir :./checkpoint_base_1/model_best
[INFO] - overwrite_output_dir :True
[INFO] - past_index :-1
[INFO] - per_device_eval_batch_size :8
[INFO] - per_device_train_batch_size :8
[INFO] - prediction_loss_only :False
[INFO] - process_index :1
[INFO] - prune_embeddings :False
[INFO] - recompute :False
[INFO] - remove_unused_columns :True
[INFO] - report_to :['visualdl']
[INFO] - resume_from_checkpoint :None
[INFO] - round_type :round
[INFO] - run_name :./checkpoint_base_1/model_best
[INFO] - save_on_each_node :False
[INFO] - save_steps :100
[INFO] - save_strategy :IntervalStrategy.STEPS
[INFO] - save_total_limit :None
[INFO] - scale_loss :32768
[INFO] - seed :1000
[INFO] - sharding :[]
[INFO] - sharding_degree :-1
[INFO] - should_log :False
[INFO] - should_save :False
[INFO] - skip_memory_metrics :True
[INFO] - strategy :dynabert+ptq
[INFO] - train_batch_size :8
[INFO] - use_pact :True
[INFO] - warmup_ratio :0.1
[INFO] - warmup_steps :0
[INFO] - weight_decay :0.0
[INFO] - weight_quantize_type :channel_wise_abs_max
[INFO] - width_mult_list :None
[INFO] - world_size :8
[INFO] -
[INFO] - Running training
[INFO] - Num examples = 570
[INFO] - Num Epochs = 100
[INFO] - Instantaneous batch size per device = 8
[INFO] - Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO] - Gradient Accumulation steps = 1
[INFO] - Total optimization steps = 900.0
[INFO] - Total num train samples = 57000.0
[INFO] - Number of trainable parameters = 117946370
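The batch and step counts at the end of the worker 1 log can be cross-checked with a little arithmetic, which suggests the trainer set itself up consistently before stopping. The sketch below uses only values printed in that log. The rounding rule (the last partial batch of each epoch is kept) is an assumption, chosen because `dataloader_drop_last` is logged as False and because it reproduces the logged total of 900 steps.

```python
import math

num_examples = 570        # "Num examples" in the log
per_device_batch = 8      # per_device_train_batch_size
world_size = 8            # number of trainer processes
grad_accum_steps = 1      # gradient_accumulation_steps
num_epochs = 100          # num_train_epochs

# Effective batch across all processes and accumulation steps.
total_batch = per_device_batch * world_size * grad_accum_steps

# Assumption: the final partial batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_examples / total_batch)

total_steps = steps_per_epoch * num_epochs
total_samples = num_examples * num_epochs

print(total_batch, steps_per_epoch, total_steps, total_samples)
# prints: 64 9 900 57000
```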

zoooo0820 commented 1 year ago

Nothing in this output points to the problem yet. Does the error still occur with a package built from version 2.4?

Macxy2018 commented 1 year ago

Nothing in this output points to the problem yet. Does the error still occur with a package built from version 2.4?

After building version 2.4, importing paddle raises this error:

(paddle) root@72ce093614d5:~/Setups/PaddleNLP/applications/information_extraction/text# python3
Python 3.7.13 (default, Feb 24 2023, 16:21:25)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle
Error: Can not import paddle core while this file exists: /venv/paddle/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
Traceback (most recent call last):
  File "/venv/paddle/lib/python3.7/site-packages/paddle/fluid/core.py", line 274, in <module>
    from . import libpaddle
ImportError: /venv/paddle/lib/python3.7/site-packages/paddle/fluid/libpaddle.so: undefined symbol: shm_unlink

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv/paddle/lib/python3.7/site-packages/paddle/__init__.py", line 25, in <module>
    from .framework import monkey_patch_variable
  File "/venv/paddle/lib/python3.7/site-packages/paddle/framework/__init__.py", line 17, in <module>
    from . import random # noqa: F401
  File "/venv/paddle/lib/python3.7/site-packages/paddle/framework/random.py", line 16, in <module>
    import paddle.fluid as fluid
  File "/venv/paddle/lib/python3.7/site-packages/paddle/fluid/__init__.py", line 36, in <module>
    from . import framework
  File "/venv/paddle/lib/python3.7/site-packages/paddle/fluid/framework.py", line 37, in <module>
    from . import core
  File "/venv/paddle/lib/python3.7/site-packages/paddle/fluid/core.py", line 333, in <module>
    if not avx_supported() and libpaddle.is_compiled_with_avx():
NameError: name 'libpaddle' is not defined
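One possible reading of `undefined symbol: shm_unlink` is that `libpaddle.so` was not linked against the library that provides that function. On glibc versions before 2.34 the POSIX shared-memory functions live in `librt` rather than `libc`. Whether that is the cause here is an assumption that the thread does not confirm. The sketch below only reports which of the system libraries can resolve a symbol. `symbol_providers` is a hypothetical helper name, and the result depends on the platform it runs on.

```python
import ctypes
import ctypes.util


def symbol_providers(symbol, lib_names=("c", "rt")):
    """For each library short name, report whether `symbol` resolves in it.

    Values are True or False when the library loads, and None when the
    library cannot be located or loaded on this platform.
    """
    report = {}
    for name in lib_names:
        path = ctypes.util.find_library(name)
        if path is None:
            report[name] = None
            continue
        try:
            lib = ctypes.CDLL(path)
        except OSError:
            report[name] = None
            continue
        # Attribute lookup on a CDLL object performs the symbol lookup.
        report[name] = hasattr(lib, symbol)
    return report


if __name__ == "__main__":
    print(symbol_providers("shm_unlink"))
```

If the symbol resolves only through `rt` inside the container, the link options used when building the shared library would be the next thing to inspect.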

PaddlePaddle / Paddle

Model training fails with: paddle.fluid.core_noavx has no attribute 'c_broadcast' #50949
