PaddlePaddle / PaddleClas

A treasure chest for visual classification and recognition powered by PaddlePaddle
Apache License 2.0

linux training error :( #2767

Closed happybear1015 closed 1 year ago

happybear1015 commented 1 year ago

(paddle_gpu) uvtec@uvtec-MS-7B98:/media/uvtec/9AF4F5A0F4F57EB7/PaddleClas$ python -m paddle.distributed.launch --gpus="0,1" tools/train.py -c ./ppcls/configs/ImageNet/ResNet/ResNet101_vd.yaml
LAUNCH INFO 2023-04-24 09:09:32,830 ----------- Configuration ----------------------
LAUNCH INFO 2023-04-24 09:09:32,831 devices: 0,1
LAUNCH INFO 2023-04-24 09:09:32,831 elastic_level: -1
LAUNCH INFO 2023-04-24 09:09:32,831 elastic_timeout: 30
LAUNCH INFO 2023-04-24 09:09:32,831 gloo_port: 6767
LAUNCH INFO 2023-04-24 09:09:32,831 host: None
LAUNCH INFO 2023-04-24 09:09:32,831 ips: None
LAUNCH INFO 2023-04-24 09:09:32,831 job_id: default
LAUNCH INFO 2023-04-24 09:09:32,831 legacy: False
LAUNCH INFO 2023-04-24 09:09:32,831 log_dir: log
LAUNCH INFO 2023-04-24 09:09:32,831 log_level: INFO
LAUNCH INFO 2023-04-24 09:09:32,831 master: None
LAUNCH INFO 2023-04-24 09:09:32,831 max_restart: 3
LAUNCH INFO 2023-04-24 09:09:32,831 nnodes: 1
LAUNCH INFO 2023-04-24 09:09:32,831 nproc_per_node: None
LAUNCH INFO 2023-04-24 09:09:32,831 rank: -1
LAUNCH INFO 2023-04-24 09:09:32,831 run_mode: collective
LAUNCH INFO 2023-04-24 09:09:32,831 server_num: None
LAUNCH INFO 2023-04-24 09:09:32,831 servers:
LAUNCH INFO 2023-04-24 09:09:32,831 start_port: 6070
LAUNCH INFO 2023-04-24 09:09:32,831 trainer_num: None
LAUNCH INFO 2023-04-24 09:09:32,831 trainers:
LAUNCH INFO 2023-04-24 09:09:32,831 training_script: tools/train.py
LAUNCH INFO 2023-04-24 09:09:32,831 training_script_args: ['-c', './ppcls/configs/ImageNet/ResNet/ResNet101_vd.yaml']
LAUNCH INFO 2023-04-24 09:09:32,831 with_gloo: 1
LAUNCH INFO 2023-04-24 09:09:32,831 --------------------------------------------------
LAUNCH INFO 2023-04-24 09:09:32,831 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-04-24 09:09:32,834 Run Pod: gnhppr, replicas 2, status ready
LAUNCH INFO 2023-04-24 09:09:32,842 Watching Pod: gnhppr, replicas 2, status running
[2023/04/24 09:09:38] ppcls INFO:

== PaddleClas is powered by PaddlePaddle ! ==

== For more info please go to the following website. ==
== https://github.com/PaddlePaddle/PaddleClas ==

[2023/04/24 09:09:38] ppcls INFO: Arch : [2023/04/24 09:09:38] ppcls INFO: class_num : 2 [2023/04/24 09:09:38] ppcls INFO: name : ResNet101_vd [2023/04/24 09:09:38] ppcls INFO: DataLoader : [2023/04/24 09:09:38] ppcls INFO: Eval : [2023/04/24 09:09:39] ppcls INFO: dataset : [2023/04/24 09:09:39] ppcls INFO: cls_label_path : ./dataset/maorong_dif_long/val_list.txt [2023/04/24 09:09:39] ppcls INFO: image_root : ./dataset/maorong_dif_long/ [2023/04/24 09:09:39] ppcls INFO: name : ImageNetDataset [2023/04/24 09:09:39] ppcls INFO: transform_ops : [2023/04/24 09:09:39] ppcls INFO: DecodeImage : [2023/04/24 09:09:39] ppcls INFO: channel_first : False [2023/04/24 09:09:39] ppcls INFO: to_rgb : True [2023/04/24 09:09:39] ppcls INFO: ResizeImage : [2023/04/24 09:09:39] ppcls INFO: resize_short : 896 [2023/04/24 09:09:39] ppcls INFO: CropImage : [2023/04/24 09:09:39] ppcls INFO: size : 896 [2023/04/24 09:09:39] ppcls INFO: NormalizeImage : [2023/04/24 09:09:39] ppcls INFO: mean : [0.485, 0.456, 0.406] [2023/04/24 09:09:39] ppcls INFO: order : [2023/04/24 09:09:39] ppcls INFO: scale : 1.0/255.0 [2023/04/24 09:09:39] ppcls INFO: std : [0.229, 0.224, 0.225] [2023/04/24 09:09:39] ppcls INFO: loader : [2023/04/24 09:09:39] ppcls INFO: num_workers : 4 [2023/04/24 09:09:39] ppcls INFO: use_shared_memory : True [2023/04/24 09:09:39] ppcls INFO: sampler : [2023/04/24 09:09:39] ppcls INFO: batch_size : 4 [2023/04/24 09:09:39] ppcls INFO: drop_last : False [2023/04/24 09:09:39] ppcls INFO: name : DistributedBatchSampler [2023/04/24 09:09:39] ppcls INFO: shuffle : False [2023/04/24 09:09:39] ppcls INFO: Train : [2023/04/24 09:09:39] ppcls INFO: dataset : [2023/04/24 09:09:39] ppcls INFO: batch_transform_ops : None [2023/04/24 09:09:39] ppcls INFO: cls_label_path : ./dataset/maorong_dif_long/train_list.txt [2023/04/24 09:09:39] ppcls INFO: image_root : ./dataset/maorong_dif_long/ [2023/04/24 09:09:39] ppcls INFO: name : ImageNetDataset [2023/04/24 09:09:39] ppcls INFO: transform_ops : 
[2023/04/24 09:09:39] ppcls INFO: DecodeImage : [2023/04/24 09:09:39] ppcls INFO: channel_first : False [2023/04/24 09:09:39] ppcls INFO: to_rgb : True [2023/04/24 09:09:39] ppcls INFO: RandCropImage : [2023/04/24 09:09:39] ppcls INFO: size : 896 [2023/04/24 09:09:39] ppcls INFO: RandFlipImage : [2023/04/24 09:09:39] ppcls INFO: flip_code : 1 [2023/04/24 09:09:39] ppcls INFO: NormalizeImage : [2023/04/24 09:09:39] ppcls INFO: mean : [0.485, 0.456, 0.406] [2023/04/24 09:09:39] ppcls INFO: order : [2023/04/24 09:09:39] ppcls INFO: scale : 1.0/255.0 [2023/04/24 09:09:39] ppcls INFO: std : [0.229, 0.224, 0.225] [2023/04/24 09:09:39] ppcls INFO: loader : [2023/04/24 09:09:39] ppcls INFO: num_workers : 4 [2023/04/24 09:09:39] ppcls INFO: use_shared_memory : True [2023/04/24 09:09:39] ppcls INFO: sampler : [2023/04/24 09:09:39] ppcls INFO: batch_size : 4 [2023/04/24 09:09:39] ppcls INFO: drop_last : False [2023/04/24 09:09:39] ppcls INFO: name : DistributedBatchSampler [2023/04/24 09:09:39] ppcls INFO: shuffle : True [2023/04/24 09:09:39] ppcls INFO: Global : [2023/04/24 09:09:39] ppcls INFO: checkpoints : None [2023/04/24 09:09:39] ppcls INFO: device : gpu [2023/04/24 09:09:39] ppcls INFO: epochs : 200 [2023/04/24 09:09:39] ppcls INFO: eval_during_train : True [2023/04/24 09:09:39] ppcls INFO: eval_interval : 1 [2023/04/24 09:09:39] ppcls INFO: image_shape : [3, 896, 896] [2023/04/24 09:09:39] ppcls INFO: output_dir : ./output/ [2023/04/24 09:09:39] ppcls INFO: pretrained_model : None [2023/04/24 09:09:39] ppcls INFO: print_batch_step : 10 [2023/04/24 09:09:39] ppcls INFO: save_inference_dir : ./inference/ResNet101_vd [2023/04/24 09:09:39] ppcls INFO: save_interval : 1 [2023/04/24 09:09:39] ppcls INFO: use_visualdl : True [2023/04/24 09:09:39] ppcls INFO: Infer : [2023/04/24 09:09:39] ppcls INFO: PostProcess : [2023/04/24 09:09:39] ppcls INFO: class_id_map_file : ppcls/utils/imagenet1k_label_list.txt [2023/04/24 09:09:39] ppcls INFO: name : Topk [2023/04/24 09:09:39] 
ppcls INFO: topk : 5 [2023/04/24 09:09:39] ppcls INFO: batch_size : 10 [2023/04/24 09:09:39] ppcls INFO: infer_imgs : docs/images/inference_deployment/whl_demo.jpg [2023/04/24 09:09:39] ppcls INFO: transforms : [2023/04/24 09:09:39] ppcls INFO: DecodeImage : [2023/04/24 09:09:39] ppcls INFO: channel_first : False [2023/04/24 09:09:39] ppcls INFO: to_rgb : True [2023/04/24 09:09:39] ppcls INFO: ResizeImage : [2023/04/24 09:09:39] ppcls INFO: resize_short : 896 [2023/04/24 09:09:39] ppcls INFO: CropImage : [2023/04/24 09:09:39] ppcls INFO: size : 896 [2023/04/24 09:09:39] ppcls INFO: NormalizeImage : [2023/04/24 09:09:39] ppcls INFO: mean : [0.485, 0.456, 0.406] [2023/04/24 09:09:39] ppcls INFO: order : [2023/04/24 09:09:39] ppcls INFO: scale : 1.0/255.0 [2023/04/24 09:09:39] ppcls INFO: std : [0.229, 0.224, 0.225] [2023/04/24 09:09:39] ppcls INFO: ToCHWImage : None [2023/04/24 09:09:39] ppcls INFO: Loss : [2023/04/24 09:09:39] ppcls INFO: Eval : [2023/04/24 09:09:39] ppcls INFO: CELoss : [2023/04/24 09:09:39] ppcls INFO: weight : 1.0 [2023/04/24 09:09:39] ppcls INFO: Train : [2023/04/24 09:09:39] ppcls INFO: CELoss : [2023/04/24 09:09:39] ppcls INFO: epsilon : 0.1 [2023/04/24 09:09:39] ppcls INFO: weight : 1.0 [2023/04/24 09:09:39] ppcls INFO: Metric : [2023/04/24 09:09:39] ppcls INFO: Eval : [2023/04/24 09:09:39] ppcls INFO: TopkAcc : [2023/04/24 09:09:39] ppcls INFO: topk : [1, 5] [2023/04/24 09:09:39] ppcls INFO: Train : None [2023/04/24 09:09:39] ppcls INFO: Optimizer : [2023/04/24 09:09:39] ppcls INFO: lr : [2023/04/24 09:09:39] ppcls INFO: learning_rate : 0.1 [2023/04/24 09:09:39] ppcls INFO: name : Cosine [2023/04/24 09:09:39] ppcls INFO: momentum : 0.9 [2023/04/24 09:09:39] ppcls INFO: name : Momentum [2023/04/24 09:09:39] ppcls INFO: regularizer : [2023/04/24 09:09:39] ppcls INFO: coeff : 0.0001 [2023/04/24 09:09:39] ppcls INFO: name : L2 [2023/04/24 09:09:39] ppcls INFO: profiler_options : None [2023/04/24 09:09:39] ppcls INFO: train with paddle 2.4.2 and 
device Place(gpu:0)
W0424 09:09:43.058949 7970 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.0, Runtime API Version: 11.7
W0424 09:09:43.059475 7970 gpu_resources.cc:91] device: 0, cuDNN Version: 8.9.
[2023/04/24 09:09:43] ppcls WARNING: The training strategy provided by PaddleClas is based on 4 gpus. But the number of gpu is 2 in current training. Please modify the stategy (learning rate, batch size and so on) if use this config to train.
I0424 09:09:43.886898 7970 tcp_utils.cc:181] The server starts to listen on IP_ANY:48471
I0424 09:09:43.887037 7970 tcp_utils.cc:130] Successfully connected to 127.0.1.1:48471
[2023/04/24 09:09:57] ppcls INFO: [Train][Epoch 1/200][Iter: 0/5493]lr(CosineAnnealingDecay): 0.10000000, CELoss: 0.70461, loss: 0.70461, batch_cost: 12.86058s, reader_cost: 0.89499, ips: 0.31103 samples/s, eta: 163 days, 12:37:14


C++ Traceback (most recent call last):

0 paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)


Error Message Summary:

FatalError: Process abort signal is detected by the operating system. [TimeInfo: Aborted at 1682298608 (unix time) try "date -d @1682298608" if you are using GNU date ] [SignalInfo: SIGABRT (@0x3e800001f22) received by PID 7970 (TID 0x7f4b4ac39740) from PID 7970 ]

LAUNCH INFO 2023-04-24 09:10:08,912 Pod failed LAUNCH ERROR 2023-04-24 09:10:08,912 Container failed !!! Container rank 0 status failed cmd ['/home/uvtec/miniconda3/envs/paddle_gpu/bin/python', '-u', 'tools/train.py', '-c', './ppcls/configs/ImageNet/ResNet/ResNet101_vd.yaml'] code -6 log log/workerlog.0 env {'SHELL': '/bin/bash', 'SESSION_MANAGER': 'local/uvtec-MS-7B98:@/tmp/.ICE-unix/2423,unix/uvtec-MS-7B98:/tmp/.ICE-unix/2423', 'QT_ACCESSIBILITY': '1', 'SNAP_REVISION': '327', 'XDG_CONFIG_DIRS': '/etc/xdg/xdg-ubuntu:/etc/xdg', 'SSH_AGENT_LAUNCHER': 'gnome-keyring', 'XDG_MENU_PREFIX': 'gnome-', 'GNOME_DESKTOP_SESSION_ID': 'this-is-deprecated', 'CONDA_EXE': '/home/uvtec/miniconda3/bin/conda', '_CE_M': '', 'SNAP_REAL_HOME': '/home/uvtec', 'TERMINAL_EMULATOR': 'JetBrains-JediTerm', 'SNAP_USER_COMMON': '/home/uvtec/snap/pycharm-community/common', 'LANGUAGE': 'zh_CN:en', 'LC_ADDRESS': 'zh_CN.UTF-8', 'GNOME_SHELL_SESSION_MODE': 'ubuntu', 'LC_NAME': 'zh_CN.UTF-8', 'SSH_AUTH_SOCK': '/run/user/1000/keyring/ssh', 'TERM_SESSION_ID': 'a956c70c-5d9f-4680-8ac6-2ad7d58ec4f1', 'SNAP_INSTANCE_KEY': '', 'XMODIFIERS': '@im=ibus', 'DESKTOP_SESSION': 'ubuntu', 'LC_MONETARY': 'zh_CN.UTF-8', 'BAMF_DESKTOP_FILE_HINT': '/var/lib/snapd/desktop/applications/pycharm-community_pycharm-community.desktop', 'GTK_MODULES': 'gail:atk-bridge', 'PWD': '/media/uvtec/9AF4F5A0F4F57EB7/PaddleClas', 'XDG_SESSION_DESKTOP': 'ubuntu', 'LOGNAME': 'uvtec', 'XDG_SESSION_TYPE': 'x11', 'CONDA_PREFIX': '/home/uvtec/miniconda3/envs/paddle_gpu', 'GPG_AGENT_INFO': '/run/user/1000/gnupg/S.gpg-agent:0:1', 'SYSTEMD_EXEC_PID': '2442', 'XAUTHORITY': '/run/user/1000/gdm/Xauthority', 'DESKTOP_STARTUP_ID': 'gnome-shell/PyCharm Community Edition/2442-0-uvtec-MS-7B98_TIME1963332', 'SNAP_CONTEXT': 'ItXkcY4SkrtWx8kN-yr9nOSIKj2CLtxC0KLdiOUjek9vLzFvhRJT', 'GJS_DEBUG_TOPICS': 'JS ERROR;JS LOG', 'WINDOWPATH': '2', 'HOME': '/home/uvtec', 'USERNAME': 'uvtec', 'LANG': 'zh_CN.UTF-8', 'LC_PAPER': 'zh_CN.UTF-8', 'LS_COLORS': 
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.webp=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:', 'XDG_CURRENT_DESKTOP': 'ubuntu:GNOME', 'SNAP_ARCH': 'amd64', 'SNAP_INSTANCE_NAME': 'pycharm-community', 'SNAP_USER_DATA': '/home/uvtec/snap/pycharm-community/327', 'CONDA_PROMPT_MODIFIER': '(paddle_gpu) ', 'INVOCATION_ID': 'f614011ab40f4ab98614d015201130dd', 'MANAGERPID': '2142', 'SNAP_REEXEC': '', 'GJS_DEBUG_OUTPUT': 'stderr', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'XDG_SESSION_CLASS': 'user', 'TERM': 'xterm-256color', 'LC_IDENTIFICATION': 'zh_CN.UTF-8', '_CE_CONDA': '', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'USER': 'uvtec', 'SNAP': '/snap/pycharm-community/327', 'CONDA_SHLVL': '2', 
'SNAP_COMMON': '/var/snap/pycharm-community/common', 'SNAP_VERSION': '2023.1', 'DISPLAY': ':1', 'SHLVL': '1', 'SNAP_LIBRARY_PATH': '/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void', 'SNAP_COOKIE': 'ItXkcY4SkrtWx8kN-yr9nOSIKj2CLtxC0KLdiOUjek9vLzFvhRJT', 'LC_TELEPHONE': 'zh_CN.UTF-8', 'QT_IM_MODULE': 'ibus', 'LC_MEASUREMENT': 'zh_CN.UTF-8', 'PAPERSIZE': 'a4', 'SNAP_DATA': '/var/snap/pycharm-community/327', 'CONDA_PYTHON_EXE': '/home/uvtec/miniconda3/bin/python', 'LD_LIBRARY_PATH': '/home/uvtec/miniconda3/envs/paddle_gpu/lib/python3.8/site-packages/cv2/../../lib64:/usr/local/cuda-11.7/lib64:', 'XDG_RUNTIME_DIR': '/run/user/1000', 'CONDA_DEFAULT_ENV': 'paddle_gpu', 'LC_TIME': 'zh_CN.UTF-8', 'SNAP_NAME': 'pycharm-community', 'JOURNAL_STREAM': '8:31656', 'XDG_DATA_DIRS': '/usr/share/ubuntu:/usr/share/gnome:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop', 'PATH': '/home/uvtec/miniconda3/envs/paddle_gpu/bin:/home/uvtec/miniconda3/condabin:/usr/local/cuda-11.7/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin', 'GDMSESSION': 'ubuntu', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus', 'CONDA_PREFIX_1': '/home/uvtec/miniconda3', 'GIO_LAUNCHED_DESKTOP_FILE_PID': '3863', 'GIO_LAUNCHED_DESKTOP_FILE': '/var/lib/snapd/desktop/applications/pycharm-community_pycharm-community.desktop', 'LC_NUMERIC': 'zhCN.UTF-8', '': '/home/uvtec/miniconda3/envs/paddle_gpu/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/home/uvtec/miniconda3/envs/paddle_gpu/lib/python3.8/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/home/uvtec/miniconda3/envs/paddle_gpu/lib/python3.8/site-packages/cv2/qt/fonts', 'POD_NAME': 'gnhppr', 'PADDLE_MASTER': '127.0.1.1:48471', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '127.0.1.1:48472,127.0.1.1:48473', 
'PADDLE_CURRENT_ENDPOINT': '127.0.1.1:48472', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'} LAUNCH INFO 2023-04-24 09:10:08,912 ------------------------- ERROR LOG DETAIL ------------------------- 5, 0.456, 0.406] [2023/04/24 09:09:39] ppcls INFO: order : [2023/04/24 09:09:39] ppcls INFO: scale : 1.0/255.0 [2023/04/24 09:09:39] ppcls INFO: std : [0.229, 0.224, 0.225] [2023/04/24 09:09:39] ppcls INFO: ToCHWImage : None [2023/04/24 09:09:39] ppcls INFO: Loss : [2023/04/24 09:09:39] ppcls INFO: Eval : [2023/04/24 09:09:39] ppcls INFO: CELoss : [2023/04/24 09:09:39] ppcls INFO: weight : 1.0 [2023/04/24 09:09:39] ppcls INFO: Train : [2023/04/24 09:09:39] ppcls INFO: CELoss : [2023/04/24 09:09:39] ppcls INFO: epsilon : 0.1 [2023/04/24 09:09:39] ppcls INFO: weight : 1.0 [2023/04/24 09:09:39] ppcls INFO: Metric : [2023/04/24 09:09:39] ppcls INFO: Eval : [2023/04/24 09:09:39] ppcls INFO: TopkAcc : [2023/04/24 09:09:39] ppcls INFO: topk : [1, 5] [2023/04/24 09:09:39] ppcls INFO: Train : None [2023/04/24 09:09:39] ppcls INFO: Optimizer : [2023/04/24 09:09:39] ppcls INFO: lr : [2023/04/24 09:09:39] ppcls INFO: learning_rate : 0.1 [2023/04/24 09:09:39] ppcls INFO: name : Cosine [2023/04/24 09:09:39] ppcls INFO: momentum : 0.9 [2023/04/24 09:09:39] ppcls INFO: name : Momentum [2023/04/24 09:09:39] ppcls INFO: regularizer : [2023/04/24 09:09:39] ppcls INFO: coeff : 0.0001 [2023/04/24 09:09:39] ppcls INFO: name : L2 [2023/04/24 09:09:39] ppcls INFO: profiler_options : None [2023/04/24 09:09:39] ppcls INFO: train with paddle 2.4.2 and device Place(gpu:0) W0424 09:09:43.058949 7970 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.0, Runtime API Version: 11.7 W0424 09:09:43.059475 7970 gpu_resources.cc:91] device: 0, cuDNN Version: 8.9. [2023/04/24 09:09:43] ppcls WARNING: The training strategy provided by PaddleClas is based on 4 gpus. 
But the number of gpu is 2 in current training. Please modify the stategy (learning rate, batch size and so on) if use this config to train. I0424 09:09:43.886898 7970 tcp_utils.cc:181] The server starts to listen on IP_ANY:48471 I0424 09:09:43.887037 7970 tcp_utils.cc:130] Successfully connected to 127.0.1.1:48471 [2023/04/24 09:09:57] ppcls INFO: [Train][Epoch 1/200][Iter: 0/5493]lr(CosineAnnealingDecay): 0.10000000, CELoss: 0.70461, loss: 0.70461, batch_cost: 12.86058s, reader_cost: 0.89499, ips: 0.31103 samples/s, eta: 163 days, 12:37:14


C++ Traceback (most recent call last):

0 paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)


Error Message Summary:

FatalError: Process abort signal is detected by the operating system. [TimeInfo: Aborted at 1682298608 (unix time) try "date -d @1682298608" if you are using GNU date ] [SignalInfo: SIGABRT (@0x3e800001f22) received by PID 7970 (TID 0x7f4b4ac39740) from PID 7970 ]
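
For readers decoding the log above: the launcher reports `code -6` and the error summary gives a unix time. A minimal Python sketch for interpreting both, assuming the launcher follows the Python `subprocess` convention where a negative return code means the process was killed by that signal number. The helper name `describe_exit` is hypothetical and not part of PaddlePaddle.

```python
import signal
from datetime import datetime, timedelta, timezone


def describe_exit(code: int) -> str:
    """Map a subprocess-style return code to a readable description."""
    if code < 0:
        # Negative codes mean "terminated by signal number -code".
        return signal.Signals(-code).name
    return f"exit status {code}"


# Values copied from the log: "code -6" and "Aborted at 1682298608 (unix time)".
exit_name = describe_exit(-6)
abort_utc = datetime.fromtimestamp(1682298608, tz=timezone.utc)
# Assumption: the machine clock is UTC+8, which is what the log timestamps suggest.
abort_local = abort_utc.astimezone(timezone(timedelta(hours=8)))

print(exit_name)
print(abort_utc.isoformat())
print(abort_local.isoformat())
```

On Linux this names the signal as SIGABRT, consistent with the `SignalInfo` line, and the converted time lines up with the `Pod failed` entry at 09:10:08. It confirms the worker aborted but does not by itself identify the cause.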

Hongyuan-Liu commented 1 year ago

I'm running into this problem too. How can it be solved?

happybear1015 commented 1 year ago

I'm running into this problem too. How can it be solved?

Restarting fixed it! Give it a try too! (Note: set batch_size to a smaller value.)
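
If you lower batch_size, the warning in the log also suggests adjusting the learning rate to match. A rough Python sketch of one common heuristic, the linear scaling rule (keep learning rate divided by global batch size constant). The reference recipe of 4 GPUs x 64 images per GPU is an assumption for illustration only; check the defaults in your own config. The helper names are hypothetical. The config keys are taken from the log above, and passing them with `-o` assumes that `tools/train.py` accepts override options in that form.

```python
def scaled_learning_rate(base_lr, base_global_batch, gpus, batch_per_gpu):
    """Linear scaling rule: scale lr in proportion to the global batch size."""
    global_batch = gpus * batch_per_gpu
    return base_lr * global_batch / base_global_batch


def override_args(gpus, batch_per_gpu, base_lr, base_global_batch):
    """Build command-line override arguments for a smaller batch size."""
    lr = scaled_learning_rate(base_lr, base_global_batch, gpus, batch_per_gpu)
    return [
        "-o", f"DataLoader.Train.sampler.batch_size={batch_per_gpu}",
        "-o", f"Optimizer.lr.learning_rate={lr:g}",
    ]


# Assumed reference recipe (hypothetical): lr 0.1 at a global batch of 4 * 64 = 256.
# Target setup: 2 GPUs with a per-GPU batch size of 2.
args = override_args(gpus=2, batch_per_gpu=2, base_lr=0.1, base_global_batch=256)
print(" ".join(args))
```

The printed arguments would be appended to the training command. Treat the resulting learning rate as a starting point, not a tuned value; very small batches may still need further adjustment.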

happybear1015 commented 1 year ago

ok!