123456hxh commented 1 year ago

Excuse me, did you (the author) run into the following problem when running train.py? Is it a version issue?

OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Traceback (most recent call last):
  File "C:\Users\hxh\anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 1134, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "C:\Users\hxh\anaconda3\envs\pytorch\lib\multiprocessing\queues.py", line 105, in get
    raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:/PycharmProjects/PycharmProjects/霹雳大佬/pytorch_object_detection/yolov3_spp/train.py", line 302, in <module>
    train(hyp)
  File "D:/PycharmProjects/PycharmProjects/霹雳大佬/pytorch_object_detection/yolov3_spp/train.py", line 208, in train
    scaler=scaler)
  File "D:\PycharmProjects\PycharmProjects\霹雳大佬\pytorch_object_detection\yolov3_spp\train_utils\train_eval_utils.py", line 35, in train_one_epoch
    for i, (imgs, targets, paths, _, _) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "D:\PycharmProjects\PycharmProjects\霹雳大佬\pytorch_object_detection\yolov3_spp\train_utils\distributed_utils.py", line 205, in log_every
    for obj in iterable:
  File "C:\Users\hxh\anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 652, in __next__
    data = self._next_data()
  File "C:\Users\hxh\anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 1330, in _next_data
    idx, data = self._get_data()
  File "C:\Users\hxh\anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 1296, in _get_data
    success, data = self._try_get_data()
  File "C:\Users\hxh\anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 1147, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 9508, 11376, 15364, 15632) exited unexpectedly
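The OMP hint in the log names an environment-variable workaround. A minimal sketch of applying it from Python, assuming the lines go at the very top of train.py: the variable has to be set before torch or numpy is imported, because it is consulted when the OpenMP runtime initializes. As the hint itself warns, this is unsafe and unsupported and may silently produce incorrect results; the fix it recommends is making sure only one OpenMP runtime is loaded into the process.

```python
import os

# UNSAFE workaround quoted from the OMP hint: let a second copy of the OpenMP
# runtime (libiomp5md.dll) initialize instead of aborting the process.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")

# Child processes started after this point (e.g. DataLoader workers)
# inherit os.environ. Only now import the libraries that load OpenMP:
# import torch
```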

Allenem commented 1 year ago

RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 9508, 11376, 15364, 15632) exited unexpectedly

Could this be caused by the DataLoader's num_workers setting? Try setting num_workers=0, changing the corresponding places for both the training data and the validation data below:
https://github.com/Allenem/YOLOv3SPP/blob/2d10f858a20b6384b753a802796bd0aab4d5ccd7/train.py#L170
https://github.com/Allenem/YOLOv3SPP/blob/2d10f858a20b6384b753a802796bd0aab4d5ccd7/train.py#L172
https://github.com/Allenem/YOLOv3SPP/blob/2d10f858a20b6384b753a802796bd0aab4d5ccd7/train.py#L178
https://github.com/Allenem/YOLOv3SPP/blob/2d10f858a20b6384b753a802796bd0aab4d5ccd7/train.py#L180
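A minimal sketch of that suggestion, assuming PyTorch is installed. `pick_num_workers` is a hypothetical helper (not part of this repository) that lets the worker count be forced to 0 while debugging; with num_workers=0 the DataLoader loads batches in the main process, so there is no worker process that can exit unexpectedly. The toy `TensorDataset` stands in for the project's own dataset.

```python
import os


def pick_num_workers(force_single_process=False, max_workers=8):
    """Hypothetical helper returning a DataLoader num_workers value.

    0 means batches are loaded in the main process (no worker subprocesses).
    """
    if force_single_process:
        return 0
    # Otherwise use up to max_workers, bounded by the number of CPUs.
    return min(os.cpu_count() or 1, max_workers)


def main():
    try:
        import torch
        from torch.utils.data import DataLoader, TensorDataset
    except ImportError:
        print("PyTorch is not installed; nothing to demonstrate")
        return

    dataset = TensorDataset(torch.arange(10, dtype=torch.float32))  # toy stand-in
    nw = pick_num_workers(force_single_process=True)  # -> 0 while debugging
    # Apply the same num_workers to both the training and validation loaders.
    loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=nw)
    for (batch,) in loader:
        print(batch.shape)


if __name__ == "__main__":
    # On Windows, DataLoader workers are spawned, so multi-worker loading
    # needs the entry point guarded like this.
    main()
```

Once training runs with 0 workers, raising the count again helps tell whether the crash is specific to worker processes.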

Also note that the code should be run on a GPU: train.py is for a single GPU, and train_multi_GPU.py is for multiple GPUs.

123456hxh commented 1 year ago

Thanks, it runs now. However, when I run train.py on a rented remote GPU, I get:
Traceback (most recent call last):
  File "train.py", line 302, in <module>
    train(hyp)
  File "train.py", line 198, in train
    mloss, lr = train_util.train_one_epoch(model, optimizer, train_dataloader,
  File "/root/yolov3_spp/train_utils/train_eval_utils.py", line 38, in train_one_epoch
    imgs = imgs.to(device).float() / 255.0  # uint8 to float32, 0 - 255 to 0.0 - 1.0
RuntimeError: CUDA error: no kernel image is available for execution on the device

Is this a version issue?

Allenem commented 1 year ago

It is probably because the installed CUDA and PyTorch versions do not match the current GPU model. There is an answer on Zhihu here: https://zhuanlan.zhihu.com/p/466793485
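One way to check this on the rented machine is to compare the GPU's compute capability with the architectures the installed PyTorch binary was compiled for. The sketch below uses a hypothetical helper, `has_compatible_kernel` (not part of PyTorch or this repo), that applies the CUDA compatibility rules in simplified form: an `sm_XY` binary runs on GPUs with the same major version and an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for GPUs of capability X.Y or newer. Arch-specific entries such as `sm_90a` are simply skipped. The `main()` part reads the real values with `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`.

```python
import re


def has_compatible_kernel(capability, arch_list):
    """Return True if a GPU with compute capability (major, minor) can run
    code from a build compiled for the architectures in arch_list."""
    major, minor = capability
    for entry in arch_list:
        match = re.fullmatch(r"(sm|compute)_(\d{2,})", entry)
        if match is None:
            continue  # skip arch-specific variants such as "sm_90a"
        kind, digits = match.groups()
        e_major, e_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm" and e_major == major and e_minor <= minor:
            return True  # binary: same major version, equal or lower minor
        if kind == "compute" and (e_major, e_minor) <= (major, minor):
            return True  # PTX: can be JIT-compiled for equal or newer GPUs
    return False


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    if not torch.cuda.is_available():
        print("CUDA is not available to this PyTorch build")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("GPU:", torch.cuda.get_device_name(0), "capability:", capability)
    print("PyTorch", torch.__version__, "built with CUDA", torch.version.cuda)
    print("compiled architectures:", arch_list)
    print("compatible kernel found:", has_compatible_kernel(capability, arch_list))


if __name__ == "__main__":
    main()
```

If it reports no compatible kernel, installing a PyTorch build compiled with a CUDA version that supports that GPU is the usual remedy.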

123456hxh commented 1 year ago

Thanks.

Allenem / YOLOv3SPP

Is it a PyTorch version issue? #1