One epoch does not finish even after ten hours of running

JCruan519 / VM-UNet

(ARXIV24) This is the official code repository for "VM-UNet: Vision Mamba UNet for Medical Image Segmentation".

Apache License 2.0

One epoch does not finish even after ten hours of running #25

Open lx1596 opened 6 months ago

lx1596 commented 6 months ago

On a 3090, a single epoch does not finish even after ten hours. Is this normal?

JCruan519 commented 6 months ago

@lx1596 Hello, this is obviously abnormal. We typically run an epoch on an A6000 GPU in about 2 minutes, and the duration is similar on a 3090 GPU.

Here are some suggestions for troubleshooting:

Check if the GPU is being properly utilized during the training process.
Ensure that the installed version of PyTorch is the GPU version. We recommend downloading the corresponding .whl installation package directly from the following link (https://download.pytorch.org/whl/torch_stable.html).
Investigate whether slow data loading and invocation speeds are contributing to the issue.
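
The checks above can be sketched as a short script. This is a minimal sketch and not code from the repository: `describe_torch_build` and `time_batches` are hypothetical helper names, and `train_loader` in the note below stands for whatever DataLoader your training script builds.

```python
import time


def describe_torch_build():
    """Report whether the installed PyTorch build can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only build was installed.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


def time_batches(loader, num_batches=20):
    """Average seconds spent fetching one batch from any iterable (hypothetical helper)."""
    it = iter(loader)
    fetched = 0
    start = time.perf_counter()
    for _ in range(num_batches):
        try:
            next(it)
        except StopIteration:
            break
        fetched += 1
    elapsed = time.perf_counter() - start
    # NaN signals that the loader produced no batches at all.
    return elapsed / fetched if fetched else float("nan")


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

Calling `time_batches(train_loader)` before the training loop gives a rough per-batch loading time. If that is close to the full per-iteration time, data loading is the likely bottleneck. Watching `nvidia-smi` while training runs covers the first check.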

fujinghao commented 6 months ago

Are you training on Windows? Switching to Linux should fix it.

lx1596 commented 6 months ago

ok, thanks

394481125 commented 5 months ago

Same problem here, still unresolved. On Linux with a single 3090 Ti, training runs very slowly, and it only runs at all with the batch size set to 2.

fengchuanpeng commented 5 months ago

Hello, why does the backward pass take so long when I step through it in the debugger?
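
One way to confirm that the backward pass really is the slow step is to time the forward and backward passes separately. CUDA kernels launch asynchronously, so the timer has to synchronize the device before reading the clock. A minimal sketch follows: `timed` is a hypothetical helper, and `model`, `criterion`, `images`, and `targets` in the usage comment are placeholders for your own objects.

```python
import time


def timed(fn, sync=None):
    """Run fn() and return (result, seconds).

    sync is an optional zero-argument callable, e.g. torch.cuda.synchronize,
    called before and after fn so that queued GPU work is included in the timing.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


# Usage inside a training step (placeholders, assuming a CUDA device):
#   import torch
#   out, t_fwd = timed(lambda: model(images), torch.cuda.synchronize)
#   loss = criterion(out, targets)
#   _, t_bwd = timed(loss.backward, torch.cuda.synchronize)
#   print(f"forward {t_fwd:.3f}s  backward {t_bwd:.3f}s")
```

If the backward time is far larger than the forward time, that points at the backward computation itself rather than data loading.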

tmax-cn commented 5 months ago

I am also on Windows, and it also hangs for a long time at the loss.backward() step. Have you solved this? Or does the problem go away on Linux?

plo97 commented 4 months ago

How did you manage to run it on Windows? That said, it looks like the slowness is a Windows problem and it needs to be run on Linux.

lx1596 commented 3 months ago

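In the end I did not run it on Windows; it runs normally on Linux. My guess is that I modified scan_fn_cuda while setting up the mamba environment on Windows, so the backward pass could not run on the GPU.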

xwgit2023 commented 2 months ago

On Linux I also get stuck for a long time at the loss.backward() step. Have you solved it? How do I fix this?

Wang-1812 commented 1 month ago

NameError: name 'selective_scan_fn' is not defined

Can I ask how you solved the problem? I have installed mamba_ssm and causal_conv1d.
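
A NameError like this can appear when the import of `selective_scan_fn` failed earlier and the failure was swallowed, for example by a `try`/`except` around the import. That is an assumption about the cause, not something confirmed in this thread. The sketch below surfaces the underlying error: `check_import` is a hypothetical helper, and `mamba_ssm.ops.selective_scan_interface` is where `mamba_ssm` is expected to expose `selective_scan_fn`, so treat that path as an assumption for your installed version.

```python
import importlib


def check_import(module_name, attr):
    """Try to import attr from module_name; return (ok, message)."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # compiled extensions can fail with more than ImportError
        return False, f"{type(exc).__name__}: {exc}"
    if not hasattr(module, attr):
        return False, f"{module_name} has no attribute {attr!r}"
    return True, f"{module_name}.{attr} imported"


if __name__ == "__main__":
    # Module path is an assumption; adjust it to match your mamba_ssm version.
    ok, message = check_import(
        "mamba_ssm.ops.selective_scan_interface", "selective_scan_fn"
    )
    print(ok, message)
```

If this prints an error, that message is the real problem to fix; the NameError is only a downstream symptom of it.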