rzwang111 closed this issue 4 months ago
This problem comes from a version mismatch between PyTorch, xformers, and CUDA.
Hi. Could you please share how you solved it?
I ran the code on an RTX 3090 with torch 2.2.0 and CUDA 11.7, and I got the same error.
Thanks!
> Hi. Could you please share how you solved it? I ran the code on an RTX 3090 with torch 2.2.0 and CUDA 11.7, and I got the same error. Thanks!
Sorry, I meant I ran it with torch 2.0.0 and CUDA 11.7.
@JiuTongBro Hi, have you solved this issue?
Unfortunately, no... The current version of xformers seems to have a bug. I remember someone mentioning that xformers currently cannot run on 30xx or 40xx GPUs; they tested it and found it runs successfully on a V100 or A100. I also tried running this on an A6000, and it fails too.
However, I cannot figure out which dependency of this code uses xformers. I didn't see xformers installed in the environment at all.
It shouldn't be xformers; I ran another codebase alongside xformers on an A6000 with no issues. How do you know this issue was caused by xformers?
@JunzheJosephZhu May I ask what type of GPU you used?
In my experience, installing PyTorch first and then xformers causes pip to automatically upgrade PyTorch, which leads to issues. I usually do pip install torch==xxx+cuxxx xformers --extra-index-url=...
I used an A100 and a 4090.
Yes, you should be careful when installing xformers. I remember xformers==0.0.20 is for torch==2.0.1+cu118, and newer versions are for torch==2.1.*.
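To make pairings like this easy to check, here is a minimal sketch of a compatibility check. The `torch_matches` helper and the `KNOWN_PAIRS` table are hypothetical, seeded only with the torch==2.0.1 / xformers==0.0.20 pairing mentioned above; consult the xformers release notes for other versions.

```python
# Hypothetical sanity check that the installed torch matches what a given
# xformers release targets; a mismatch here is the usual suspect for the
# CUDA errors in this thread.
KNOWN_PAIRS = {
    "0.0.20": "2.0.1",  # xformers 0.0.20 targets torch 2.0.1 (per this thread)
}

def torch_matches(xformers_version: str, torch_version: str) -> bool:
    """Return True if torch_version matches the torch this xformers targets."""
    expected = KNOWN_PAIRS.get(xformers_version)
    if expected is None:
        return False  # unknown pairing; check the xformers release notes
    # torch versions look like "2.0.1+cu118"; compare only the numeric part
    return torch_version.split("+")[0] == expected

print(torch_matches("0.0.20", "2.0.1+cu118"))  # True
print(torch_matches("0.0.20", "2.2.0+cu117"))  # False
```

Installing both packages in one pip command, as suggested above, avoids the silent-upgrade problem that makes such mismatches appear in the first place.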
> It shouldn't be xformers; I ran another codebase alongside xformers on an A6000 with no issues. How do you know this issue was caused by xformers?
I remember there are related issues in the xformers project; I searched for this bug on Google and found them.
> Yes, you should be careful when installing xformers. I remember xformers==0.0.20 is for torch==2.0.1+cu118, and newer versions are for torch==2.1.*.
Thanks for sharing. I plan to check my environment and try running this codebase on an A6000 again later.
I don't think this CUDA error was caused by xformers, since it happened during the backward pass. I just deleted this line and tested; the issue still exists: https://github.com/JunzheJosephZhu/HiFA/blob/1bbe86135f960f4f99ab1f4c294bb5f4151da273/nerf/sd.py#L150-L151
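One caveat when localizing a CUDA error from a traceback: CUDA kernels launch asynchronously, so the Python line the traceback blames (here, the `scaler.scale(loss).backward()` call) is often not where the failing kernel was actually launched. PyTorch supports forcing synchronous launches via the `CUDA_LAUNCH_BLOCKING` environment variable, which makes the traceback point at the real offender:

```python
# Force synchronous CUDA kernel launches so the Python traceback points at
# the line that actually launched the failing kernel. Must be set before
# CUDA is initialized (i.e. before any tensor touches the GPU).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ...then run training as usual; expect it to be noticeably slower,
# so use this only while debugging.
```

Equivalently, set it on the command line: `CUDA_LAUNCH_BLOCKING=1 python main.py ...`.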
I was able to solve it by upgrading to Python 3.10, installing PyTorch 2.2.2 manually, and removing all version pins in requirements.txt (e.g. using diffusers instead of diffusers==0.20). Not sure which step solved the issue, but doing all of them together worked for me.
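Since the fixes in this thread all come down to which versions ended up installed, a small environment report makes it easy to compare machines. This is a sketch, not part of the project; the `env_report` helper is hypothetical, and all imports are guarded so it runs even where torch or xformers is missing.

```python
# Hypothetical environment report for debugging version mismatches like
# the one in this thread: torch version, the CUDA version torch was built
# with, the visible GPU, and the xformers version (if any).
def env_report() -> dict:
    info = {}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda_built"] = torch.version.cuda  # CUDA version torch was compiled against
        info["cuda_available"] = torch.cuda.is_available()
        if torch.cuda.is_available():
            info["device"] = torch.cuda.get_device_name(0)
            info["capability"] = torch.cuda.get_device_capability(0)  # e.g. (8, 6) for RTX 3090
    except ImportError:
        info["torch"] = None
    try:
        import xformers
        info["xformers"] = xformers.__version__
    except ImportError:
        info["xformers"] = None
    return info

print(env_report())
```

Pasting this report into an issue alongside the traceback would have made the torch/CUDA/xformers mismatch obvious from the first post.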
/HiFA/main.py:302 in │
│ │
│ 299 │ │ else: │
│ 300 │ │ │ valid_loader = NeRFDataset(opt, device=device, type='val', H=opt.H, W=opt.W, │
│ 301 │ │ │ max_epoch = np.ceil(opt.iters / len(train_loader)).astype(np.int32) │
│ ❱ 302 │ │ │ ran_anything = trainer.train(train_loader, valid_loader, max_epoch) │
│ 303 │ │ │ if ran_anything: │
│ 304 │ │ │ │ test_loader = NeRFDataset(opt, device=device, type='test', H=opt.H, W=op │
│ 305 │ │ │ │ try: │
│ │
/HiFA/nerf/utils.py:918 in train │
│ │
│ 915 │ │ for epoch in range(self.epoch + 1, max_epochs + 1): │
│ 916 │ │ │ self.epoch = epoch │
│ 917 │ │ │ │
│ ❱ 918 │ │ │ self.train_one_epoch(train_loader) │
│ 919 │ │ │ │
│ 920 │ │ │ if self.workspace is not None and self.local_rank == 0: │
│ 921 │ │ │ │ self.save_checkpoint(full=True, best=False) │
│ │
/HiFA/nerf/utils.py:1192 in train_one_epoch │
│ │
│ 1189 │ │ │ │
│ 1190 │ │ │ # loss.backward() │
│ 1191 │ │ │ start = time.time() │
│ ❱ 1192 │ │ │ self.scaler.scale(loss).backward() │
│ 1193 │ │ │ │
│ 1194 │ │ │ self.post_train_step() │
│ 1195 │ │ │ # self.optimizer.step() │
│ │
/.conda/envs/hifa/lib/python3.9/site-packages/torch/_tensor.py:487 in backward │
│ │
│ 484 │ │ │ │ create_graph=create_graph, │
│ 485 │ │ │ │ inputs=inputs, │
│ 486 │ │ │ ) │
│ ❱ 487 │ │ torch.autograd.backward( │
│ 488 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 489 │ │ ) │
│ 490 │
│ │
/.conda/envs/hifa/lib/python3.9/site-packages/torch/autograd/__init__.py:200 in │
│ backward │
│ │
│ 197 │ # The reason we repeat same the comment below is that │
│ 198 │ # some Python versions print out the first line of a multi-line function │
│ 199 │ # calls in the traceback and some print out the last line │
│ ❱ 200 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 201 │ │ │ tensors, grad_tensors, retain_graph, create_graph, inputs, │
│ 202 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 203 │