JunzheJosephZhu / HiFA

Apache License 2.0

RuntimeError: CUDA error: invalid argument #9

Closed — rzwang111 closed this 4 months ago

rzwang111 commented 4 months ago

/HiFA/main.py:302
      299 │         else:
      300 │             valid_loader = NeRFDataset(opt, device=device, type='val', H=opt.H, W=opt.W,
      301 │             max_epoch = np.ceil(opt.iters / len(train_loader)).astype(np.int32)
    ❱ 302 │             ran_anything = trainer.train(train_loader, valid_loader, max_epoch)
      303 │             if ran_anything:
      304 │                 test_loader = NeRFDataset(opt, device=device, type='test', H=opt.H, W=op

/HiFA/nerf/utils.py:918 in train
      915 │         for epoch in range(self.epoch + 1, max_epochs + 1):
      916 │             self.epoch = epoch
      917 │
    ❱ 918 │             self.train_one_epoch(train_loader)
      919 │
      920 │             if self.workspace is not None and self.local_rank == 0:
      921 │                 self.save_checkpoint(full=True, best=False)

/HiFA/nerf/utils.py:1192 in train_one_epoch
     1189 │
     1190 │             # loss.backward()
     1191 │             start = time.time()
   ❱ 1192 │             self.scaler.scale(loss).backward()
     1193 │
     1194 │             self.post_train_step()
     1195 │             # self.optimizer.step()

/.conda/envs/hifa/lib/python3.9/site-packages/torch/_tensor.py:487 in backward
      484 │                 create_graph=create_graph,
      485 │                 inputs=inputs,
      486 │             )
    ❱ 487 │         torch.autograd.backward(
      488 │             self, gradient, retain_graph, create_graph, inputs=inputs
      489 │         )
      490 │

/.conda/envs/hifa/lib/python3.9/site-packages/torch/autograd/__init__.py:200 in backward
      197 │     # The reason we repeat same the comment below is that
      198 │     # some Python versions print out the first line of a multi-line function
      199 │     # calls in the traceback and some print out the last line
    ❱ 200 │     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac
      201 │         tensors, grad_tensors, retain_graph, create_graph, inputs,
      202 │         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru

RuntimeError: CUDA error: invalid argument

rzwang111 commented 4 months ago

This problem comes from a version mismatch between PyTorch, xFormers, and CUDA.

JiuTongBro commented 3 months ago

Hi. Could you please share how you solved it?

I ran the code on an RTX 3090 with torch 2.2.0 and CUDA 11.7, and I got the same error.

Thanks!

JiuTongBro commented 3 months ago

> Hi. Could you please share how you solved it?
>
> I ran the code on an RTX 3090 with torch 2.2.0 and CUDA 11.7, and I got the same error.
>
> Thanks!

Sorry, I meant I ran it with torch 2.0.0 and CUDA 11.7.

OrangeSodahub commented 3 months ago

@JiuTongBro Hi, have you solved this issue?

JiuTongBro commented 3 months ago

Unfortunately, no... It seems the current version of xFormers has a bug. I remember someone mentioned that xFormers currently cannot run on 30xx or 40xx GPUs. They tested it and it ran successfully on a V100 or A100. I also tried to run this on an A6000, and it failed too.

JiuTongBro commented 3 months ago

However, I cannot figure out which dependency of this code uses xFormers. I didn't see xFormers installed in the environment at all.

OrangeSodahub commented 3 months ago

It shouldn't be: I've run other codebases alongside xformers on an A6000 with no issues. How do you know this issue was caused by xformers?

OrangeSodahub commented 3 months ago

@JunzheJosephZhu May I ask what type of GPU you used?

JunzheJosephZhu commented 3 months ago

From my experience, installing PyTorch first and then installing xFormers causes pip to automatically upgrade PyTorch, which leads to issues. I usually do `pip install torch==xxx+cuxxx xformers --extra-index-url=...`.
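A sketch of that single-command install (the version numbers below are examples taken from later in this thread, not the ones this repo pins; adjust for your CUDA toolkit):

```shell
# Install torch and xformers in one command so pip resolves a
# mutually compatible pair, instead of upgrading torch afterwards.
pip install torch==2.0.1+cu118 xformers==0.0.20 \
    --extra-index-url https://download.pytorch.org/whl/cu118
```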

JunzheJosephZhu commented 3 months ago

I used an A100 and a 4090.

OrangeSodahub commented 3 months ago

Yes, you should be careful when installing xformers. I remember xformers==0.0.20 is for torch==2.0.1+cu118, and newer versions are for torch==2.1.*.
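The pairing above can be sketched as a small lookup helper. Note this only encodes the pairings named in this thread (the `0.0.22` entry is a hypothetical stand-in for "a newer version"); it is an illustration, not an authoritative compatibility matrix:

```python
# Hypothetical helper encoding only the torch/xformers pairings
# mentioned in this thread -- not an official compatibility matrix.
XFORMERS_TO_TORCH = {
    "0.0.20": "2.0.1",  # xformers==0.0.20 expects torch==2.0.1+cu118
    "0.0.22": "2.1",    # newer xformers releases track torch 2.1.*
}

def required_torch(xformers_version: str) -> str:
    """Return the torch version this xformers release expects."""
    try:
        return XFORMERS_TO_TORCH[xformers_version]
    except KeyError:
        raise ValueError(f"unknown xformers version: {xformers_version}")

print(required_torch("0.0.20"))  # → 2.0.1
```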

JiuTongBro commented 3 months ago

> It shouldn't be: I've run other codebases alongside xformers on an A6000 with no issues. How do you know this issue was caused by xformers?

I remember there are related issues in the xformers project. I searched this bug on Google and found them.

JiuTongBro commented 3 months ago

> Yes, you should be careful when installing xformers. I remember xformers==0.0.20 is for torch==2.0.1+cu118, and newer versions are for torch==2.1.*.

Thanks for sharing. I plan to check my environment and try running this codebase on the A6000 again later.

OrangeSodahub commented 3 months ago

I don't think this CUDA error was caused by xformers, since it happens during the backward pass. I just deleted this line and tested; the issue still exists: https://github.com/JunzheJosephZhu/HiFA/blob/1bbe86135f960f4f99ab1f4c294bb5f4151da273/nerf/sd.py#L150-L151
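For CUDA errors that surface in `backward()`, one standard way to localize the failing kernel (a general PyTorch debugging technique, not something from this repo) is to disable asynchronous kernel launches before torch is imported:

```python
import os

# CUDA kernels launch asynchronously, so an "invalid argument" error is
# often reported at an unrelated call site such as loss.backward().
# Setting CUDA_LAUNCH_BLOCKING=1 before importing torch (i.e. before the
# CUDA context is created) makes launches synchronous, so the traceback
# points at the kernel that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after setting the variable
```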

dabeschte commented 3 months ago

I was able to solve it by upgrading to Python 3.10, installing pytorch 2.2.2 manually, and removing all version pins in requirements.txt (e.g. using diffusers instead of diffusers==0.20).

I'm not sure which step solved the issue, but doing all of them together worked for me.
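Those steps might look roughly like the following (the environment name, CUDA index URL, and the sed one-liner are illustrative assumptions, not from the repo):

```shell
# Fresh environment on Python 3.10, then install torch 2.2.2 manually.
conda create -n hifa python=3.10 -y
conda activate hifa
pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121

# Drop exact version pins so pip resolves mutually compatible releases.
sed -i -E 's/==.*$//' requirements.txt
pip install -r requirements.txt
```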