Open zhaoawen opened 6 months ago
I have tried many approaches without finding a good solution. Could you please take a look? Thank you @mrokuss
With the nnunetv2-2.3.1 version of the code I can train on this data normally, so for now I am using the old version to work around the problem. If you have time, please look into why the new version does not work.
I have the same problem. How can it be solved?
Hey @zhaoawen
As a first guess, this looks rather like a torch import problem to me than an issue with nnUNet. Could you try updating torch to the newest version or start with a completely new and up to date environment and then run again with the newest nnunet version?
Best,
Max
I'm getting the same error using the most up to date nnUNet. I did a clean env install of the most up to date pytorch with cuda 12.1 and also followed suggestions on (https://github.com/MIC-DKFZ/batchgenerators/issues/23). I get the exact same error reported above.
I could solve this issue by installing triton separately (i.e. by doing pip install triton)
@mrokuss I ran into the same problem. It worked fine on older versions, but the latest version reports that triton is not installed, and I couldn't install triton on Windows.
Hey! I encountered a very similar issue. My Triton error message said: "torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: RuntimeError: Triton Error [CUDA]: device kernel image is invalid". I verified that I could connect to and use CUDA but was still receiving this message. I followed zhaoawen's advice, uninstalled the latest version of nnU-Net and installed nnUNetv2-2.3.1, and it worked! So far my model is training (fingers crossed it stays that way).
git clone https://github.com/MIC-DKFZ/nnUNet.git
cd nnUNet
git checkout tags/v2.3.1 -b version-2.3.1-branch
Installing nnUNetv2-2.3.1 solves it.
As with above issues, the problem occurs with my nnUNet version 2.5. Previous versions of my nnUNet project that are running separately from the new nnUNet installation are running on version 2.3.1. These are running on identical servers and are still running fine.
hey @naga-karthik
I could solve this issue by installing triton separately (i.e. by doing pip install triton)
Sadly, this does not work on Windows, as triton is only supported on Linux.
Please install pytorch as specified in the installation instructions. Please use the most recent version with the highest available version of CUDA. I recommend using a conda environment. Triton will be automatically installed if you do it this way.
The reason you are encountering issues with triton is that with v2.4 we enable torch.compile by default. Depending on the GPU, this results in speed-ups of 10-30% during training, so it's definitely worth it.
If you want to disable torch.compile in nnU-Net, just export nnUNet_compile=f
or do nnUNet_compile=f nnUNetv2_train [...]
Best,
Fabian
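As a rough illustration of the switch described above, the following sketch decides whether to set `nnUNet_compile=f` based on whether a `triton` package can be found. The helper name `compile_override` is hypothetical and not part of nnU-Net. Finding the package does not guarantee that triton actually works on a given GPU, so treat this as a starting point only.

```python
import importlib.util
import os
from typing import Optional


def compile_override(triton_available: bool) -> Optional[str]:
    """Return "f" to disable torch.compile, or None to keep the default.

    Hypothetical helper, not part of nnU-Net. "f" is the value given in
    this thread for disabling torch.compile.
    """
    return None if triton_available else "f"


# find_spec returns None when no importable triton package is found.
triton_found = importlib.util.find_spec("triton") is not None

override = compile_override(triton_found)
if override is not None:
    # Must happen before training starts in this process or in a child process.
    os.environ["nnUNet_compile"] = override

print("triton found:", triton_found)
print("nnUNet_compile:", os.environ.get("nnUNet_compile", "<not set>"))
```

Leaving the variable unset when triton is found keeps the default behaviour, which according to the comment above is to compile.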
export nnUNet_compile=f
it solved it. Thanks
This does not work on Windows 10 with Anaconda PowerShell:
export nnUNet_compile=f
export : The term "export" is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ export nnUNet_compile=f
+ ~~~~~~
+ CategoryInfo : ObjectNotFound: (export:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CommandNotFoundException
Update:
I updated CUDA to 12.1, recreated the conda environment with Python 3.11.9, and installed nnunetv2 with pip install nnunetv2, but when I run the training it says RuntimeError: Cannot find a working triton installation.
When I try to run export nnUNet_compile=f, I get the same error message as mentioned above.
Update 2: Okay, now I understand. It is an environment variable that we have to set. In Anaconda PowerShell on Windows 10 this would be
conda env config vars set nnUNet_compile=f
Then it works.
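Since `export` is shell-specific, one shell-independent alternative is to set the variable only for the training process from a small Python launcher. This is a sketch under assumptions: the function names `env_without_compile` and `run_without_compile` are made up for illustration, and the training arguments are left elided as in the thread.

```python
import os
import subprocess
from typing import Dict, List, Mapping


def env_without_compile(base_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy an environment mapping and add nnUNet_compile=f to the copy."""
    env = dict(base_env)
    env["nnUNet_compile"] = "f"
    return env


def run_without_compile(command: List[str]) -> int:
    """Run a command with nnUNet_compile=f set only for that child process."""
    completed = subprocess.run(command, env=env_without_compile(os.environ))
    return completed.returncode


# Usage (arguments elided as in the thread):
# run_without_compile(["nnUNetv2_train", ...])
```

Because only a copy of the environment is modified, the setting does not persist in the shell or in the conda environment, unlike `conda env config vars set`.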
I don't know what's going on, but I am getting the error below. Everything was normal before the training, and then this problem suddenly occurred. Can you help me take a look at it?
2024-04-20 08:27:16.276530: Epoch 600
2024-04-20 08:27:16.276754: Current learning rate: 0.00438
Traceback (most recent call last):
File "/opt/conda/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
File "/opt/conda/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 274, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/opt/conda/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 210, in run_training
nnunet_trainer.run_training()
File "/opt/conda/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1295, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
File "/opt/conda/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 922, in train_step
output = self.network(data)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
return _compile(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
out_code = transform_code_object(code, transform)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
tracer.run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2069, in run
super().run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 719, in run
and self.step()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 683, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in RETURN_VALUE
self.output.compile_subgraph(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 857, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1009, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File "/opt/conda/lib/python3.10/site-packages/torch/__init__.py", line 1568, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1150, in compile_fx
return aot_autograd(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 3891, in aot_module_simplified
compiled_fn = create_aot_dispatcher_function(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 3429, in create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2212, in aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2392, in aot_wrapper_synthetic_base
return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2917, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1092, in fw_compiler_base
return inner_compile(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 80, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/debug.py", line 228, in inner
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 54, in newFunction
return old_func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 341, in compile_fx_inner
compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 565, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/graph.py", line 970, in compile_to_fn
return self.compile_to_module().call
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/graph.py", line 938, in compile_to_module
code, linemap = self.codegen()
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/graph.py", line 913, in codegen
self.scheduler = Scheduler(self.buffers)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 971, in __init__
self.nodes = [self.create_scheduler_node(n) for n in nodes]
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 971, in <listcomp>
self.nodes = [self.create_scheduler_node(n) for n in nodes]
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 1037, in create_scheduler_node
group_fn = self.get_backend(node.get_device()).group_fn
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 1642, in get_backend
self.backends[device] = self.create_backend(device)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 1634, in create_backend
raise RuntimeError(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Cannot find a working triton installation. More information on installing Triton can be found at https://github.com/openai/triton
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
Exception in thread Thread-3 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
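The error message above mentions a fallback to eager mode. A minimal sketch of that suggestion, written so it also runs where torch is not installed, might look like this. The function name `enable_eager_fallback` is hypothetical. The flag only affects the Python process in which it is set, so it would not change a separately launched `nnUNetv2_train` command, and the thread does not confirm that it resolves this issue.

```python
import importlib.util


def enable_eager_fallback() -> bool:
    """Ask torch._dynamo to fall back to eager mode on compile errors.

    Returns True if the flag was set, False if torch is not installed.
    Hypothetical helper; the setting itself is quoted from the error message.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch._dynamo

    # Suppress backend compiler errors and run the affected code uncompiled.
    torch._dynamo.config.suppress_errors = True
    return True


if enable_eager_fallback():
    print("torch._dynamo will fall back to eager mode on compile errors")
else:
    print("torch is not installed; nothing changed")
```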