MIC-DKFZ / nnUNet

Training the model using self-built 2d data was not successful #2578

Open TengfeiHe opened 2 weeks ago

TengfeiHe commented 2 weeks ago

I used 2D slices from ACDC as the training set for an nnUNetv2 model, but training failed with the following error.

Traceback (most recent call last):
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 129, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/__init__.py", line 2234, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1521, in compile_fx
    return aot_autograd(
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
    compiled_fn = dispatch_and_compile()
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 588, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1350, in fw_compiler_base
    return _fw_compiler_base(model, example_inputs, is_inference)
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1421, in _fw_compiler_base
    return inner_compile(
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 475, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
  File "/data/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 661, in _compile_fx_inner
    compiled_graph = FxGraphCache.load(
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1334, in load
    compiled_graph = compile_fx_fn(
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
    compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 878, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1913, in compile_to_fn
    return self.compile_to_module().call
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1839, in compile_to_module
    return self._compile_to_module()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1867, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2876, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_user/tt/cttfn4zlytxcz5pojbb3btfil63qrgq5aefmvvhng2fkh7b4ffz7.py", line 1925, in <module>
    async_compile.wait(globals())
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/async_compile.py", line 276, in wait
    scope[key] = result.result()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 3344, in result
    self.kernel.precompile()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 244, in precompile
    compiled_binary, launcher = self._precompile_config(
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 452, in _precompile_config
    binary._init_handles()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/triton/compiler/compiler.py", line 376, in _init_handles
    self.module, self.function, self.n_regs, self.n_spills = driver.active.utils.load_binary(
RuntimeError: Triton Error [CUDA]: device kernel image is invalid

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/user/user/anaconda3/envs/nnunet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/data/user/user/project/CV/nnunet/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/data/user/user/project/CV/nnunet/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/data/user/user/project/CV/nnunet/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/data/user/user/project/CV/nnunet/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 994, in train_step
    output = self.network(data)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
    return fn(*args, **kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1269, in __call__
    return self._torchdynamo_orig_callable(
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
    result = self._inner_convert(
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
    return _compile(
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
    return function(*args, **kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
    transformations(instructions, code_options)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 219, in _fn
    return fn(*args, **kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 634, in transform
    tracer.run()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2796, in run
    super().run()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 983, in run
    while self.step():
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 895, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2987, in RETURN_VALUE
    self._return(inst)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2972, in _return
    self.output.compile_subgraph(
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1142, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
    return self._call_user_compiler(gm)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Triton Error [CUDA]: device kernel image is invalid

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
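The two hints printed in the log above can be combined in one place. The following is only a minimal sketch: the helper name `configure_dynamo_debugging` is hypothetical, and it assumes the function is called in the same Python process as the training run, before the first `import torch`, so that the logging variables are picked up. It does not address the Triton/CUDA failure itself. With `suppress_errors` set, PyTorch is expected to fall back to eager execution when compilation fails, as the log message describes.

```python
import os


def configure_dynamo_debugging(fallback_to_eager: bool = True) -> bool:
    """Apply the settings suggested in the error log (hypothetical helper).

    Returns True only if the eager fallback flag was actually set.
    """
    # Verbose TorchDynamo logging, as suggested by the log message.
    # Assumption: these are set before torch is imported for the first time.
    os.environ["TORCH_LOGS"] = "+dynamo"
    os.environ["TORCHDYNAMO_VERBOSE"] = "1"

    if not fallback_to_eager:
        return False

    try:
        import torch._dynamo
    except ImportError:
        # PyTorch is not installed in this environment; nothing to configure.
        return False

    # Suppress backend compiler errors and fall back to eager mode.
    torch._dynamo.config.suppress_errors = True
    return True


if __name__ == "__main__":
    enabled = configure_dynamo_debugging()
    print("eager fallback enabled:", enabled)
```

One way to use this would be a small wrapper script that calls the helper and then invokes `run_training_entry()` from `nnunetv2/run/run_training.py` (the entry point visible in the traceback), rather than launching `nnUNetv2_train` directly.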

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/data/user/user/anaconda3/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message