IsoNet-cryoET / spIsoNet

Overcoming the preferred orientation problem in cryoEM with self-supervised deep-learning
https://www.biorxiv.org/content/10.1101/2024.04.11.588921v1
MIT License

Torch Problem! #6

Closed bassemmohammed closed 5 months ago

bassemmohammed commented 5 months ago

I am getting this error. What is a potential solution?

`04-16 18:50:06, INFO Start preparing subvolumes! 04-16 18:50:14, INFO Done preparing subvolumes! 04-16 18:50:14, INFO Start training! 04-16 18:50:16, INFO Port number: 55179 learning rate 0.0003 ['isonet_maps/J527_004_volume_map_half_A_data', 'isonet_maps/J527_004_volume_map_half_B_data'] 0%| | 0/125 [00:00<?, ?batch/s][rank3]:[2024-04-16 18:50:31,316] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored [rank0]:[2024-04-16 18:50:31,318] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored [rank1]:[2024-04-16 18:50:31,319] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored [rank2]:[2024-04-16 18:50:31,336] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored /tmp/tmpsmozudmb/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpsmozudmb/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpsmozudmb/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpsmozudmb/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpsmozudmb/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpp2i9rprr/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpp2i9rprr/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpp2i9rprr/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpp2i9rprr/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpp2i9rprr/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ 
/tmp/tmpry1_48xe/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpry1_48xe/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpry1_48xe/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpry1_48xe/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpry1_48xe/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpjvtra_tg/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpjvtra_tg/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpjvtra_tg/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpjvtra_tg/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpjvtra_tg/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpmmftkvn2/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpmmftkvn2/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpmmftkvn2/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpmmftkvn2/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpmmftkvn2/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpbi1c9pol/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpbi1c9pol/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpbi1c9pol/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpbi1c9pol/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpbi1c9pol/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ 
/tmp/tmpnhv369rn/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpnhv369rn/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpnhv369rn/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpnhv369rn/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpnhv369rn/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpk6tpnhg2/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpk6tpnhg2/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpk6tpnhg2/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpk6tpnhg2/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpk6tpnhg2/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmplgr2pn1x/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmplgr2pn1x/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmplgr2pn1x/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmplgr2pn1x/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmplgr2pn1x/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpfso28o05/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpfso28o05/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpfso28o05/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpfso28o05/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpfso28o05/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ 
/tmp/tmpau0cv_zl/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmpau0cv_zl/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmpau0cv_zl/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmpau0cv_zl/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmpau0cv_zl/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmp2hxvb5sf/main.c: In function ‘list_to_cuuint64_array’: /tmp/tmp2hxvb5sf/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ /tmp/tmp2hxvb5sf/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code /tmp/tmp2hxvb5sf/main.c: In function ‘list_to_cuuint32_array’: /tmp/tmp2hxvb5sf/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode for (Py_ssize_t i = 0; i < len; i++) { ^ 0%| | 0/125 [00:12<?, ?batch/s] Traceback (most recent call last): File "/home/bassem.mohammed/.conda/envs/spisonet/bin/spisonet.py", line 8, in sys.exit(main()) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main fire.Fire(ISONET) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = 
alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir, File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000, File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta, File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method="spawn") File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes while not context.join(): File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join raise ProcessRaisedException(msg, error_index, failed_process.pid) torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error: Traceback (most recent call last): File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap fn(i, args) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 116, in ddp_train preds = model(x1)# + noise.cuda()) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, *kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn return fn(args, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner return fn(*args, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward else self._run_ddp_forward(*inputs, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward return self.module(*inputs, *kwargs) # type: ignore[index] File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File 
"/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/unet.py", line 97, in forward x, down_sampling_features = self.encoder(x) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/unet.py", line 98, in resume_in_forward x = self.decoder(x, down_sampling_features) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 652, in catch_errors return hijacked_callback(frame, cache_entry, hooks, frame_state) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 727, in _convert_frame result = inner_convert(frame, cache_entry, hooks, frame_state) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert compiled_product = _compile( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 646, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper r = func(*args, *kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 562, in compile_inner out_code = transform_code_object(code, transform) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object transformations(instructions, code_options) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 151, in _fn return fn(args, kwargs) File 
"/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 527, in transform tracer.run() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2128, in run super().run() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 818, in run and self.step() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 781, in step getattr(self, inst.opname)(inst) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2243, in RETURN_VALUE self.output.compile_subgraph( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 919, in compile_subgraph self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner return func(*args, kwds) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1087, in compile_and_call_fx_graph compiled_fn = self.call_user_compiler(gm) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper r = func(args, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1159, in call_user_compiler raise BackendCompilerFailed(self.compiler_fn, e).with_traceback( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1140, in call_user_compiler compiled_fn = compiler_fn(gm, self.example_inputs()) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 312, in compile_fn 
return self.backend_compile_fn(gm, example_inputs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper compiled_gm = compiler_fn(gm, example_inputs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/init.py", line 1668, in call return compilefx(model, inputs_, config_patches=self.config) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1168, in compile_fx return aot_autograd( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn cg = aot_module_simplified(gm, example_inputs, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 887, in aot_module_simplified compiled_fn = create_aot_dispatcher_function( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper r = func(args, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 600, in create_aot_dispatcher_function compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 425, in aot_wrapper_dedupe return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 630, in aot_wrapper_synthetic_base return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 295, in 
aot_dispatch_autograd compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper r = func(*args, kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1100, in fw_compiler_base return inner_compile( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/debug.py", line 305, in inner return fn(*args, *kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner return func(args, kwds) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 320, in compile_fx_inner compiled_graph = fx_codegen_and_compile( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 550, in fx_codegen_and_compile compiled_fn = graph.compile_to_fn() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1116, in compile_to_fn return self.compile_to_module().call File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper r = func(*args, **kwargs) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1070, in compile_to_module mod = PyCodeCache.load_by_key_path( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1892, in load_by_key_path exec(code, mod.dict, mod.dict) File 
"/tmp/torchinductor_bassem.mohammed/3s/c3sj43rjwf7es4dcg6diqwdbggn4vewhnvf5urwrpeue3m2txik5.py", line 67, in async_compile.wait(globals()) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2486, in wait scope[key] = result.result() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2330, in result kernel = self.kernel = _load_kernel(self.kernel_name, self.source_code) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2306, in _load_kernel kernel.precompile() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 188, in precompile compiled_binary, launcher = self._precompile_config( File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 308, in _precompile_config binary._init_handles() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/triton/compiler/compiler.py", line 670, in _init_handles bin_path = {driver.HIP: "hsaco_path", driver.CUDA: "cubin"}[driver.backend] File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 157, in getattr self._initialize_obj() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 154, in _initialize_obj self._obj = self._init_fn() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 187, in initialize_driver return CudaDriver() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 77, in init self.utils = CudaUtils() File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 47, in init so = _build("cuda_utils", 
src_path, tmpdir) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/triton/common/build.py", line 106, in _build ret = subprocess.check_call(cc_cmd) File "/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/subprocess.py", line 369, in check_call raise CalledProcessError(retcode, cmd) torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpau0cv_zl/main.c', '-O3', '-I/home/bassem.mohammed/.conda/envs/spisonet/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/home/bassem.mohammed/.conda/envs/spisonet/include/python3.10', '-I/tmp/tmpau0cv_zl', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpau0cv_zl/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-L/lib64', '-L/lib', '-L/lib64', '-L/lib']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True`
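For reference, the two hints at the end of the log can be applied from Python roughly as follows. This is only a sketch of the workaround the error message itself suggests, not the fix that resolved this issue. The helper name `enable_dynamo_fallback` is made up for illustration. Because spIsoNet starts its workers with `mp.spawn`, the `suppress_errors` flag would probably have to be set inside each worker process rather than only in the parent; that is an assumption about process start-up, not something confirmed in this thread.

```python
import os


def enable_dynamo_fallback():
    """Apply the settings suggested at the end of the error log.

    Hypothetical helper: returns True if torch._dynamo was importable
    and the fallback flag was set, otherwise False.
    """
    # Verbose logging switches named in the log; spawned children inherit
    # environment variables from the parent process.
    os.environ.setdefault("TORCH_LOGS", "+dynamo")
    os.environ.setdefault("TORCHDYNAMO_VERBOSE", "1")
    try:
        import torch._dynamo
    except ImportError:
        # torch is missing or too old to ship torch._dynamo.
        return False
    # Fall back to eager execution when the compile backend fails.
    torch._dynamo.config.suppress_errors = True
    return True
```

Falling back to eager mode only hides the failed compilation, so the underlying compiler problem remains.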

DanGonite57 commented 5 months ago

What version of gcc do you have? You can check with `gcc -v`.

I had the same issue and appear to have solved it by switching from gcc 4.8.5 to gcc 7.x.
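Some background on why the gcc version matters: GCC releases before 5.0 default to the gnu90 dialect, which rejects declarations such as `for (Py_ssize_t i = 0; ...)`, while GCC 5 and later default to gnu11. The sketch below checks which gcc is on the PATH and whether it is old enough to hit that error. The function names are made up for illustration, and it assumes the compiler supports `-dumpversion`.

```python
import re
import shutil
import subprocess


def parse_gcc_version(text):
    """Return (major, minor) parsed from a string such as '4.8.5' or '7'."""
    match = re.search(r"(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def defaults_to_c90(version):
    """GCC before 5.0 defaults to gnu90, which rejects C99 for-loop declarations."""
    return version < (5, 0)


def check_gcc(compiler="gcc"):
    """Describe whether the compiler found on PATH is likely to hit the C99 error."""
    path = shutil.which(compiler)
    if path is None:
        return f"{compiler} not found on PATH"
    out = subprocess.run([path, "-dumpversion"],
                         capture_output=True, text=True, check=True)
    version = parse_gcc_version(out.stdout)
    if defaults_to_c90(version):
        status = "likely too old (defaults to gnu90)"
    else:
        status = "should be fine"
    return f"{path}: gcc {version[0]}.{version[1]} {status}"


if __name__ == "__main__":
    print(check_gcc())
```

The log above shows Triton calling `/usr/bin/gcc`, so a newer gcc has to be the one that gets picked up. Putting it earlier on the PATH (for example by loading a module or installing a compiler into the conda environment) is one option. Triton's build helper may also honor the `CC` environment variable, but treat that as an assumption to verify against your installed version.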

bassemmohammed commented 5 months ago

I still got [rank0]:[2024-04-17 09:37:25,757] [0/1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored

But it proceeded to training and worked. Thank you!

procyontao commented 5 months ago

The warning "[rank0]:[2024-04-17 09:37:25,757] [0/1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored" appears on every run but does not affect execution.