JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

torch._dynamo error on step 2: calling compiler function 'inductor' #39

Closed by ionutmodo 9 months ago

ionutmodo commented 9 months ago

Hi,

I am trying to replicate the final recipe by running python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade, as explained in the README file, and I am getting the following error: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'.

The error message suggests setting the environment variables TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1, which I did; the resulting output is shown in the box below. Please help me figure out how to solve this issue related to ldconfig. I could not find a solution on the web.

[2023-12-19 17:44:59,958] [0/0] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function inductor
Error executing job with overrides: ['name=amp_b8192_cb_o4_final', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade']
Traceback (most recent call last):
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/pretrain.py", line 199, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/pretrain.py", line 55, in main_training_process
    loss = model_engine.step(device_batch)
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/backend/torch_default.py", line 124, in step
    loss = self.forward(**batch)["loss"]
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/backend/torch_default.py", line 140, in forward
    return self.model(*inputs, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2069, in run
    super().run()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 719, in run
    and self.step()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 683, in step
    getattr(self, inst.opname)(inst)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 857, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1009, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/__init__.py", line 1568, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 961, in compile_fx
    return compile_fx(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1150, in compile_fx
    return aot_autograd(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 3891, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 3429, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2212, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2392, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2917, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1092, in fw_compiler_base
    return inner_compile(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 80, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/debug.py", line 228, in inner
    return fn(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 54, in newFunction
    return old_func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 341, in compile_fx_inner
    compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 565, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/graph.py", line 970, in compile_to_fn
    return self.compile_to_module().call
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/graph.py", line 941, in compile_to_module
    mod = PyCodeCache.load_by_key_path(key, path, linemap=linemap)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1139, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_imodoran/k6/ck6fiae7msa7cgviyukidcm4bynb5bjdai7xz5hbv7tswlzqpxba.py", line 1127, in <module>
    async_compile.wait(globals())
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1418, in wait
    scope[key] = result.result()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1277, in result
    self.future.result()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'

JonasGeiping commented 9 months ago

This looks like an OS problem. What's your OS?

Also, does a simple sanity check compile, like the one shown here: https://pytorch.org/docs/stable/generated/torch.compile.html

JonasGeiping commented 9 months ago

Also, have you tried just adding ldconfig to your PATH?

ionutmodo commented 9 months ago

I am running on a cluster which I do not manage.

This is the output of uname -a showing my OS details: Linux gpu238 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux.

This is a sanity check compile:

Python 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> @torch.compile(options={"triton.cudagraphs": True}, fullgraph=True)
... def foo(x):
...     return torch.sin(x) + torch.cos(x)
...
>>> torch.pi
3.141592653589793
>>> a = torch.tensor([torch.pi], dtype=torch.float)
>>> foo(a)
skipping cudagraphs for unknown reason
tensor([-1.0000])
>>> foo(a)
tensor([-1.0000])
>>> foo(a)
tensor([-1.0000])
>>>

JonasGeiping commented 9 months ago

The "skipping cudagraphs for unknown reason" message sounds a bit suspicious.

Does the sanity check also work with options={"triton.cudagraphs": True, "permute_fusion": True, "shape_padding": True}, with tensors on the GPU?

P.S.: Does ldconfig actually exist (ldconfig --usage)?
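
The same check can also be done from inside the Python environment that runs the training script. This is only a minimal sketch: the helper name find_ldconfig and the candidate directories /sbin and /usr/sbin are assumptions for illustration, not part of cramming or PyTorch.

```python
import os
import shutil

# Directories where ldconfig often lives on Debian-like systems (assumed list).
CANDIDATE_DIRS = ("/sbin", "/usr/sbin")


def find_ldconfig(extra_dirs=CANDIDATE_DIRS):
    """Return (path, on_path).

    path    -- location of an executable ldconfig, or None if none was found
    on_path -- True only if the current PATH already resolves ldconfig
    """
    found = shutil.which("ldconfig")
    if found is not None:
        return found, True
    # Not on PATH: look in the usual system directories directly.
    for directory in extra_dirs:
        candidate = os.path.join(directory, "ldconfig")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate, False
    return None, False


path, on_path = find_ldconfig()
if path is None:
    print("ldconfig not found on PATH or in", CANDIDATE_DIRS)
elif on_path:
    print("ldconfig is on PATH:", path)
else:
    print("ldconfig exists at", path, "but its directory is not on PATH")
```

If the last message is printed, the binary is installed and only the PATH of the current shell or job is missing its directory.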

ionutmodo commented 9 months ago

It seems that the ldconfig command is not found, although man ldconfig does show its usage instructions. I will get in touch with my system administrator. Thank you for your help; I will get back to you once we solve this issue.

ionutmodo commented 9 months ago

The ldconfig application is in /sbin and the solution is to add this folder to PATH: export PATH=/sbin:$PATH
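
Where editing the shell environment is inconvenient (for example inside a cluster job), the same fix could in principle be applied from Python. This is a hedged sketch rather than anything tested in this thread: prepend_to_path is a hypothetical helper, and it assumes the call happens before the first compiled forward pass so that any subprocess looking for ldconfig inherits the updated PATH.

```python
import os


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH unless it is already listed.

    `env` defaults to os.environ; pass a plain dict to try it out safely.
    """
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


# Demonstration on a throwaway dict, so the real environment is untouched.
demo_env = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
print(prepend_to_path("/sbin", demo_env))
```

Calling prepend_to_path("/sbin") with no second argument modifies os.environ for the current process, which mirrors export PATH=/sbin:$PATH for that run only.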

JonasGeiping commented 9 months ago

Great! I'm still a bit surprised that this isn't on your Debian PATH by default, but I'm glad this works.