Closed ionutmodo closed 10 months ago
This looks like an OS problem. What's your OS?
Also, does a simple sanity check compile, like the one shown here: https://pytorch.org/docs/stable/generated/torch.compile.html
And have you tried just adding ldconfig to your path?
I am running on a cluster which I do not manage.
This is the output of uname -a, showing my OS details:
Linux gpu238 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
This is a sanity check compile:
Python 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> @torch.compile(options={"triton.cudagraphs": True}, fullgraph=True)
... def foo(x):
...     return torch.sin(x) + torch.cos(x)
...
>>> torch.pi
3.141592653589793
>>> a = torch.tensor([torch.pi], dtype=torch.float)
>>> foo(a)
skipping cudagraphs for unknown reason
tensor([-1.0000])
>>> foo(a)
tensor([-1.0000])
>>> foo(a)
tensor([-1.0000])
>>>
The "skipping cudagraphs for unknown reason" message sounds a bit suspicious.
Does the sanity check also work with options={"triton.cudagraphs": True, "permute_fusion": True, "shape_padding": True}, with tensors on GPU?
P.S.: Does ldconfig actually exist (ldconfig --usage)?
It seems like ldconfig does not exist, but when I type man ldconfig, I get the usage instructions. I will get in touch with my system administrator. Thank you for your help; I will get back to you once we solve this issue.
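A man page being available while the command is not found can mean the binary is installed in a directory that is not on PATH. The following is a minimal sketch for checking that from Python. The helper name find_executable and the candidate directories /sbin and /usr/sbin are assumptions chosen for illustration, not something taken from the error output.

```python
import os
import shutil


def find_executable(name, extra_dirs=("/sbin", "/usr/sbin")):
    """Locate `name`, first on PATH, then in a few extra directories.

    Returns (path, on_path). `extra_dirs` is an assumed list of places
    where system tools often live; adjust it for your machine.
    """
    found = shutil.which(name)  # searches the current PATH
    if found is not None:
        return found, True
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate, False
    return None, False


if __name__ == "__main__":
    path, on_path = find_executable("ldconfig")
    if path is None:
        print("ldconfig was not found in any of the checked directories")
    elif on_path:
        print(f"ldconfig is on PATH: {path}")
    else:
        print(f"ldconfig exists at {path}, but that directory is not on PATH")
```

If the last branch is reached, the tool is installed and only the search path needs adjusting.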
The ldconfig application is in /sbin, and the solution is to add this folder to PATH: export PATH=/sbin:$PATH
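On a cluster where editing shell startup files is inconvenient, the same change could be made inside the Python process before anything is compiled. This is a sketch under that assumption. The helper prepend_to_path is hypothetical, and unlike the export line it skips the directory if it is already listed. Changes to os.environ should be visible to any subprocess the process launches afterwards.

```python
import os


def prepend_to_path(directory, environ=None):
    """Put `directory` at the front of PATH unless it is already listed.

    `environ` defaults to os.environ; passing a plain dict makes the
    helper easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    entries = [e for e in environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.insert(0, directory)
    environ["PATH"] = os.pathsep.join(entries)
    return environ["PATH"]


if __name__ == "__main__":
    # Run this before the first torch.compile call in the same process.
    print(prepend_to_path("/sbin"))
```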
Great! I'm still a bit surprised that this isn't on your Debian path by default, but I'm glad this works.
Hi,
I am trying to replicate the final recipe by running
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade
as explained in the README file, and I am getting the following error: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'. The error message suggests setting the environment variables TORCH_LOGS="+dynamo" TORCHDYNAMO_VERBOSE=1, which I did, and the error message is shown in the box below. Please help me figure out how to solve this issue related to ldconfig. I could not find a solution to this on the web.