Open kkjh0723 opened 4 years ago
Not sure if this will help, but I encountered the same issue and had to roll back to an earlier version of apex:
git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
@da03 thanks, I can use this workaround temporarily. Hopefully, the issue will be addressed in the latest commit.
I also hit the same error! Solved by @da03's solution.
I had the same issue, also solved by @da03's solution.
Machine configuration: CentOS 7, GCC 7.3, PyTorch 1.5, CUDA 9.2. It's a shared resource, so upgrading is not possible.
Same error with CUDA 9.0, torch 1.1.0. Thanks to @da03, this saved my life!!!
Thank you so much. Helped a lot
Saved my day! Thanks!
Same error, solved by @da03's solution, and apex installed successfully, but a new issue arises when running with apex:
  File "run_classifier.py", line 107, in train
    model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
  File "/home/jin/anaconda2/envs/bl-36/lib/python3.6/site-packages/apex/amp/frontend.py", line 358, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/home/jin/anaconda2/envs/bl-36/lib/python3.6/site-packages/apex/amp/_initialize.py", line 225, in _initialize
    optimizers[i] = _process_optimizer(optimizer, properties)
  File "/home/jin/anaconda2/envs/bl-36/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 344, in _process_optimizer
    optimizer._amp_stash.dummy_overflow_buf = torch.cuda.IntTensor([0]);
RuntimeError: CUDA error: unknown error
Configuration: CUDA 9.0, GCC 4.8, PyTorch 1.1.0, CentOS 7.
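The traceback above fails on a one-element CUDA tensor allocation inside apex, so it may be worth checking whether the same kind of allocation fails without apex in the picture. Here is a minimal sketch of such a check, assuming only that PyTorch is installed; the function name cuda_smoke_test is hypothetical and is not part of apex or PyTorch.

```python
def cuda_smoke_test():
    """Try a tiny CUDA allocation outside apex; return (ok, message)."""
    try:
        import torch
    except ImportError:
        return False, "PyTorch is not importable in this environment"

    if not torch.cuda.is_available():
        return False, "torch.cuda.is_available() returned False"

    try:
        # Same kind of allocation as the failing line in the traceback,
        # written with the generic tensor factory instead of torch.cuda.IntTensor.
        torch.zeros(1, dtype=torch.int32, device="cuda")
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return False, f"CUDA allocation failed: {exc}"

    return True, f"CUDA OK (torch {torch.__version__}, built for CUDA {torch.version.cuda})"


if __name__ == "__main__":
    ok, message = cuda_smoke_test()
    print("PASS" if ok else "FAIL", "-", message)
```

If this also fails, the problem probably sits in the driver, CUDA runtime, or PyTorch install rather than in apex itself.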
Thanks, @da03 !
Thanks @da03. This worked for me as well. I tried building pytorch from source but was getting an error there as well. Nothing else worked!
Thanks @da03 ! You saved my end-term project!
My environment:
Thanks @da03. Your awesome finding saved me a lot of trouble. My dev env looks like:
I am also seeing this error, the env is:
I found that at least commit 5b71d3695bf39 compiles without errors. Here is what I did:
git clone https://github.com/NVIDIA/apex.git && \
cd apex && \
git checkout 5b71d3695bf39 && \
python setup.py install --cuda_ext --cpp_ext
It can compile without errors.
Thanks @da03 ! My env:
I'm still encountering the issue even with the suggested rollback (git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0). Any suggestions?
Update: I was having issues with PyTorch 1.5.1, but they went away when I downgraded to PyTorch 1.4 (and rolled back with git checkout 5b71d3695bf39).
@da03 thanks, this helped a lot.
Thanks, saved my day. :smile:
My Env Info:
Thanks @da03! It does work!
}
^
/home/hadoop-basecv/.local/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h: In member function ‘Return c10::Dispatcher::callUnboxedOnly(const c10::OperatorHandle&, Args ...) const [with Return = at::Tensor; Args = {const at::Tensor&, c10::ArrayRef<long int>, c10::ArrayRef<long int>, c10::ArrayRef<long int>, c10::ArrayRef<long int>}]’:
/home/hadoop-basecv/.local/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:203:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
/home/hadoop-basecv/.local/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h: In member function ‘Return c10::Dispatcher::doCallUnboxed(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = bool; Args = {}]’:
/home/hadoop-basecv/.local/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:191:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
error: command 'gcc' failed with exit status 1
Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-av9m897s/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-av9m897s/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-2aev5wdr/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/hadoop-basecv/.local/include/python3.6m/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 854, in install
    req_description=str(self.req),
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/operations/install/legacy.py", line 86, in install
    raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 224, in _main
    status = self.run(options, args)
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/cli/req_command.py", line 180, in wrapper
    return func(self, options, args)
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 403, in run
    pycompile=options.compile,
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 90, in install_given_reqs
    pycompile=pycompile,
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 858, in install
    six.reraise(*exc.parent)
  File "/usr/local/lib/python3.6/site-packages/pip/_vendor/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/operations/install/legacy.py", line 76, in install
    cwd=unpacked_source_directory,
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 275, in runner
    spinner=spinner,
  File "/usr/local/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 240, in call_subprocess
    raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-av9m897s/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-av9m897s/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-2aev5wdr/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/hadoop-basecv/.local/include/python3.6m/apex Check the logs for full command output.
None of the checkouts worked for me. I am on CUDA 10.0 and PyTorch 1.4.
Updated to gcc 7 and it worked. It seems gcc 4.8.5 is not compatible with it.
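Since the compiler version seems to matter here, a quick pre-build check can save a failed compile. Below is a minimal sketch, assuming a threshold of gcc 7 taken only from the reports in this thread (4.8.5 failed, 7 worked); it is not an official apex requirement, and others above hit the build error with GCC 7.3 too, so a passing check does not guarantee a clean build. The helper names are hypothetical.

```python
import re
import subprocess

# Assumption: threshold based only on reports in this thread, not on apex docs.
MIN_GCC_MAJOR = 7


def parse_gcc_version(text):
    """Extract (major, minor, patch) from `gcc -dumpversion`-style text."""
    match = re.search(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    # Missing components (e.g. plain "7") are filled with 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def gcc_looks_new_enough(version, min_major=MIN_GCC_MAJOR):
    """Compare only the major version against the assumed threshold."""
    return version[0] >= min_major


def installed_gcc_version(gcc="gcc"):
    """Ask the compiler on PATH for its version."""
    result = subprocess.run(
        [gcc, "-dumpversion"], capture_output=True, text=True, check=True
    )
    return parse_gcc_version(result.stdout)


if __name__ == "__main__":
    try:
        version = installed_gcc_version()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not query gcc: {exc}")
    else:
        verdict = "ok" if gcc_looks_new_enough(version) else "probably too old"
        print("gcc", ".".join(map(str, version)), "->", verdict)
```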
I think you are right!!!
Ubuntu 18.04 LTS, CUDA 11.2, PyTorch 1.8.1+cu111, gcc 7.5, CUDA toolkit 11.1. Could someone experienced please help me? I have been stuck on this all day.
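The configuration above mentions both CUDA 11.2 and a PyTorch build for CUDA 11.1, and a mismatch between the local nvcc and the CUDA version PyTorch was built with is one possible cause of extension build failures. Here is a minimal sketch for comparing the two, assuming the usual "release X.Y" wording in `nvcc --version` output; the helper names are hypothetical, and the sample string is illustrative, not output captured from this thread.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(version_string):
    """Return (major, minor) from a torch.version.cuda-style string such as '11.1'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def cuda_versions_match(nvcc_output, torch_cuda):
    """True when nvcc and the PyTorch build agree on major and minor version."""
    return parse_nvcc_release(nvcc_output) == parse_torch_cuda(torch_cuda)


if __name__ == "__main__":
    # Illustrative sample only, not taken from the thread.
    sample = "Cuda compilation tools, release 11.2, V11.2.67"
    print(cuda_versions_match(sample, "11.1"))
```

On a real machine, the first argument would come from running `nvcc --version` and the second from `torch.version.cuda`.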
It works for me! It is absolutely gorgeous! Thanks for saving my time! Configuration: cuda 10.0, pytorch1.1, python3.7
I am also seeing this error, the env is:
- Ubuntu 18.04
- CUDA 10.0.130
- PyTorch 1.2.0
- gcc 7.4.0
I found that at least commit 5b71d3695bf39 compiles without errors. Here is what I did:
git clone https://github.com/NVIDIA/apex.git && \
cd apex && \
git checkout 5b71d3695bf39 && \
python setup.py install --cuda_ext --cpp_ext
It can compile without errors.
python 3.6
git clone https://github.com/NVIDIA/apex.git && \
cd apex && \
git checkout 5b71d3695bf39 && \
python setup.py install
This avoided the Python 3.6 annotations error. The --cuda_ext and --cpp_ext options require Torch version > 1.0.
I'm trying to update to the latest apex on a system with CUDA 9.1, PyTorch 1.1.0, and Ubuntu 16.04, and I got the error attached at the end. I recently ran into a problem where my model's performance degraded significantly after I updated my Docker image from cuda9.1-pytorch1.1.0-old_apex to cuda9.2-pytorch1.4.0-latest_apex. I want to check whether the issue came from updating PyTorch or from updating apex.
So I hope there is a way to install it on the current system (CUDA 9.1, PyTorch 1.1.0, Ubuntu 16.04) without changing the CUDA and PyTorch versions.
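To tell a PyTorch change apart from an apex change, it can help to record exactly what each Docker image contains before comparing results. Below is a minimal sketch using only the standard library (Python 3.8+ for importlib.metadata); the function names are hypothetical, and the distribution names "torch" and "apex" are assumptions about how the packages are registered in the environment.

```python
import json
import platform
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_fingerprint(names=("torch", "apex")):
    """Collect the interpreter version plus the versions of the given packages."""
    info = {"python": platform.python_version()}
    info.update(package_versions(names))
    return info


if __name__ == "__main__":
    # Run this in each image and diff the two outputs.
    print(json.dumps(environment_fingerprint(), indent=2, sort_keys=True))
```

A source-built apex may report the same version string across commits, so it is worth recording the checked-out commit hash (for example from `git rev-parse HEAD`) next to this output.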