ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

apex fails to install #83

Closed Pro100rus32 closed 2 years ago

Pro100rus32 commented 2 years ago

I tried to run the code from the Colab notebook: https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb

Cloning into 'apex'...
remote: Enumerating objects: 8717, done.
remote: Counting objects: 100% (1085/1085), done.
remote: Compressing objects: 100% (246/246), done.
remote: Total 8717 (delta 963), reused 839 (delta 839), pack-reused 7632
Receiving objects: 100% (8717/8717), 14.38 MiB | 21.06 MiB/s, done.
Resolving deltas: 100% (5959/5959), done.
/usr/local/lib/python3.7/dist-packages/pip/_internal/commands/install.py:232: UserWarning: Disabling all use of wheels due to the use of --build-option / --global-option / --install-option.
  cmdoptions.check_install_build_global(options)
Using pip 21.1.3 from /usr/local/lib/python3.7/dist-packages/pip (python 3.7)
Value for scheme.platlib does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /usr/local/lib/python3.7/dist-packages
sysconfig: /usr/lib/python3.7/site-packages
Value for scheme.purelib does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /usr/local/lib/python3.7/dist-packages
sysconfig: /usr/lib/python3.7/site-packages
Value for scheme.headers does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /usr/local/include/python3.7/UNKNOWN
sysconfig: /usr/include/python3.7m/UNKNOWN
Value for scheme.scripts does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /usr/local/bin
sysconfig: /usr/bin
Value for scheme.data does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /usr/local
sysconfig: /usr
Additional context:
user = False
home = None
root = None
prefix = None
Non-user install because site-packages writeable
Created temporary directory: /tmp/pip-ephem-wheel-cache-v_9w7hje
Created temporary directory: /tmp/pip-req-tracker-1cxrmd0x
Initialized build tracking at /tmp/pip-req-tracker-1cxrmd0x
Created build tracker: /tmp/pip-req-tracker-1cxrmd0x
Entered build tracker: /tmp/pip-req-tracker-1cxrmd0x
Created temporary directory: /tmp/pip-install-ge5e_31u
Processing /content/apex
  Created temporary directory: /tmp/pip-req-build-ttk52xp1
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
  Added file:///content/apex to build tracker '/tmp/pip-req-tracker-1cxrmd0x'
    Running setup.py (path:/tmp/pip-req-build-ttk52xp1/setup.py) egg_info for package from file:///content/apex
    Created temporary directory: /tmp/pip-pip-egg-info-hdlheg5p
    Running command python setup.py egg_info

    torch.__version__  = 1.7.1+cu110

    running egg_info
    creating /tmp/pip-pip-egg-info-hdlheg5p/apex.egg-info
    writing /tmp/pip-pip-egg-info-hdlheg5p/apex.egg-info/PKG-INFO
    writing dependency_links to /tmp/pip-pip-egg-info-hdlheg5p/apex.egg-info/dependency_links.txt
    writing top-level names to /tmp/pip-pip-egg-info-hdlheg5p/apex.egg-info/top_level.txt
    writing manifest file '/tmp/pip-pip-egg-info-hdlheg5p/apex.egg-info/SOURCES.txt'
    adding license file 'LICENSE'
    writing manifest file '/tmp/pip-pip-egg-info-hdlheg5p/apex.egg-info/SOURCES.txt'
    /tmp/pip-req-build-ttk52xp1/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
  Source in /tmp/pip-req-build-ttk52xp1 has version 0.1, which satisfies requirement apex==0.1 from file:///content/apex
  Removed apex==0.1 from file:///content/apex from build tracker '/tmp/pip-req-tracker-1cxrmd0x'
Created temporary directory: /tmp/pip-unpack-84jikymv
Skipping wheel build for apex, due to binaries being disabled for it.
Installing collected packages: apex
  Value for scheme.platlib does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
  distutils: /usr/local/lib/python3.7/dist-packages
  sysconfig: /usr/lib/python3.7/site-packages
  Value for scheme.purelib does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
  distutils: /usr/local/lib/python3.7/dist-packages
  sysconfig: /usr/lib/python3.7/site-packages
  Value for scheme.headers does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
  distutils: /usr/local/include/python3.7/apex
  sysconfig: /usr/include/python3.7m/apex
  Value for scheme.scripts does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
  distutils: /usr/local/bin
  sysconfig: /usr/bin
  Value for scheme.data does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
  distutils: /usr/local
  sysconfig: /usr
  Additional context:
  user = False
  home = None
  root = None
  prefix = None
  Created temporary directory: /tmp/pip-record-0hyg8dtt
    Running command /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ttk52xp1/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ttk52xp1/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-0hyg8dtt/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7/apex

    torch.__version__  = 1.7.1+cu110

    /tmp/pip-req-build-ttk52xp1/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

    Compiling cuda extensions with
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2020 NVIDIA Corporation
    Built on Mon_Oct_12_20:09:46_PDT_2020
    Cuda compilation tools, release 11.1, V11.1.105
    Build cuda_11.1.TC455_06.29190527_0
    from /usr/local/cuda/bin

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-ttk52xp1/setup.py", line 159, in <module>
        check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
      File "/tmp/pip-req-build-ttk52xp1/setup.py", line 103, in check_cuda_torch_binary_vs_bare_metal
        "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
    RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.0.
    In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
    Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ttk52xp1/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ttk52xp1/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-0hyg8dtt/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/req/req_install.py", line 825, in install
    req_description=str(self.req),
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/install/legacy.py", line 81, in install
    raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/base_command.py", line 180, in _main
    status = self.run(options, args)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/req_command.py", line 199, in wrapper
    return func(self, options, args)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/commands/install.py", line 402, in run
    pycompile=options.compile,
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/req/__init__.py", line 85, in install_given_reqs
    pycompile=pycompile,
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/req/req_install.py", line 829, in install
    six.reraise(*exc.parent)
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/install/legacy.py", line 71, in install
    cwd=unpacked_source_directory,
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/utils/subprocess.py", line 278, in runner
    spinner=spinner,
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/utils/subprocess.py", line 244, in call_subprocess
    raise InstallationSubprocessError(proc.returncode, command_desc)
pip._internal.exceptions.InstallationSubprocessError: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ttk52xp1/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ttk52xp1/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-0hyg8dtt/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7/apex Check the logs for full command output.
Removed build tracker: '/tmp/pip-req-tracker-1cxrmd0x'

Unsurprisingly, nothing works after that:

Traceback (most recent call last):
  File "ru-gpts/pretrain_gpt3.py", line 26, in <module>
    from apex.optimizers import FusedAdam as Adam
ModuleNotFoundError: No module named 'apex'
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'ru-gpts/pretrain_gpt3.py', '--local_rank=0', '--train-data-path', 'train.list', '--max-files-per-process', '100', '--logging-dir=log', '--save', 'model', '--load-huggingface', 'sberbank-ai/rugpt3large_based_on_gpt2', '--save-interval', '1000', '--log-interval', '100', '--eval-interval', '1000', '--eval-iters', '100', '--model-parallel-size', '1', '--num-layers', '12', '--hidden-size', '768', '--num-attention-heads', '12', '--batch-size', '1', '--seq-length', '2048', '--max-position-embeddings', '2048', '--train-iters', '2000', '--resume-dataloader', '--distributed-backend', 'nccl', '--lr', '0.00015', '--lr-decay-style', 'cosine', '--lr-decay-iters', '3200', '--clip-grad', '0.5', '--warmup', '.004', '--fp16', '--checkpoint-activations', '--deepspeed-activation-checkpointing', '--deepspeed', '--deepspeed_config', 'ru-gpts/src/deepspeed_config/gpt3_small_2048.json']' returned non-zero exit status 1.
king-menin commented 2 years ago

Please run the following commands

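# Each group separated by a blank line is its own Colab cell.
# Point /usr/local/cuda at the CUDA 10.1 toolkit and remove the preinstalled torch.
!rm -rf /usr/local/cuda
!ln -s /usr/local/cuda-10.1 /usr/local/cuda
!pip uninstall -y torch

%%bash
export LD_LIBRARY_PATH=/usr/lib/

# Install the build tools and a torch build compiled against CUDA 10.1.
!apt-get install -y clang-9 llvm-9 llvm-9-dev llvm-9-tools
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html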

before installing apex, and then continue with

%%writefile setup.sh
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

!sh setup.sh

We will update the other notebooks soon. This happens because Colab now uses cuda110 by default.
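
For anyone who wants to check for this mismatch before building apex, here is a minimal sketch. It assumes the apex check compares the major.minor release reported by `nvcc --version` with the CUDA version PyTorch was built against, which is what the error message above suggests. The helper names (`parse_nvcc_release`, `parse_torch_cuda`, `cuda_versions_match`) are made up for this illustration and are not part of apex or PyTorch.

```python
import re
from typing import Optional, Tuple


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(torch_cuda: str) -> Tuple[int, int]:
    """Turn a string such as "11.0" (the form of torch.version.cuda) into (11, 0)."""
    major, minor = torch_cuda.split(".")[:2]
    return int(major), int(minor)


def cuda_versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """True only when the toolkit and the PyTorch build agree on major.minor."""
    toolkit = parse_nvcc_release(nvcc_output)
    return toolkit is not None and toolkit == parse_torch_cuda(torch_cuda)


# Values copied from the log in this issue (hypothetical helper, real log line).
NVCC_LINE = "Cuda compilation tools, release 11.1, V11.1.105"
print(cuda_versions_match(NVCC_LINE, "11.0"))  # False: 11.1 toolkit vs. a cu110 build
```

In a notebook the two inputs would presumably come from `subprocess.check_output(["nvcc", "--version"], text=True)` and `torch.version.cuda`. If the function returns False, the apex build with `--cuda_ext` is likely to stop with the RuntimeError shown above.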