0num4 / kanachan

A Japanese (Riichi) Mahjong AI Framework

Get bert-ph1 running #7

Open 0num4 opened 9 months ago

0num4 commented 9 months ago

kanachan/training/bert/phase1/Dockerfile uses the Python files baked into the FROM cryolite/kanachan base image, so local changes apparently aren't picked up (even after pip install .). Work around this by adding a COPY instruction to the Dockerfile, or something along those lines.

FROM cryolite/kanachan
# COPY sources are resolved relative to the build context (`.` = /workspaces/kanachan-wsl below),
# so use a context-relative path rather than an absolute host path:
COPY kanachan /workspace/kanachan

WORKDIR /workspace/data

# ENTRYPOINT ["torchrun", "--nproc_per_node", "gpu", "--standalone", "-m", "kanachan.training.bert.phase1.train"]
ENTRYPOINT ["torchrun", "--nproc_per_node", "gpu", "--standalone", "-m", "kanachan.training.bert.phase1.testa"]
root ➜ /workspaces/kanachan-wsl (majsoul-proto) $ pwd
/workspaces/kanachan-wsl
docker build --no-cache -f kanachan/training/bert/phase1/Dockerfile -t cryolite/kanachan.training.bert.phase1-1 .
docker run --gpus all -v bert-ph1:/workspace/data cryolite/kanachan.training.bert.phase1-1 --training-data=. --training-batch-size=1024 --output-prefix=nyaa --model-preset=base

Still not working.
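
One way to tell whether the container is really using the copied-in sources, rather than the package installed in the base image, is to ask Python where the module is loaded from. A small diagnostic sketch of my own (not something from the repo), to be run inside the container:

# Hypothetical diagnostic, not part of the repo: print where the train module
# is actually loaded from. If this shows a path inside the base image's
# installed package instead of the copied-in /workspace/kanachan, the local
# sources still aren't the ones being used.
import importlib.util

spec = importlib.util.find_spec("kanachan.training.bert.phase1.train")
print(spec.origin if spec is not None else "module not found")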

0num4 commented 9 months ago

When running it locally:

root ➜ /workspaces/kanachan-wsl (majsoul-proto) $ torchrun --nproc_per_node gpu --standalone -m kanachan.training.bert.phase1.testa
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
helloworld
/workspaces/kanachan-wsl
Cannot find primary config 'config'. Check that it's in your config search path.

Config search path:
        provider=hydra, path=pkg://hydra.conf
        provider=schema, path=structured://

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 41223) of binary: /usr/local/python/current/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
kanachan.training.bert.phase1.testa FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-07_10:14:41
  host      : 34e0a286703b
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 41223)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root ➜ /workspaces/kanachan-wsl (majsoul-proto) $ 
0num4 commented 9 months ago

The version I have locally and the original are completely different, which made me cry (for one thing, it doesn't even use Hydra).
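
For reference, the "Cannot find primary config 'config'" error above means Hydra couldn't find a config named config, either as a YAML file on its search path or registered in the ConfigStore. A minimal, self-contained sketch of how a Hydra entry point is typically wired up (the names here are illustrative, not kanachan's actual code):

# Illustrative Hydra entry point; names are not kanachan's actual code.
# Registering a node named "config" in the ConfigStore (or shipping a
# config.yaml on the search path) is what makes the primary config findable.
from dataclasses import dataclass

import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import DictConfig


@dataclass
class Config:
    message: str = "helloworld"


ConfigStore.instance().store(name="config", node=Config)  # registers "config"


@hydra.main(config_path=None, config_name="config")
def _main(config: DictConfig) -> None:
    print(config.message)


if __name__ == "__main__":
    _main()

The train.py traceback further down goes through the same @hydra.main machinery, so the real entry point does have its config wired up; my testa stub presumably didn't.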

0num4 commented 9 months ago

root ➜ /workspaces/kanachan-wsl (majsoul-proto) $ HYDRA_FULL_ERROR=1 torchrun --nproc_per_node gpu --standalone -m kanachan.training.bert.phase1.train
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
2024-01-07 11:08:58.175275: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-07 11:08:58.175336: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-07 11:08:58.176048: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-07 11:08:58.180764: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-07 11:08:58.859855: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/workspaces/kanachan-wsl
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspaces/kanachan-wsl/kanachan/training/bert/phase1/train.py", line 37, in <module>
    _main()  # pylint: disable=no-value-for-parameter
  File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/workspaces/kanachan-wsl/kanachan/training/bert/phase1/train.py", line 26, in _main
    training.main(
  File "/workspaces/kanachan-wsl/kanachan/training/bert/training.py", line 255, in main
    print(config.training_data)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 359, in __getattr__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 451, in _get_impl
    return self._resolve_with_default(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 96, in _resolve_with_default
    raise MissingMandatoryValue("Missing mandatory value: $FULL_KEY")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: training_data
    full_key: training_data
    object_type=Config
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51915) of binary: /usr/local/python/current/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
kanachan.training.bert.phase1.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-07_11:09:01
  host      : 34e0a286703b
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 51915)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I've been endlessly fiddling with this.
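
For what it's worth, MissingMandatoryValue is omegaconf's standard behavior for a field declared as mandatory (MISSING): it must be supplied at runtime, e.g. as a Hydra command-line override. A standalone sketch (the Config class below is illustrative, not kanachan's actual schema):

# Standalone sketch of omegaconf's missing-mandatory-value behavior;
# the Config class here is illustrative, not kanachan's actual schema.
from dataclasses import dataclass

from omegaconf import MISSING, OmegaConf
from omegaconf.errors import MissingMandatoryValue


@dataclass
class Config:
    training_data: str = MISSING  # "???" — must be supplied at runtime
    training_batch_size: int = 1024


cfg = OmegaConf.structured(Config)

try:
    print(cfg.training_data)          # no value yet -> MissingMandatoryValue
except MissingMandatoryValue:
    print("training_data is still missing")

# A command-line override like `training_data=./bert-ph1/annotated.txt`
# amounts to a merge such as this, after which the access succeeds:
cfg = OmegaConf.merge(
    cfg, OmegaConf.from_dotlist(["training_data=./bert-ph1/annotated.txt"])
)
print(cfg.training_data)              # -> ./bert-ph1/annotated.txt

Passing training_data=./bert-ph1/annotated.txt on the command line, as in the next comment, is exactly that kind of override.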

0num4 commented 9 months ago

After more fiddling, it's now acting like it somehow finished.

root ➜ /workspaces/kanachan-wsl (majsoul-proto) $ HYDRA_FULL_ERROR=1 torchrun --nproc_per_node gpu --standalone -m kanachan.training.bert.phase1.train training_data=./bert-ph1/annotated.txt training_batch_size=1024
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
2024-01-07 11:19:42.634169: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-07 11:19:42.634225: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-07 11:19:42.634916: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-07 11:19:42.639582: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-07 11:19:43.282917: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/workspaces/kanachan-wsl
bert-ph1/annotated.txt
[2024-01-07 11:19:44,260][root][INFO] - World size: 1
[2024-01-07 11:19:44,261][root][INFO] - Process rank: 0
[2024-01-07 11:19:44,261][root][INFO] - Training data: bert-ph1/annotated.txt
[2024-01-07 11:19:44,261][root][INFO] - Validation data: N/A
[2024-01-07 11:19:44,262][root][INFO] - # of workers: 2
[2024-01-07 11:19:44,262][root][INFO] - Device: cuda
[2024-01-07 11:19:44,262][root][INFO] - cuDNN: available
[2024-01-07 11:19:44,263][root][INFO] - dtype: torch.float32
[2024-01-07 11:19:44,263][root][INFO] - AMP dtype: torch.float16
[2024-01-07 11:19:44,263][root][INFO] - Position encoder: position_embedding
[2024-01-07 11:19:44,263][root][INFO] - Encoder dimension: 768
[2024-01-07 11:19:44,264][root][INFO] - # of heads for encoder: 12
[2024-01-07 11:19:44,264][root][INFO] - Dimension of feedforward networks for encoder: 3072
[2024-01-07 11:19:44,264][root][INFO] - Activation function for encoder: gelu
[2024-01-07 11:19:44,264][root][INFO] - Dropout for encoder: 0.100000
[2024-01-07 11:19:44,265][root][INFO] - # of encoder layers: 12
[2024-01-07 11:19:44,265][root][INFO] - Dimension of feedforward networks for decoder: 3072
[2024-01-07 11:19:44,265][root][INFO] - Activation function for decoder: gelu
[2024-01-07 11:19:44,266][root][INFO] - Dropout for decoder: 0.100000
[2024-01-07 11:19:44,266][root][INFO] - # of decoder layers: 2
[2024-01-07 11:19:44,266][root][INFO] - Checkpointing: False
[2024-01-07 11:19:44,266][root][INFO] - Local training batch size: 1024
[2024-01-07 11:19:44,267][root][INFO] - World training batch size: 1024
[2024-01-07 11:19:44,267][root][INFO] - # of steps for gradient accumulation: 1
[2024-01-07 11:19:44,267][root][INFO] - Virtual training batch size: 1024
[2024-01-07 11:19:44,268][root][INFO] - Norm threshold for gradient clipping: 1.000000E+00
[2024-01-07 11:19:44,268][root][INFO] - Optimizer: lamb
[2024-01-07 11:19:44,268][root][INFO] - Learning rate: 1.000000E-03
[2024-01-07 11:19:44,268][root][INFO] - Experiment output: /workspaces/kanachan-wsl/outputs/2024-01-07/11-19-44
[2024-01-07 11:19:44,269][root][INFO] - Snapshot interval: N/A
[2024-01-07 11:19:45,104][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2024-01-07 11:19:45,104][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
[2024-01-07 11:19:45,265][root][INFO] - A training epoch has finished (elapsed time = 0:00:00.091498).
root ➜ /workspaces/kanachan-wsl (majsoul-proto) $ 
0num4 commented 9 months ago

It finished (I'm sure of it).