Enable pre-training of up-causal unet

bottlenome commented 3 months ago

[x] Create a GitHub account and write a comment on this ticket.
[x] Set up a Linux environment with a GPU (Ubuntu 22.04 is recommended).
[x] Install the git command in the Linux environment and clone this repository to your local machine.
[x] Install Docker with NVIDIA GPU support in the Linux environment and build the Docker image.
[x] Run some training (scripts/train*) on your machine using the docker_run.sh command.
[ ] Read the document below to understand what kind of training data and scripts are needed. https://huggingface.co/blog/pretraining-bert
[ ] Make a dataloader at bracket_net/data/lm.py and write its test in that script.
[ ] Add test execution code to test/unit_test.sh and make a pull request for the above change.
[ ] Make a model module at bracket_net/domain/lm/up_causal_unet.py (with a pull request).
[ ] Make a training script at scripts/train_lm.py (with a pull request).
[ ] Try training.

mizusawaok commented 3 months ago

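The previous steps did not match how things work on Windows 11, so I had no choice but to ask ChatGPT again.

User: Please tell me the steps to do the following on a Windows 11 machine: "Set up a GPU and Linux environment (Ubuntu 22.0 is recommended)"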

code_teacher: Here are the steps to set up a GPU and a Linux environment (recommended: Ubuntu 22.04) on a Windows 11 machine. This guide mainly uses Windows Subsystem for Linux (WSL) 2.

Follow the steps below.

WSL 2 installation and setup

Open Windows Settings: click Settings in the Start menu.
Go to the "Programs and Features" menu: select Apps > Optional features.
Enable the required features: enable the two features below, tick their checkboxes, click OK, and then restart the PC.
  Virtual Machine Platform
  Windows Subsystem for Linux
Initialize WSL 2: open Command Prompt (CMD) or PowerShell as administrator and run the following command:
  wsl --install

Ubuntu 22.04 installation

Open the Microsoft Store: search for Ubuntu, select Ubuntu 22.04 LTS, and click the Install button.
Ubuntu initial setup: when the installation finishes, launch Ubuntu from the Start menu. On first launch, enter the user name and password you want to set.

GPU driver installation

Install the NVIDIA driver: download and install the latest NVIDIA driver for Windows. Get the driver that matches your GPU from NVIDIA Driver Downloads. Restart the PC after installing.
Install the CUDA toolkit: it is common to use a virtual environment managed by anaconda or pip. Inside WSL, install the CUDA tools with the following commands:
  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
  sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
  wget https://developer.download.nvidia.com/compute/cuda/11.5.1/local_installers/cuda-repo-ubuntu2004-11-5-local_11.5.1-495.29.05-1_amd64.deb
  sudo dpkg -i cuda-repo-ubuntu2004-11-5-local_11.5.1-495.29.05-1_amd64.deb
  sudo apt-key add /var/cuda-repo-ubuntu2004-11-5-local/7fa2af80.pub
  sudo apt-get update
  sudo apt-get -y install cuda

cuDNN (CUDA Deep Neural Network Library) installation

Download cuDNN: download cuDNN from the official NVIDIA site (registration required). cuDNN Download
Install cuDNN: extract the downloaded .tgz file and copy the files as follows:
  tar -xzvf cudnn-x.x-linux-x64-vx.x.tgz
  sudo cp cuda/include/cudnn.h /usr/local/cuda/include
  sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
  sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

TensorFlow/PyTorch installation

Install TensorFlow or PyTorch with pip. The following is the PyTorch example:
  pip install torch torchvision torchaudio

GPU check

Confirm that the GPU is recognized with the following command:
  nvidia-smi
Once all the setup is complete, run the following Python script to confirm that the GPU works correctly.

Python script for checking the GPU:
  import torch
  print(torch.cuda.is_available())
If this prints True, the setup completed successfully.

These are the steps for setting up a GPU and Ubuntu 22.04 on a Windows 11 machine.
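
The final check in the steps above can be wrapped in a small script that also reports which devices were found. This is only a sketch: the helper name `gpu_report` is made up for illustration, and it relies only on `torch.cuda.is_available()`, `torch.cuda.device_count()`, and `torch.cuda.get_device_name()`. If PyTorch is not installed, it reports that instead of failing.

```python
import importlib.util


def gpu_report():
    """Return a small dict describing whether PyTorch can see a CUDA GPU.

    `gpu_report` is a hypothetical helper name, not part of PyTorch.
    """
    report = {"torch_installed": False, "cuda_available": False, "devices": []}
    # Look for PyTorch without importing it, so the script still runs without it.
    if importlib.util.find_spec("torch") is None:
        return report
    import torch

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    print(gpu_report())
```

Running this both on the host (WSL) and inside the Docker container makes it easier to tell which of the two environments cannot see the GPU.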


mizusawaok commented 3 months ago

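Of the steps in ChatGPT's answer, I have completed the following.

Open Command Prompt (CMD) or PowerShell as administrator and run the following command:
  wsl --install

Ubuntu 22.04 installation

Open the Microsoft Store: launch the Microsoft Store and search for Ubuntu. Select the Ubuntu 22.04 LTS version and install it.
Ubuntu initial setup: when the installation finishes, launch Ubuntu from the Start menu. On first launch, set the user name and password.

GPU driver installation

Install the NVIDIA driver: download and install the latest NVIDIA driver for Windows. Get the driver that matches your GPU from NVIDIA Driver Downloads. Restart the PC after installing.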

mizusawaok commented 3 months ago

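I asked ChatGPT again.

User: I installed Ubuntu on Windows 11. Please tell me the steps for the next task: "Install the git command in the Linux environment and clone this repository to your local machine."

code_teacher: After installing Ubuntu on Windows 11, the steps to install the git command in the Linux environment and clone the repository are as follows.

Steps

Open the Ubuntu terminal: press the Windows key to open the menu, then search for and launch the "Ubuntu" app.
Install the required updates:
  sudo apt update
  sudo apt upgrade -y
Install git:
  sudo apt install git -y
Confirm that the installation succeeded:
  git --version
If git is installed correctly, the installed git version is displayed.
Configure your GitHub user (optional): these settings are needed to make commits with git.
Set the user name:
  git config --global user.name "Your Name"
Set the email address:
  git config --global user.email "youremail@example.com"
Move to the working directory: use the cd command to move to the directory where you want to clone the repository. Example:
  cd ~/workspace
Clone the repository: get the repository URL and enter the following command:
  git clone https://github.com/<username>/<repository-name>.git
Example:
  git clone https://github.com/exampleuser/example-repo.git
Enter the repository directory:
  cd <repository-name>
Example:
  cd example-repo
With the steps above, Git is installed in the Linux environment and the specified repository is cloned to your local machine.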

mizusawaok commented 3 months ago

I cloned bracket_net to my local machine.

bottlenome commented 3 months ago

I suspect your local machine does not have an NVIDIA GPU. Is that right? If so, I will tell you how to connect to a machine that I manage.

bottlenome commented 3 months ago

Looks good. So nvidia-smi was working too, right?

mizusawaok commented 3 months ago

Docker seems to be set up correctly.
mizu@hp:~/workspace/bracket_net$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor pre>
     Active: active (running) since Fri 2024-06-14 15:53:17 JST; 17h ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 454 (dockerd)
      Tasks: 16
     Memory: 61.5M
     CGroup: /system.slice/docker.service
             └─454 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/co>

Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.713122814+09:00" >
Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.742634984+09:00" >
Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.757954861+09:00" >
Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.757982756+09:00" >
Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.757997306+09:00" >
Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.758000520+09:00" >
Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.758012398+09:00" >
Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.758248754+09:00" >
Jun 14 15:53:17 hp dockerd[454]: time="2024-06-14T15:53:17.859422479+09:00" >
Jun 14 15:53:17 hp systemd[1]: Started Docker Application Container Engine.

mizu@hp:~/workspace/bracket_net$ docker --version
Docker version 26.1.4, build 5650f9b

bottlenome commented 3 months ago

> I think the following means that nvidia-smi is working correctly.
> mizu@hp:~/workspace/bracket_net$ nvidia-smi ... Segmentation fault

It ends with a segmentation fault, so it is not working correctly.

bottlenome commented 3 months ago

First, let's get nvidia-smi working correctly in the WSL environment. https://zenn.dev/utahka/articles/ed881a568246f4

bottlenome commented 3 months ago

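Docker itself has been installed, and you pulled something from a public repository.

The Docker image has not been built, and nvidia-smi is not working.

That seems to be the current situation.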

bottlenome commented 3 months ago

In the current state you probably cannot train on the GPU, but the following script can still be run for training, so please give it a try. scripts/train.py

bottlenome commented 3 months ago

Run ./docker_run.sh to enter the Docker container, then run python scripts/train.py inside it.

mizusawaok commented 3 months ago

The cuDNN sample may have worked...
j@hp:~/cuda-samples/Samples/4_CUDA_Libraries$ cd matrixMulCUBLAS
j@hp:~/cuda-samples/Samples/4_CUDA_Libraries/matrixMulCUBLAS$ make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../../Common -m64 --threads 0 --std=c++11 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o matrixMulCUBLAS.o -c matrixMulCUBLAS.cpp
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o matrixMulCUBLAS matrixMulCUBLAS.o -lcublas
mkdir -p ../../../bin/x86_64/linux/release
cp matrixMulCUBLAS ../../../bin/x86_64/linux/release
j@hp:~/cuda-samples/Samples/4_CUDA_Libraries/matrixMulCUBLAS$ ./matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

GPU Device 0: "NVIDIA RTX A500 Laptop GPU" with compute capability 8.6

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 2283.00 GFlop/s, Time= 0.086 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

mizusawaok commented 3 months ago

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
j@hp:/mnt/c/windows/system32$ cd ~

bottlenome commented 3 months ago

The defaults of scripts.train are probably broken at the moment, so please try it with the options below. If you get a memory error, try reducing batch_size.

python -m scripts.train model=up-causual-naive params.batch_size=25 params.num_epochs=50
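
The command above passes Hydra-style `key=value` overrides, where a dotted key such as `params.batch_size` addresses a nested config entry. As a rough illustration only, the mapping can be sketched as below. This is not Hydra's parser: `apply_overrides` and the default values are made up, and in Hydra itself `model=...` would typically select a config group rather than set a plain string.

```python
import copy


def apply_overrides(config, overrides):
    """Apply "a.b=value" style overrides to a nested dict and return a copy.

    Simplified stand-in for illustration; Hydra's real override grammar is richer.
    """
    result = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        node = result
        *parents, leaf = key.split(".")
        for name in parents:
            node = node.setdefault(name, {})
        # Keep integers as integers; everything else stays a string.
        node[leaf] = int(raw) if raw.lstrip("-").isdigit() else raw
    return result


# Hypothetical defaults; the real values live in the repository's train.yaml.
defaults = {"model": "some-default", "params": {"batch_size": 100, "num_epochs": 10}}
config = apply_overrides(
    defaults,
    ["model=up-causual-naive", "params.batch_size=25", "params.num_epochs=50"],
)
print(config)
```

If a run fails with an out-of-memory error, lowering only `params.batch_size` on the command line leaves every other setting untouched.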

mizusawaok commented 3 months ago

I tried it. Attempt 1. I will try as many times as it takes.
@.***_docker:~/workspace/bracket_net$ python3 -m scripts.train model=up-causual-naive params.batch_size=25 params.num_epochs=50
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/j/workspace/bracket_net/scripts/train.py", line 8, in <module>
    from neural_astar.utils.data import MazeDataset
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/utils/data.py", line 11, in <module>
    from neural_astar.planner.differentiable_astar import AstarOutput
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/planner/__init__.py", line 1, in <module>
    from .astar import NeuralAstar, VanillaAstar
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/planner/astar.py", line 12, in <module>
    from . import encoder
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/planner/encoder.py", line 7, in <module>
    import segmentation_models_pytorch as smp
  File "/opt/conda/lib/python3.10/site-packages/segmentation_models_pytorch/__init__.py", line 2, in <module>
    from . import encoders
  File "/opt/conda/lib/python3.10/site-packages/segmentation_models_pytorch/encoders/__init__.py", line 1, in <module>
    import timm
  File "/opt/conda/lib/python3.10/site-packages/timm/__init__.py", line 3, in <module>
    from .models import create_model, list_models, list_pretrained, is_model, list_modules, model_entrypoint, \
  File "/opt/conda/lib/python3.10/site-packages/timm/models/__init__.py", line 1, in <module>
    from .beit import *
  File "/opt/conda/lib/python3.10/site-packages/timm/models/beit.py", line 49, in <module>
    from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
  File "/opt/conda/lib/python3.10/site-packages/timm/data/__init__.py", line 5, in <module>
    from .dataset import ImageDataset, IterableImageDataset, AugMixDataset
  File "/opt/conda/lib/python3.10/site-packages/timm/data/dataset.py", line 13, in <module>
    from .readers import create_reader
  File "/opt/conda/lib/python3.10/site-packages/timm/data/readers/__init__.py", line 1, in <module>
    from .reader_factory import create_reader
  File "/opt/conda/lib/python3.10/site-packages/timm/data/readers/reader_factory.py", line 3, in <module>
    from .reader_image_folder import ReaderImageFolder
  File "/opt/conda/lib/python3.10/site-packages/timm/data/readers/reader_image_folder.py", line 11, in <module>
    from timm.utils.misc import natural_key
  File "/opt/conda/lib/python3.10/site-packages/timm/utils/__init__.py", line 15, in <module>
    from .summary import update_summary, get_outdir
  File "/opt/conda/lib/python3.10/site-packages/timm/utils/summary.py", line 9, in <module>
    import wandb
  File "/opt/conda/lib/python3.10/site-packages/wandb/__init__.py", line 27, in <module>
    from wandb import sdk as wandb_sdk
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/__init__.py", line 25, in <module>
    from .artifacts.artifact import Artifact
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/artifacts/artifact.py", line 45, in <module>
    from wandb import data_types, env, util
  File "/opt/conda/lib/python3.10/site-packages/wandb/data_types.py", line 32, in <module>
    from .sdk.data_types import _dtypes
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/data_types/_dtypes.py", line 393, in <module>
    NumberType.types.append(np.float_)
  File "/home/j/.local/lib/python3.10/site-packages/numpy/__init__.py", line 397, in __getattr__
    raise AttributeError(
AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead.. Did you mean: 'float16'?


bottlenome commented 3 months ago

It looks like the submodules have not been downloaded. Please update the repository with the command below and then run the training command again.

git submodule update --recursive

mizusawaok commented 3 months ago

I tried it. Attempt 2.

@.***:~/workspace/bracket_net$ git submodule update --recursive
@.***:~/workspace/bracket_net$ ls
Dockerfile  LICENSE  bracket_net  data  docker2_run.sh  docker_run.sh  external  scripts  test
@.***:~/workspace/bracket_net$ ./docker2_run.sh
@.***_docker:~/workspace/bracket_net$ python3 -m scripts.train model=up-causual-naive params.batch_size=25 params.num_epochs=50
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/j/workspace/bracket_net/scripts/train.py", line 8, in <module>
    from neural_astar.utils.data import MazeDataset
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/utils/data.py", line 11, in <module>
    from neural_astar.planner.differentiable_astar import AstarOutput
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/planner/__init__.py", line 1, in <module>
    from .astar import NeuralAstar, VanillaAstar
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/planner/astar.py", line 12, in <module>
    from . import encoder
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/planner/encoder.py", line 7, in <module>
    import segmentation_models_pytorch as smp
  File "/opt/conda/lib/python3.10/site-packages/segmentation_models_pytorch/__init__.py", line 2, in <module>
    from . import encoders
  File "/opt/conda/lib/python3.10/site-packages/segmentation_models_pytorch/encoders/__init__.py", line 1, in <module>
    import timm
  File "/opt/conda/lib/python3.10/site-packages/timm/__init__.py", line 3, in <module>
    from .models import create_model, list_models, list_pretrained, is_model, list_modules, model_entrypoint, \
  File "/opt/conda/lib/python3.10/site-packages/timm/models/__init__.py", line 1, in <module>
    from .beit import *
  File "/opt/conda/lib/python3.10/site-packages/timm/models/beit.py", line 49, in <module>
    from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
  File "/opt/conda/lib/python3.10/site-packages/timm/data/__init__.py", line 5, in <module>
    from .dataset import ImageDataset, IterableImageDataset, AugMixDataset
  File "/opt/conda/lib/python3.10/site-packages/timm/data/dataset.py", line 13, in <module>
    from .readers import create_reader
  File "/opt/conda/lib/python3.10/site-packages/timm/data/readers/__init__.py", line 1, in <module>
    from .reader_factory import create_reader
  File "/opt/conda/lib/python3.10/site-packages/timm/data/readers/reader_factory.py", line 3, in <module>
    from .reader_image_folder import ReaderImageFolder
  File "/opt/conda/lib/python3.10/site-packages/timm/data/readers/reader_image_folder.py", line 11, in <module>
    from timm.utils.misc import natural_key
  File "/opt/conda/lib/python3.10/site-packages/timm/utils/__init__.py", line 15, in <module>
    from .summary import update_summary, get_outdir
  File "/opt/conda/lib/python3.10/site-packages/timm/utils/summary.py", line 9, in <module>
    import wandb
  File "/opt/conda/lib/python3.10/site-packages/wandb/__init__.py", line 27, in <module>
    from wandb import sdk as wandb_sdk
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/__init__.py", line 25, in <module>
    from .artifacts.artifact import Artifact
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/artifacts/artifact.py", line 45, in <module>
    from wandb import data_types, env, util
  File "/opt/conda/lib/python3.10/site-packages/wandb/data_types.py", line 32, in <module>
    from .sdk.data_types import _dtypes
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/data_types/_dtypes.py", line 393, in <module>
    NumberType.types.append(np.float_)
  File "/home/j/.local/lib/python3.10/site-packages/numpy/__init__.py", line 397, in __getattr__
    raise AttributeError(
AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead.. Did you mean: 'float16'?


bottlenome commented 3 months ago

This is fairly serious... I have a feeling it would work if you could downgrade the numpy version inside the Docker container, though.
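
The traceback shows NumPy being imported from /home/j/.local/lib/python3.10/site-packages, so the version that matters is the one visible from inside the container. A quick way to see whether that version is affected is sketched below. The helper names are made up, and the suggested `numpy<2` pin is an assumption based only on the error message (`np.float_` was removed in NumPy 2.0), not a fix confirmed in this thread.

```python
from importlib import metadata


def numpy_major(version):
    """Return the major component of a version string such as "2.0.0"."""
    return int(version.split(".")[0])


def needs_downgrade(version):
    """True if this NumPy version has removed the np.float_ alias (2.0 or later)."""
    return numpy_major(version) >= 2


def installed_numpy_version():
    """Return the installed NumPy version string, or None if NumPy is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_numpy_version()
    if found is None:
        print("numpy is not installed")
    elif needs_downgrade(found):
        print(f"numpy {found}: np.float_ is gone; try pinning numpy<2")
    else:
        print(f"numpy {found}: np.float_ is still available")
```

Run it with the same interpreter that runs the training script (here, `python3` inside the container), since the host and the container can see different installations.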

mizusawaok commented 3 months ago

I tried it with numpy 1.22.4.
@.***:~/workspace/bracket_net/scripts$ pip install numpy==1.22.4
Defaulting to user installation because normal site-packages is not writeable
Collecting numpy==1.22.4
  Downloading numpy-1.22.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.8/16.8 MB 9.9 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.0
    Uninstalling numpy-1.21.0:
      Successfully uninstalled numpy-1.21.0
Successfully installed numpy-1.22.4
@.***:~/workspace/bracket_net/scripts$ cd ..
@.***:~/workspace/bracket_net$ import numpy as np
@.***:~/workspace/bracket_net$ ls
Dockerfile  bracket_net  docker2_run.sh  external  test
LICENSE  data  docker_run.sh  scripts
@.***:~/workspace/bracket_net$ ./docker2_run.sh
@.***_docker:~/workspace/bracket_net$ python3 scripts/train.py
Traceback (most recent call last):
  File "/home/j/workspace/bracket_net/scripts/train.py", line 20, in <module>
    import bracket_net.domain.planning.gpt as gpt
ModuleNotFoundError: No module named 'bracket_net'
@.***_docker:~/workspace/bracket_net$ exit

bottlenome commented 3 months ago

The script is set up as a module, so please run it with python3 -m scripts.train.
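
The difference comes from how Python builds `sys.path`: running a file directly puts the file's own directory first, while `-m` puts the current directory first, so only the second form can import a top-level package that sits next to the scripts directory. The sketch below reproduces this in a throwaway directory; `demo_pkg` and `demo_scripts` are hypothetical stand-ins for `bracket_net` and `scripts`.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Build a throwaway project and run its script both ways.

    Returns (script_returncode, module_returncode). All names are hypothetical.
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # A top-level package, playing the role of bracket_net.
        (root / "demo_pkg").mkdir()
        (root / "demo_pkg" / "__init__.py").write_text("VALUE = 1\n")
        # A scripts directory, playing the role of scripts/train.py.
        (root / "demo_scripts").mkdir()
        (root / "demo_scripts" / "train.py").write_text("import demo_pkg\n")

        # Drop variables that would change how sys.path is built.
        env = {k: v for k, v in os.environ.items()
               if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}

        as_script = subprocess.run(
            [sys.executable, "demo_scripts/train.py"],
            cwd=root, env=env, capture_output=True,
        )
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_scripts.train"],
            cwd=root, env=env, capture_output=True,
        )
        return as_script.returncode, as_module.returncode


if __name__ == "__main__":
    script_rc, module_rc = run_both_ways()
    print("python demo_scripts/train.py  ->", script_rc)
    print("python -m demo_scripts.train ->", module_rc)
```

A non-zero return code for the first form corresponds to the `ModuleNotFoundError: No module named 'bracket_net'` seen in this thread.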

mizusawaok commented 3 months ago

I ran it.
@.***_docker:~/workspace/bracket_net$ python3 -m scripts.train
/home/j/workspace/bracket_net/scripts/train.py:89: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
  @hydra.main(config_path="config", config_name="train")
/opt/conda/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'train': Defaults list is missing _self_. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/j/workspace/bracket_net/scripts/train.py", line 94, in main
    train_loader = create_dataloader(
  File "/home/j/workspace/bracket_net/scripts/train.py", line 40, in create_dataloader
    dataset = AugumentedMazeDataset(filename, split, num_starts, magnification, size_max)
  File "/home/j/workspace/bracket_net/scripts/train.py", line 51, in __init__
    self.dataset = MazeDataset(filename, split, num_starts=num_starts)
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/utils/data.py", line 122, in __init__
    ) = self._process(filename)
  File "/opt/conda/lib/python3.10/site-packages/neural_astar/utils/data.py", line 128, in _process
    with np.load(filename) as f:
  File "/home/j/.local/lib/python3.10/site-packages/numpy/lib/npyio.py", line 407, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'data/planning-datasets/data/mpd/mazes_032_moore_c8.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
@.***_docker:~/workspace/bracket_net$


bottlenome commented 3 months ago

The submodule update still has not succeeded, so please run git submodule update --init --recursive. If it succeeds, data/planning-datasets will be populated.
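
Whether the submodule was actually populated can be checked before starting a training run. The sketch below looks for the one file named in the `FileNotFoundError` from this thread; the helper name `missing_files` is made up, and other models or configs may need additional files that are not listed here.

```python
from pathlib import Path

# Path taken from the FileNotFoundError reported in this thread.
REQUIRED_FILES = ["data/planning-datasets/data/mpd/mazes_032_moore_c8.npz"]


def missing_files(repo_root, required=REQUIRED_FILES):
    """Return the required files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing dataset files; was the submodule initialised?")
        for rel in missing:
            print("  ", rel)
    else:
        print("All required dataset files are present.")
```

Run it from the repository root, the same directory from which `python3 -m scripts.train` is started, because the dataset path in the error is relative.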

mizusawaok commented 3 months ago

Here are the results. 23/06/30
j@hp:~/workspace/bracket_net$ git submodule update --init --recursive
j@hp:~/workspace/bracket_net$ ls
Dockerfile  bracket_net  docker2_run.sh  external  scripts
LICENSE  data  docker_run.sh  outputs  test
j@hp:~/workspace/bracket_net$ ./docker2_run.sh
j@bn_docker:~/workspace/bracket_net$ python3 -m scripts.train
/home/j/workspace/bracket_net/scripts/train.py:89: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
  @hydra.main(config_path="config", config_name="train")
/opt/conda/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'train': Defaults list is missing _self_. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Number of Train Samples: 800
  Size: 32x32
Number of Validation Samples: 100
  Size: 32x32
Number of Test Samples: 100
  Size: 32x32
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA RTX A500 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
wandb: Currently logged in as: **** (****-). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ./wandb/run-20240629_005523-gp46sy1r
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run neural_astar-CNN-4
wandb: ⭐️ View project at https://wandb.ai/****-%20Inc./planning
wandb: 🚀 View run at https://wandb.ai/****-****%20Inc./planning/runs/gp46sy1r
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type         | Params
0 | planner       | NeuralAstar  | 391 K
1 | vanilla_astar | VanillaAstar | 9

391 K     Trainable params
18        Non-trainable params
391 K     Total params
1.566     Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]/home/j/.local/lib/python3.10/site-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3587.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Epoch 49: 100%|███████████████████████████████████████████████████████████████████| 80/80 [00:59<00:00, 1.36it/s, v_num=sy1r]
Trainer.fit stopped: max_epochs=50 reached.
Epoch 49: 100%|███████████████████████████████████████████████████████████████████| 80/80 [00:59<00:00, 1.36it/s, v_num=sy1r]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/j/workspace/bracket_net/scripts/train.py", line 188, in main
    trainer.test(module, test_loader)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 754, in test
    return call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 794, in _test_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 937, in _run
    _verify_loop_configurations(self)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 42, in _verify_loop_configurations
    __verify_eval_loop_configuration(model, "test")
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 110, in __verify_eval_loop_configuration
    raise MisconfigurationException(f"No {step_name}() method defined to run Trainer.{trainer_method}.")
lightning_fabric.utilities.exceptions.MisconfigurationException: No test_step() method defined to run Trainer.test.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
wandb: / 1.544 MB of 1.544 MB uploaded
wandb: Run history:
wandb: epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: metrics/h_mean ▁▃▇▇▄▅▅▆▆▄▇▇▄▅▆▅▅▄▆▄▆▅▅▆▇▅█▅▆▇▅▃▂▄▇█▅▆▅▇
wandb: metrics/p_exp ▁▅▅▅▃▆▇▅▆▄▇▇▅▃▄▄▅▅▅▄▇▆▆▆▆▄▆▃▆█▆▅▄▄▆▆▅▆▆▆
wandb: metrics/p_opt ▃▂█▇▅▃▃▇▅▅▅▅▃█▇▅▅▃▆▃▄▂▄▅▆▅▇▇▅▄▃▂▁▄▆▇▃▅▄▆
wandb: metrics/train_loss ▃▄▃▂▃▄▃▆▅▄▄▄▇▆▅▇▆▅█▅▇▅▆▂▅▆▄▄▂▃▃▅▆▃▅▅▄▅▂▁
wandb: metrics/val_loss ▃█▃▅▃▃▄▄▂█▆▆█▁▄▅▂█▆▆█▆▄▄▅▃▁▃▄▅▅█▄▃▅▅▅▄▃▆
wandb: trainer/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb: epoch 49
wandb: metrics/h_mean 0.55449
wandb: metrics/p_exp 0.4589
wandb: metrics/p_opt 0.702
wandb: metrics/train_loss 2.39557
wandb: metrics/val_loss 2.30378
wandb: trainer/global_step 3999
wandb:
wandb: 🚀 View run neural_astar-CNN-4 at: https://wandb.ai/********-****%20Inc./planning/runs/gp46sy1r
wandb: ⭐️ View project at: https://wandb.ai/********-****%20Inc./planning
wandb: Synced 6 W&B file(s), 0 media file(s), 1 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240629_005523-gp46sy1r/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
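
The run above trains for all 50 epochs and only fails afterwards, when `trainer.test` is called on a module that defines no `test_step` method. One possible way to avoid the crash is to check whether the module supplies its own `test_step` before starting the test phase. The sketch below uses plain Python stand-in classes (all names hypothetical) rather than PyTorch Lightning itself, so it only illustrates the check and is not the repository's actual fix.

```python
class BaseModule:
    """Stand-in for a framework base class whose test_step is only a placeholder."""

    def test_step(self, batch, batch_idx):
        raise NotImplementedError


class PlannerWithoutTest(BaseModule):
    """Defines no test_step, like the module in the traceback above."""


class PlannerWithTest(BaseModule):
    def test_step(self, batch, batch_idx):
        # A real implementation would compute and log test metrics here.
        return {"n_items": len(batch)}


def overrides_test_step(module, base=BaseModule):
    """True if the module's class supplies its own test_step."""
    return type(module).test_step is not base.test_step


for module in (PlannerWithoutTest(), PlannerWithTest()):
    if overrides_test_step(module):
        print(type(module).__name__, "-> safe to run the test phase")
    else:
        print(type(module).__name__, "-> skip the test phase or add test_step")
```

The other option is simply to add a `test_step` to the module that is missing one, which is the kind of source fix a pull request could contain.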

bottlenome commented 3 months ago

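1. train.yaml: this file contains various config settings for a run. Use it as a reference, try changing the config in different ways, and find the values at which training saturates.
2. model: this directory contains configs for various models. Try different models with model=<model name>.
3. Some of the models will probably not run. For those: 3.1 fix the source code, 3.2 create a feature branch with git and commit, 3.3 git push, and 3.4 create a pull request on GitHub.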

bottlenome commented 2 months ago

Are you perhaps downloading with git lfs every time? This month's bandwidth has been used up.

bottlenome commented 2 months ago

That does not really answer the question. For example, deleting the workspace and starting over from scratch would cause this. Every download made with git lfs pull counts against the bandwidth.
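
Before pulling again, it can help to check whether the large files are already present in the working tree. A file that Git LFS has not downloaded yet is a short text pointer whose first line is `version https://git-lfs.github.com/spec/v1`. The sketch below (helper names are made up) lists the files that are still pointers.

```python
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file is still a small Git LFS pointer, not the real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unfetched_lfs_files(root):
    """List files under root that are still LFS pointers."""
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and ".git" not in p.parts and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    for name in unfetched_lfs_files("."):
        print("not downloaded yet:", name)
```

If the list is empty, the data is already on disk and keeping the existing workspace avoids another download.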

mizusawaok commented 2 months ago

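> For example, deleting the workspace and starting over from scratch would cause this.
> Every download made with git lfs pull counts against the bandwidth.

@.***:~/workspace/bracket_net$ ./docker2_run.sh
I have run this many times. Could that be the cause?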


bottlenome commented 2 months ago

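I believe docker2_run.sh is a file that I did not create, so I cannot comment on it.

For now, you are working on the wrong place. My impression is that you are currently asking how to deal with individual errors, but first ask what each file and folder does, so that you understand the overall picture.

Then paste the result here and check with me whether it is correct.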

mizusawaok commented 2 months ago

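https://github.com/bottlenome/bracket_net/issues/4#issuecomment-2199572751
I am currently studying an illustrated introduction to Docker ("絵とき入門"). I will start over from the following comment of 24/7/1.

1. train.yaml: this file contains various config settings for a run. Use it as a reference, try changing the config in different ways, and find the values at which training saturates.
2. model: this directory contains configs for various models. Try different models with model=<model name>.
3. Some of the models will probably not run. For those: 3.1 fix the source code, 3.2 create a feature branch with git and commit, 3.3 git push, and 3.4 create a pull request on GitHub.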
