That is strange. Could you copy-paste the whole error backtrace?
On Sep 4, 2024, at 4:23 AM, turbosonics @.***> wrote:
I performed fine-tune training based on the MP pretrained model using 2 GPUs for a week (our local cluster has a 1-week wall-time limit). The following are the input file parameters for the fine-tune training.
model:
    chemical_species: 'Auto'
    cutoff: 5.0
    channel: 128
    is_parity: False
    lmax: 2
    num_convolution_layer: 5
    irreps_manual:
        - "128x0e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e"
    weight_nn_hidden_neurons: [64, 64]
    radial_basis:
        radial_basis_name: 'bessel'
        bessel_basis_num: 8
    cutoff_function:
        cutoff_function_name: 'XPLOR'
        cutoff_on: 4.5
    act_gate: {'e': 'silu', 'o': 'tanh'}
    act_scalar: {'e': 'silu', 'o': 'tanh'}
    conv_denominator: "avg_num_neigh"
    train_shift_scale: True
    train_denominator: True
    self_connection_type: 'linear'

train:
    train_shuffle: True
    random_seed: 123
    is_train_stress: True
    epoch: 5000
    loss: 'Huber'
    loss_param:
        delta: 0.01
    optimizer: 'adam'
    optim_param:
        lr: 0.002
    scheduler: 'ReduceLROnPlateau'
    scheduler_param:
        factor: 0.5
        patience: 100
    # best_metric: TotalLoss
    force_loss_weight: 2.0
    stress_loss_weight: 0.1
    per_epoch: 10
    error_record:
        - ['Energy', 'RMSE']
        - ['Force', 'RMSE']
        - ['Stress', 'RMSE']
        - ['Energy', 'MAE']
        - ['Force', 'MAE']
        - ['Stress', 'MAE']
        - ['Energy', 'Loss']
        - ['Force', 'Loss']
        - ['Stress', 'Loss']
        - ['TotalLoss', 'None']
    continue:
        reset_optimizer: True
        reset_scheduler: True
        reset_epoch: True
        checkpoint: '${PTMODEL_DIR}/${PTMODEL_NAME}'
        # Set True to use shift, scale, and avg_num_neigh from checkpoint (highly recommended)
        use_statistic_values_of_checkpoint: True

data:
    data_shuffle: True
    batch_size: 6
    data_divide_ratio: 0.2
    data_format: 'ase'
    load_dataset_path: ['${GEOMETRY_DIR}/${GEOMETRY_TRAIN_NAME}']
In this case, the "checkpoint" option points to the MP pretrained model file {SevenNet directory}/sevenn/pretrained_potentials/SevenNet_0__11July2024/checkpoint_sevennet_0.pth.
Now, the training job has been killed by the server due to the 1-week wall-time limit, so I tried to restart (or continue) this training with the following input script:
model:
    chemical_species: 'Auto'
    cutoff: 5.0
    channel: 128
    is_parity: False
    lmax: 2
    num_convolution_layer: 5
    irreps_manual:
        - "128x0e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e"
    weight_nn_hidden_neurons: [64, 64]
    radial_basis:
        radial_basis_name: 'bessel'
        bessel_basis_num: 8
    cutoff_function:
        cutoff_function_name: 'XPLOR'
        cutoff_on: 4.5
    act_gate: {'e': 'silu', 'o': 'tanh'}
    act_scalar: {'e': 'silu', 'o': 'tanh'}
    conv_denominator: "avg_num_neigh"
    train_shift_scale: True
    train_denominator: True
    self_connection_type: 'linear'

train:
    train_shuffle: True
    random_seed: 123
    is_train_stress: True
    epoch: 5000
    loss: 'Huber'
    loss_param:
        delta: 0.01
    optimizer: 'adam'
    optim_param:
        lr: 0.002
    scheduler: 'ReduceLROnPlateau'
    scheduler_param:
        factor: 0.5
        patience: 100
    # best_metric: TotalLoss
    force_loss_weight: 2.0
    stress_loss_weight: 0.1
    per_epoch: 10
    error_record:
        - ['Energy', 'RMSE']
        - ['Force', 'RMSE']
        - ['Stress', 'RMSE']
        - ['Energy', 'MAE']
        - ['Force', 'MAE']
        - ['Stress', 'MAE']
        - ['Energy', 'Loss']
        - ['Force', 'Loss']
        - ['Stress', 'Loss']
        - ['TotalLoss', 'None']
    continue:
        reset_optimizer: False
        reset_scheduler: False
        reset_epoch: False
        checkpoint: '${PTMODEL_DIR}/${PTMODEL_NAME}'
        # Set True to use shift, scale, and avg_num_neigh from checkpoint (highly recommended)
        use_statistic_values_of_checkpoint: True

data:
    data_shuffle: True
    batch_size: 6
    data_divide_ratio: 0.2
    data_format: 'ase'
    load_dataset_path: ['${GEOMETRY_DIR}/${GEOMETRY_TRAIN_NAME}']
In this case, the geometry files are the same, but I used {First_fine_tune_directory}/checkpoint_best.pth for the "checkpoint" option, and I changed three options (reset_optimizer, reset_scheduler, and reset_epoch) from True to False. Then I submitted the job under the exact same Slurm conditions.
Then, I faced "torch.distributed.elastic.multiprocessing.errors.ChildFailedError"
I can't understand this. The first training job to fine-tune the MP pretrained model worked with 2 GPUs. So how and why does the restart/continue job fail with almost the same input parameters and exactly the same Slurm settings? Am I doing something wrong when restarting/continuing the fine-tune training job?
I compiled SevenNet with CUDA 11.8, the prebuilt version of PyTorch 2.3.0 (which can be downloaded from https://pytorch.org/get-started/locally/), and Python 3.9 in a virtual environment.
Here are the full contents of the error file printed out by our server cluster:
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] entrypoint : sevenn
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] min_nodes : 2
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] max_nodes : 2
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] nproc_per_node : 1
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] run_id : 28673
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] rdzv_backend : c10d
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] rdzv_endpoint : 10.181.132.133
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] rdzv_configs : {'timeout': 900}
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] entrypoint : sevenn
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] min_nodes : 2
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] max_nodes : 2
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] nproc_per_node : 1
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] run_id : 28673
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] rdzv_backend : c10d
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] rdzv_endpoint : 10.181.132.133
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] rdzv_configs : {'timeout': 900}
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] max_restarts : 0
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] monitor_interval : 5
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] log_dir : /tmp/user1/torchelastic_ru2y787f
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] metrics_cfg : {}
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]
I0903 16:36:04.244791 23456247946112 torch/distributed/elastic/agent/server/api.py:866] [default] starting workers for entrypoint: sevenn
I0903 16:36:04.244893 23456247946112 torch/distributed/elastic/agent/server/api.py:699] [default] Rendezvous'ing worker group
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] max_restarts : 0
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] monitor_interval : 5
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] log_dir : /tmp/user1/torchelastic_y3po10lw
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] metrics_cfg : {}
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]
I0903 16:36:05.244767 23456247946112 torch/distributed/elastic/agent/server/api.py:866] [default] starting workers for entrypoint: sevenn
I0903 16:36:05.245002 23456247946112 torch/distributed/elastic/agent/server/api.py:699] [default] Rendezvous'ing worker group
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] restart_count=0
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] master_addr=gpunode123.gpuclstr.hpc.net
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] master_port=46677
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] group_rank=0
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] group_world_size=2
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] local_ranks=[0]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] role_ranks=[0]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] global_ranks=[0]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] role_world_sizes=[2]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] global_world_sizes=[2]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]
I0903 16:36:06.451587 23456247946112 torch/distributed/elastic/agent/server/api.py:707] [default] Starting worker group
I0903 16:36:06.451759 23456247946112 torch/distributed/elastic/agent/server/local_elastic_agent.py:168] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0903 16:36:06.451946 23456247946112 torch/distributed/elastic/multiprocessing/api.py:263] log directory set to: /tmp/user1/torchelastic_ru2y787f/28673_4b88ipix
I0903 16:36:06.452133 23456247946112 torch/distributed/elastic/multiprocessing/api.py:358] Setting worker0 reply file to: /tmp/user1/torchelastic_ru2y787f/28673_4b88ipix/attempt_0/0/error.json
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] restart_count=0
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] master_addr=gpunode123.gpuclstr.hpc.net
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] master_port=46677
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] group_rank=1
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] group_world_size=2
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] local_ranks=[0]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] role_ranks=[1]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] global_ranks=[1]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] role_world_sizes=[2]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] global_world_sizes=[2]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]
I0903 16:36:06.455331 23456247946112 torch/distributed/elastic/agent/server/api.py:707] [default] Starting worker group
I0903 16:36:06.455811 23456247946112 torch/distributed/elastic/agent/server/local_elastic_agent.py:168] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0903 16:36:06.455995 23456247946112 torch/distributed/elastic/multiprocessing/api.py:263] log directory set to: /tmp/user1/torchelastic_y3po10lw/28673_mdw_u3mo
I0903 16:36:06.456144 23456247946112 torch/distributed/elastic/multiprocessing/api.py:358] Setting worker0 reply file to: /tmp/user1/torchelastic_y3po10lw/28673_mdw_u3mo/attempt_0/0/error.json
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/bin/sevenn", line 8, in <module>
[rank0]: sys.exit(main())
[rank0]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/main/sevenn.py", line 85, in main
[rank0]: train(global_config, working_dir)
[rank0]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/train.py", line 57, in train
[rank0]: state_dicts, start_epoch, init_csv = processing_continue(config)
[rank0]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 75, in processing_continue
[rank0]: check_config_compatible(config, config_cp)
[rank0]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 49, in check_config_compatible
[rank0]: raise ValueError(
[rank0]: ValueError: reset optimizer and scheduler if you want to change trainable configs
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/bin/sevenn", line 8, in <module>
[rank1]: sys.exit(main())
[rank1]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/main/sevenn.py", line 85, in main
[rank1]: train(global_config, working_dir)
[rank1]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/train.py", line 57, in train
[rank1]: state_dicts, start_epoch, init_csv = processing_continue(config)
[rank1]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 75, in processing_continue
[rank1]: check_config_compatible(config, config_cp)
[rank1]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 49, in check_config_compatible
[rank1]: raise ValueError(
[rank1]: ValueError: reset optimizer and scheduler if you want to change trainable configs
E0903 16:36:16.466846 23456247946112 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 891499) of binary: sevenn
E0903 16:36:16.470759 23456247946112 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 910583) of binary: sevenn
I0903 16:36:16.473012 23456247946112 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 0)
Traceback (most recent call last):
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sevenn FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-03_16:36:16
host : gpunode123.gpuclstr.hpc.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 891499)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I0903 16:36:16.476574 23456247946112 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 1)
Traceback (most recent call last):
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sevenn FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-03_16:36:16
host : gpunode127.gpuclstr.hpc.net
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 910583)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: gpunode127: task 1: Exited with exit code 1
srun: error: gpunode123: task 0: Exited with exit code 1
The geometry I used is a SevenNet graph data file. The geometry was originally calculated from VASP AIMD; I converted it to extxyz, then converted that to a SevenNet graph to use in this training.
I tried both 10000-image and 20000-image cases, and both showed the same symptom: the first fine-tune training (based on the MP pretrained model) runs well with 2 GPUs for a week, but the restarted/continued second fine-tune training crashes with the same ChildFailedError.
The SevenNet I currently have compiled in the virtual environment is the July 23, 2024 version. Is there any updated version since then?
Also, do you think a custom-built PyTorch, instead of the prebuilt one, would help resolve this error?
It is a bug in SevenNet. The part below
[rank0]: File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 49, in check_config_compatible
[rank0]: raise ValueError(
[rank0]: ValueError: reset optimizer and scheduler if you want to change trainable configs
should not raise an error as you didn't change your trainable configs (train_denominator, train_shift_scale).
I will prepare a fix for this case. Sorry for wasting your time. I'll let you know if there is a quick bypass for this problem that preserves your intent.
If you don't mind, you may set reset_optimizer=True and reset_scheduler=True to force the training restart.
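For reference, the continue block of the restart input would then look roughly like this (a sketch based on the config you posted above; the checkpoint path is your own placeholder):

    continue:
        checkpoint: '{First_fine_tune_directory}/checkpoint_best.pth'
        reset_optimizer: True     # workaround: start from a fresh optimizer state
        reset_scheduler: True     # workaround: start from a fresh scheduler state
        reset_epoch: False        # keep counting epochs from the checkpoint
        use_statistic_values_of_checkpoint: True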
I just updated the main branch with a fix for this issue. It seems good on my machine.
If you're not familiar with git, you can update SevenNet like this:
cd {SevenNet folder}
git fetch
git pull
pip install .
Then you should be able to continue your training.
The bad news is that the config train_denominator=True was silently ignored. It will work as intended from now on. However, in my experience, a fully converged potential gives almost the same accuracy regardless of train_denominator or train_shift_scale. Sorry for the bug anyway.
I updated SevenNet, changed reset_optimizer and reset_scheduler to True for the restart/continuation job, and submitted it. The restart/continuation job runs fine so far. Thanks! I will report back if something comes up again.
I performed fine-tune training based on the MP pretrained model using 2 GPUs for a week (our local cluster has a 1-week wall-time limit). The following are the input file parameters for the fine-tune training.
In this case, the "checkpoint" option points to the MP pretrained model file {SevenNet_directory}/sevenn/pretrained_potentials/SevenNet_0__11July2024/checkpoint_sevennet_0.pth.
Now, the training job has been killed by the server due to the 1-week wall-time limit, so I tried to restart (or continue) this training with the following input script:
In this case, the geometry files are the same ones used for the original first fine-tuning training, but this time I used {First_fine_tune_directory}/checkpoint_best.pth for the "checkpoint" option, because it is a continuation of the first fine-tuning training.
I also changed three options (reset_optimizer, reset_scheduler, and reset_epoch) from True to False, then submitted the job under the exact same Slurm conditions.
The command to execute SevenNet is the same for the original first fine-tuning job and the continuing fine-tuning job:
srun torchrun --nnodes 2 --nproc_per_node 1 --rdzv_endpoint=$head_node_ip --rdzv_id=$RANDOM --rdzv_backend=c10d --no_python sevenn input.yaml -d
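For context, $head_node_ip is derived inside the Slurm batch script, roughly following the standard PyTorch multi-node torchrun pattern (a sketch; the #SBATCH directives shown here are illustrative, not my exact values):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1

# Pick the first allocated node as the c10d rendezvous head node
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
head_node=$(echo "$nodes" | head -n 1)
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# ... followed by the srun torchrun command shown above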
I also loaded the same modules and ran from the same virtual environment for both the first fine-tuning training job and the continuing fine-tuning training job:
I used the same commands to avoid a rendezvous crash for both jobs:
Then I faced "torch.distributed.elastic.multiprocessing.errors.ChildFailedError". The actual error file is too long to copy here...
I can't understand this. The first training job to fine-tune the MP pretrained model worked with 2 GPUs. So how and why does the restart/continue job fail with almost the same input parameters and exactly the same Slurm settings? Am I doing something wrong in the input file for this restart/continuation of the fine-tune training job?
CPU memory cannot be an issue because I requested 496 GB (the whole memory of the node) on both GPU nodes. Since the original first fine-tune training ran normally, this shouldn't be a GPU memory issue either.
Do I need to use the same GPU nodes for the continuing fine-tune training that I used for the first fine-tune training? I just tried the same GPU nodes as the first fine-tuning training, and it still crashed with the same ChildFailedError.
I compiled SevenNet with CUDA 11.8, the prebuilt version of PyTorch 2.3.0 (which can be downloaded from https://pytorch.org/get-started/locally/), and Python 3.9 in a virtual environment.