NWC-CUAHSI-Summer-Institute / LGAR-py

LGAR in python/torch
MIT License

Added Distributed Data Parallel to dpLGAR #15

Closed taddyb closed 1 year ago

taddyb commented 1 year ago

What was done:

Tagged issues:

Steps to run the code:

Terminal:

torchrun --nproc_per_node=1 --master_port=47800 dpLGAR/__main__.py ++nproc=1 ++save_name=debug_inplace
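
For context, torchrun starts one process per requested worker and hands each its identity through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). A minimal sketch of how an entry point might read them is below. The helper name read_torchrun_env is hypothetical and is not taken from dpLGAR; the fallback values are assumptions chosen so the script can also be started without torchrun.

```python
import os


def read_torchrun_env(env=None):
    """Collect the process-identity variables that torchrun exports.

    Hypothetical helper for illustration only, not part of dpLGAR.
    Fallbacks describe a single-process run started without torchrun.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index of this process on its node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
        "master_port": int(env.get("MASTER_PORT", 29500)),  # rendezvous port
    }


if __name__ == "__main__":
    print(read_torchrun_env())
```

A script would typically pass these values on to torch.distributed.init_process_group before wrapping the model in DistributedDataParallel; whether dpLGAR does exactly this is not shown in this thread.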

Pycharm:

script path: /home/tkb5476/anaconda3/pkgs/pytorch-1.13.1-gpu_cuda113py38hde3f150_1/bin/torchrun
parameters: --nproc_per_node=2 --master_port=47760 /mnt/sdb1/tkb5476/dpLGAR/dpLGAR/__main__.py ++nproc=2

taddyb commented 1 year ago

The code runs, but it produces no runoff. I'll have to dig into which time periods we can actually train on.

taddyb commented 1 year ago

The in-place operation error was caused by the error_check() function. Torch treated the check for NaN in the result as an in-place modification of the tensor, so it threw an error.

I deleted the function, and DDP is now working.
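
The body of error_check() is not shown in this thread, so the following is only a sketch of the general failure mode, not the original function. It assumes a hypothetical checker that writes into the tensor (zero_nans_in_place) and contrasts it with a read-only check (has_nan). Autograd records a version counter on tensors it saves for the backward pass, so any write to such a tensor makes backward() raise a RuntimeError, while a check that only reads leaves the graph intact.

```python
import torch


def has_nan(t: torch.Tensor) -> bool:
    """Read-only NaN check: builds a new boolean mask and never writes to `t`."""
    return bool(torch.isnan(t.detach()).any())


def zero_nans_in_place(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical checker that also repairs: masked assignment writes into `t`."""
    t[torch.isnan(t)] = 0.0
    return t


def backward_succeeds(check) -> bool:
    """Run `check` on an intermediate tensor, then try to backpropagate."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()  # exp() saves its output for use in the backward pass
    check(y)
    try:
        y.sum().backward()
        return True
    except RuntimeError:
        # Autograd detected that a saved tensor was modified in place.
        return False


read_only_ok = backward_succeeds(has_nan)
in_place_ok = backward_succeeds(zero_nans_in_place)
```

If a NaN check is wanted again later, a read-only form like has_nan is one option that should not interfere with autograd or DDP.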

taddyb commented 1 year ago

Single-process runs are working too. Merging this code into main.