Added Distributed Data Parallel to dpLGAR

taddyb commented 1 year ago

What was done:

To run dpLGAR over long time periods, I added PyTorch Distributed Data Parallel (DDP) to split the model data across multiple CPU cores: https://pytorch.org/tutorials/distributed/home.html#learn-ddp
This should allow the code to run over many basins at the same time.
Created a separate agent to run the single basin case
Created a script to train all basins in parallel
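
As a rough illustration of the idea (not the code in this PR), here is a minimal sketch of how basins could be divided among the processes that torchrun launches. torchrun exports `RANK` and `WORLD_SIZE` as environment variables for each process. The helper names and the basin IDs below are hypothetical.

```python
import os


def rank_and_world_size():
    """Read the rank and world size that torchrun exports for each process.

    Falls back to a single-process setup when the variables are not set.
    """
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return rank, world_size


def basins_for_rank(basin_ids, rank, world_size):
    """Return the strided subset of basins handled by one rank (hypothetical helper)."""
    return basin_ids[rank::world_size]


# Hypothetical basin IDs, used only for illustration.
all_basins = ["basin_a", "basin_b", "basin_c", "basin_d", "basin_e"]

rank, world_size = rank_and_world_size()
my_basins = basins_for_rank(all_basins, rank, world_size)
print(f"rank {rank} of {world_size} handles {my_basins}")
```

With two processes, rank 0 would take every other basin starting from the first and rank 1 the rest, so each basin is handled by exactly one process.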

Tagged issues:

Validation of dpLGAR: https://github.com/NWC-CUAHSI-Summer-Institute/dpLGAR/issues/8

Steps to run the code:

Terminal:

torchrun --nproc_per_node=1 --master_port=47800 dpLGAR/__main__.py ++nproc=1 ++save_name=debug_inplace

PyCharm:

script path: /home/tkb5476/anaconda3/pkgs/pytorch-1.13.1-gpu_cuda113py38hde3f150_1/bin/torchrun
parameters: --nproc_per_node=2 --master_port=47760 /mnt/sdb1/tkb5476/dpLGAR/dpLGAR/__main__.py ++nproc=2

taddyb commented 1 year ago

The code runs but doesn't produce any runoff. I'll have to look into which time periods we can actually train on.

taddyb commented 1 year ago

The in-place operation error was caused by the error_check() function. Torch treated the NaN check on the result as an in-place modification of the tensor, so it threw an error.

I deleted the function, and DDP is now working.
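
For context, here is a minimal sketch of how this class of autograd error arises in general. It does not reproduce error_check(), whose body is not shown here. The `has_nan` helper is hypothetical and shows one way to check for NaN without writing to the tensor.

```python
import torch


def has_nan(t):
    """Out-of-place NaN check: torch.isnan returns a new tensor and leaves t untouched."""
    return bool(torch.isnan(t).any())


# Case 1: modify a tensor in place after autograd saved it for the backward pass.
x = torch.ones(3, requires_grad=True)
y = x.exp()   # exp keeps its output to compute the gradient later
y.add_(1)     # in-place edit of that saved output
try:
    y.sum().backward()
    inplace_failed = False
except RuntimeError:
    # autograd detects that a tensor it needs was changed in place
    inplace_failed = True

# Case 2: the same computation written out of place, with a NaN check in between.
x2 = torch.ones(3, requires_grad=True)
z = x2.exp()
nan_found = has_nan(z)  # does not modify z
z = z + 1               # creates a new tensor instead of editing z in place
z.sum().backward()      # succeeds
```

If a check like this is needed again, keeping it out of place (or running it on a detached copy) avoids touching tensors that autograd still needs.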

taddyb commented 1 year ago

Single-process mode is working too. Merging this code into main.
