USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal

Torch and GraphWaveNet Integration #163

Closed SimonTopp closed 2 years ago

SimonTopp commented 2 years ago

This PR contains updates to make River-dl more modular and capable of training/evaluating PyTorch models in addition to TensorFlow. Significant additions/changes include:

I've tested the updates locally for RGCN (TF and PyTorch) and GraphWaveNet, but I still need to test on the TG GPU. I'll do this final testing ASAP; it should only affect environment.yml, if anything.

jdiaz4302 commented 2 years ago

Some notes/documentation for the PyTorch RGCNs...

Example minimum usage:

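import numpy as np
import torch
from river_dl.RGCN_v1 import RGCN_v1

# One batch of random inputs: [segments, sequence length, input features]
data = torch.rand([455, 365, 16])
# Random stand-in for the [455, 455] adjacency matrix
A = np.random.normal(size=[455, 455])

model = RGCN_v1(input_dim=16,
                hidden_dim=20,
                adj_matrix=A,
                recur_dropout=0,
                dropout=0)

# Sequence of predictions plus the last hidden (h) and cell (c) states
out, (h, c) = model(data)

def count_parameters(model):
    """Count the trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(model)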

RGCN_v1 (paper equations) and RGCN_v0 (river-dl TensorFlow code) can be used interchangeably. When switching between the two at a given hidden dimension size, you will use fewer parameters with RGCN_v1 (the count is displayed by the last line of code above). For that example, RGCN_v0 has 5461 parameters while v1 has 3401, which is a reduction of about 38%.
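For reference, the size of that change can be worked out directly from the two counts quoted above. This is only arithmetic on the reported totals; `relative_reduction` is a helper name made up for this note, not something in river-dl.

```python
def relative_reduction(old, new):
    """Fraction by which `new` is smaller than `old`."""
    return (old - new) / old


# Parameter counts reported above for hidden_dim = 20, input_dim = 16
v0_params = 5461
v1_params = 3401

print(f"{relative_reduction(v0_params, v1_params):.1%}")
```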

[Image: RGCN_ParamDiff]

The example uses random data of size [455, 365, 16] to represent one batch from the river-dl data associated with the [455, 455] adjacency matrix.

I have these models coded to output the sequence of predictions along with the last h and c state. This can be easily modified to output the list of h and c states if that is preferred. Also, we can just output the sequence of predictions if that is more compatible with the GWN and you're not interested in states or DA.

Also, that code hasn't been tested with the pipeline. It's just workflow-agnostic model code that may need a change or two (if those changes are found/stated, I don't mind making them).

SimonTopp commented 2 years ago

EDIT: Solved the issue below. We need to point to additional paths using updated versions; thanks @janetrbarclay for putting me on the right path 😉

module load cuda11.3/toolkit/11.3.0
source activate rdl_torch_tf
export LD_LIBRARY_PATH=/cm/shared/apps/nvidia/TensorRT-6.0.1.5/lib:/cm/shared/apps/nvidia/cudnn_8.0.5/lib64:/cm/local/apps/cuda/libs/current/lib64:$LD_LIBRARY_PATH

(Initial Post) Thanks for that @jdiaz4302! Also, I'm just testing everything on the TG GPUs. I was hoping that since TG added cuda 11.3 modules we could run everything with the current versions of TF and PyTorch. For the PyTorch pipelines, everything works as expected with Python 3.9, PyTorch 1.10, and cudatoolkit 11.3. But, when we try to train the TF models in the same environment with TF 2.7 (current version) we get errors about not being able to find the correct GPU libraries (see below). What do you think, is it important that we have a single environment for River-dl? Is it ok to have a PyTorch environment and then maintain the old TF 2.1/python 3.5/cudatoolkit 10.0 environment for TF runs? Also, just to clarify, the TF pipelines run fine on CPU in the new joint environment, just not on GPU.

2022-01-26 10:52:09.910551: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/shared/apps/cuda11.3/toolkit/11.3.0/targets/x86_64-linux/lib:/cm/shared/apps/slurm/18.08.8/lib64/slurm:/cm/shared/apps/slurm/18.08.8/lib64:/cm/local/apps/gcc/8.2.0/lib:/cm/local/apps/gcc/8.2.0/lib64:/cm/local/apps/cuda/libs/current/lib64
2022-01-26 10:52:09.910667: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries.
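A quick way to see which directories on LD_LIBRARY_PATH (if any) actually contain the library named in the warning is sketched below. It is a minimal stdlib-only check; `find_on_library_path` is a hypothetical helper, and it looks only at LD_LIBRARY_PATH, whereas the dynamic loader also consults other locations (for example the system linker cache), so an empty result is a hint rather than proof.

```python
import os
from pathlib import Path


def find_on_library_path(lib_name, library_path=None):
    """Return every directory on the search path that contains lib_name.

    library_path defaults to the current LD_LIBRARY_PATH; pass a string
    explicitly to test a candidate path before exporting it.
    """
    if library_path is None:
        library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in library_path.split(os.pathsep):
        if not entry:
            continue  # skip empty segments such as a trailing ':'
        if (Path(entry) / lib_name).exists():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # libcudnn.so.8 is the library named in the TF warning above
    dirs = find_on_library_path("libcudnn.so.8")
    print(dirs if dirs else "libcudnn.so.8 not found on LD_LIBRARY_PATH")
```

Running this inside the Slurm job, after the `module load` and `export` lines, shows whether the exported directories are the ones that hold the cuDNN 8 library.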

janetrbarclay commented 2 years ago

FWIW, a while back (July '21) I was having issues with cuda libraries on TG that were resolved by explicitly specifying the library path. In my case it involved adding export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cm/local/apps/cuda/libs/current/lib64 to the Slurm file after activating the conda environment. Not sure if that applies here or not.

SimonTopp commented 2 years ago

@janetrbarclay, yup! I've been doing that here; the run issues occur even with the path exported. I don't really understand why it works with the old environment setup and not the new one, given that I think all the requirements are met. You're totally right that the error is very similar to what you get if you don't export the path, though.

SimonTopp commented 2 years ago

Ok, everything works on TG GPU. @jdiaz4302, fyi I updated the two PyTorch RGCN models so that you can return just the output, or the output plus the h and c states. This just helps them run with the model-agnostic snakemake workflows. Other than that, everything fits together nicely.
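As a rough illustration of that pattern (not the actual river-dl code), the toy class below switches between the two return shapes with a constructor flag. `ToyRecurrentModel`, the `return_states` flag name, and the update rule are all made up for this sketch; the real models are PyTorch modules.

```python
class ToyRecurrentModel:
    """Stand-in for a recurrent model; NOT the river-dl RGCN."""

    def __init__(self, return_states=False):
        # Hypothetical flag: when True, also return the last (h, c)
        self.return_states = return_states

    def forward(self, sequence):
        h, c = 0.0, 0.0
        outputs = []
        for x in sequence:
            c = 0.5 * c + x  # toy cell-state update
            h = 0.5 * c      # toy hidden-state update
            outputs.append(h)
        if self.return_states:
            return outputs, (h, c)
        return outputs


# A workflow that only needs predictions can ignore states entirely
preds = ToyRecurrentModel().forward([1.0, 2.0])

# A caller that wants the states (e.g. for DA) opts in
preds_with_states, (h_last, c_last) = ToyRecurrentModel(return_states=True).forward([1.0, 2.0])
```

Defaulting the flag to the output-only form keeps the call signature the same as for models that have no states to return.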

jdiaz4302 commented 2 years ago

Nice, glad it was pretty seamless! I'd be interested in any results with the smaller RGCN (e.g., results look the same as the bigger RGCN, it does/doesn't help with the gw NaN issue, etc.)

I'm reviewing something for Jake this afternoon, but I think I can review this tomorrow

SimonTopp commented 2 years ago

@jsadler2, thanks for the feedback! At first glance I think I agree with all your suggestions. I'll spend some time tomorrow morning addressing them.

jdiaz4302 commented 2 years ago

Hey, Simon, great job! I think the PyTorch work all looks good, and thank you for adding those citations to the models; I went ahead and added some documentation for the dropout and recurrent dropout arguments.

I think you and Jeff have a better grasp on the whole pipeline, so I don't have much to add there, but I was wondering if there's any check/assurance that the torch you install is built for the correct CUDA version? On PyTorch's main page, they provide you with a command for the proper installation when you specify your OS, CUDA, package manager, etc., but I don't see that command reflected here (see below). Could you maybe be using the default PyTorch build for CUDA 10.2, which is supposedly what pip3 install torch torchvision torchaudio gives you (and maybe having slower code or loss of functionality)?

[Image: pytorch_install]

SimonTopp commented 2 years ago

@jdiaz4302, I was worried about that as well since PyTorch doesn't give any guidance on using an environment YAML file, but I'm pretty sure that if you install using conda env create -f environment.yaml after loading the cuda modules, then it all installs correctly. When I built the environment this morning it installed with the following: [image] [image] [image]

With that said, we should probably update the readme to specify loading modules and exporting the necessary cuda paths.
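One way to double-check the result from inside the activated environment is to ask torch which CUDA version it was built against. The sketch below uses `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`; the wrapper `describe_torch_build` is a hypothetical helper, and it returns early if torch is not installed at all.

```python
import importlib.util


def describe_torch_build():
    """Summarize the CUDA build of the installed torch package, if any."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version torch was compiled for; None on CPU-only builds
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible right now
        "gpu_visible": torch.cuda.is_available(),
    }


print(describe_torch_build())
```

If `built_for_cuda` does not match the loaded cuda module, the environment probably picked up a different build than intended.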

jdiaz4302 commented 2 years ago

Awesome, it makes sense that maybe some sort of inspection goes on to install the correct version, and agreed on the readme

SimonTopp commented 2 years ago

I'll go ahead and update the readme and include it in this PR. We also need to update the docker image and docker readme, but given how big this PR is, I think we should do that separately.