USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal
21 stars 15 forks source link

Lambdas parameter for RGCN on HPC GPU #110

Closed SimonTopp closed 2 years ago

SimonTopp commented 3 years ago

https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/Snakefile#L58

Currently config.yml defines lambdas as a list, but when calling train_model in the snakefile for HPC GPU it tries to pull a single "lamb" argument from config.yml. There are potentially problems changing the "lamb" in train_model to "lambdas" because the RGCN isn't expecting a list. It's also unclear if the lambdas argument is used in defining the loss function in train_model.py.

jsadler2 commented 3 years ago

Thanks Simon for making this issue.

lambdas should be used in the definition of the loss function that is used when the model is compiled on this line: https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/river_dl/train_model.py#L94

So we need to just pass along that variable.

We will need to think about how lamdas which is a list is passed over the commandline. I think this will work: https://stackoverflow.com/a/43786290/5811195

SimonTopp commented 3 years ago

Ok, I think I understand. Although in looking at this deeper it seems like maybe there's a better way to do this. If I understand the code we're defining the loss function once in the at line 12 in the Snakefile https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/Snakefile#L12 but then we don't pass that to the command line call for train_model, instead we re-define it on line 62. https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/Snakefile#L62 This seems a little error prone if someone doesn't notice and only changes it in one place. Does it make more sense to change the upstream definition/pass it through to avoid confusion? Also, apologies if I'm not looking at this right.

janetrbarclay commented 3 years ago

A related question - I add 2 additional lambdas with the GW loss function. Should I add these to this list of lambdas or keep them as separate input terms?

jsadler2 commented 3 years ago

These are really good/tough questions.

@SimonTopp - yes you are exactly right that there are two places where the loss function is currently defined in the Snakefile. I like the idea of just having one place where it is defined as you suggested. I'm not sure how that would work, though.

The hang up is that commandline interface. That restricts us to passing a string as an argument. Whereas, currently, the train_model function https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/river_dl/train.py#L23 takes as an argument loss_function which is a python function. So I see two possible modifications:

There may be some other alternative and if you think of one, feel free to suggest it.

jsadler2 commented 3 years ago

@janetrbarclay - a lot of it depends on if you are using CPU or GPU. If you are using CPU, you would just define the loss function (with the new lambdas) in your Snakefile and pass that function to the train_model as the loss_function argument.

If you are using GPU's on the other hand, you will need to pass everything you will need to define your loss function through the CLI. This is a bit tedious because then you have to create a corresponding commandline variable for each of those. I see you've already done that with your new lambdas https://github.com/USGS-R/river-dl/blob/7a92369e57e27a34ca6ff99fc4b216b6bc179874/river_dl/train_model.py#L62. If we stick with passing the loss_function as a function (not a string), you will also need to handle the temp_mean, temp_sd, gw_mean, gw_std arguments too, since the loss function has to be defined before calling train_model (the function - not to be confused with train_model.py)

janetrbarclay commented 3 years ago

I guess I was thinking more conceptually - do we want to have the GW lambdas (when used) appended to the existing list of lambdas or kept separate? Maybe that is better answered once the question of where / how to define the loss function is answered?


Janet Barclay U.S. Geological Survey New England Water Science Center Connecticut Office 101 Pitkin St. East Hartford, CT 06108

Phone (office) 860 291-6763 Fax 860 291-6799 Email @.**@*.**@*.***> https://www.usgs.gov/staff-profiles/janet-barclay


From: Jeff Sadler @.> Sent: Tuesday, June 15, 2021 2:54 PM To: USGS-R/river-dl @.> Cc: Barclay, Janet R @.>; Mention @.> Subject: [EXTERNAL] Re: [USGS-R/river-dl] Lambdas parameter for RGCN on HPC GPU (#110)

This email has been received from outside of DOI - Use caution before clicking on links, opening attachments, or responding.

@janetrbarclayhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fjanetrbarclay&data=04%7C01%7Cjbarclay%40usgs.gov%7C90a3beab5ebb44088f1208d9302f16d1%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637593801086554897%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=v78lLOH8R3PRkklsEXqnuUdfwvQK7IraS%2BRjnEIXPuY%3D&reserved=0 - a lot of it depends on if you are using CPU or GPU. If you are using CPU, you would just define the loss function (with the new lambdas) in your Snakefile and pass that function to the train_model as the loss_function argument.

If you are using GPU's on the other hand, you will need to pass everything you will need to define your loss function through the CLI. This is a bit tedious because then you have to create a corresponding commandline variable for each of those. I see you've already done that with your new lambdas https://github.com/USGS-R/river-dl/blob/7a92369e57e27a34ca6ff99fc4b216b6bc179874/river_dl/train_model.py#L62https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FUSGS-R%2Friver-dl%2Fblob%2F7a92369e57e27a34ca6ff99fc4b216b6bc179874%2Friver_dl%2Ftrain_model.py%23L62&data=04%7C01%7Cjbarclay%40usgs.gov%7C90a3beab5ebb44088f1208d9302f16d1%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637593801086554897%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HB3%2FCkxDHsX6Ke%2B0Kaknd9LGH7TlE%2FznRZE%2B7G%2B2jU8%3D&reserved=0. If we stick with passing the loss_function as a function (not a string), you will also need to handle the temp_mean, temp_sd, gw_mean, gw_std arguments too, since the loss function has to be defined before calling train_model (the function - not to be confused with train_model.py)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FUSGS-R%2Friver-dl%2Fissues%2F110%23issuecomment-861752429&data=04%7C01%7Cjbarclay%40usgs.gov%7C90a3beab5ebb44088f1208d9302f16d1%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637593801086564852%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Xa1nmXQiWNDsJs8vTYXq3%2Fk0k%2Fc1GeTjCasy%2BHsKJZ8%3D&reserved=0, or unsubscribehttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA5H7UASKPN56NQZ7UFTJG3TS6OYDANCNFSM46VNTZIA&data=04%7C01%7Cjbarclay%40usgs.gov%7C90a3beab5ebb44088f1208d9302f16d1%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637593801086564852%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=QXjL5Qd3BEvTaTunkmiPjklayDsxQ%2BGaBxemYWuVI%2Fs%3D&reserved=0.

SimonTopp commented 3 years ago

@jsadler2 I think alternative 2, passing a str to train.py, makes more sense as it avoids the need to comment out code between GPU and CPU, that might also make Janet's question more straightforward because in the CLI we could just pass the lambdas in as a list with nargs='+' to be an arbitrary number of lambdas. Does that make sense?

jsadler2 commented 3 years ago

@janetrbarclay -

Maybe that is better answered once the question of where / how to define the loss function is answered?

I think that's the key question. If we define it in the Snakefile then we can have less "project-specific" functionality in the centralriver-dl code base (i.e., we won't have to handle all of the eventual variations in train.py).

@SimonTopp

I think alternative 2, passing a str to train.py, makes more sense as it avoids the need to comment out code between GPU and CPU

Actually that str approach does not change the need to switch between the two rules for gpu/cpu. If we want to make that move, we could and just use the CLI for both. That would be another way of addressing Janet's suggestion in #113

SimonTopp commented 3 years ago

@jsadler2 Sorry, my comment wasn't very clear. The str approach would make it so the loss function is only defined once in any given run, whether it's in train or train_model_local_or_cpu. That way it's at least defined in a similar place between the two rather than up top for one and in the rule for the other (in my head I comment out the first definition of loss function when using GPU even though you don't actually have to).

janetrbarclay commented 3 years ago

@jsadler2 @SimonTopp This is a tough one. It seems the more we can keep the project / run - specific modifications to the outer files (config.yml > snakefile > river_dl code base) the better. Passing in a function makes this a little easier since some of the function inputs can be defined in the snakefile instead of in train.py, but that's not an option for the HPC GPU's, right? So it seems like we either need to use the str approach for both or have 2 "not-quite-duplicated" approaches (one with str and one with function), yes? Are there enough benefits to using the function approach that it's worth having both?

janetrbarclay commented 3 years ago

I've noticed that on pangeo I don't need to install river-dl but on tallgrass I do, which makes debugging easier on pangeo (since there's no uninstall - reinstall steps needed for each change). Thinking about that now, I'm guessing that may be a result of using the CLI on tallgrass? (and a potential benefit of having the non-CLI option available?)

SimonTopp commented 2 years ago

Closing this since it's outdated after removing train_model_cli in #177. Kinda fun to look back at all these old issues though!