Closed SimonTopp closed 2 years ago
Thanks Simon for making this issue.
lambdas
should be used in the definition of the loss function that is used when the model is compiled on this line:
https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/river_dl/train_model.py#L94
So we need to just pass along that variable.
We will need to think about how lamdas
which is a list
is passed over the commandline. I think this will work: https://stackoverflow.com/a/43786290/5811195
Ok, I think I understand. Although in looking at this deeper it seems like maybe there's a better way to do this. If I understand the code we're defining the loss function once in the at line 12 in the Snakefile https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/Snakefile#L12 but then we don't pass that to the command line call for train_model, instead we re-define it on line 62. https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/Snakefile#L62 This seems a little error prone if someone doesn't notice and only changes it in one place. Does it make more sense to change the upstream definition/pass it through to avoid confusion? Also, apologies if I'm not looking at this right.
A related question - I add 2 additional lambdas with the GW loss function. Should I add these to this list of lambdas or keep them as separate input terms?
These are really good/tough questions.
@SimonTopp - yes you are exactly right that there are two places where the loss function is currently defined in the Snakefile. I like the idea of just having one place where it is defined as you suggested. I'm not sure how that would work, though.
The hang up is that commandline interface. That restricts us to passing a string as an argument. Whereas, currently, the train_model
function https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/river_dl/train.py#L23 takes as an argument loss_function
which is a python function. So I see two possible modifications:
loss_function
in the Snakefile as a parameter of the train_model_local_or_cpu
rule. This would make it clear that the loss_function variable is being used for this rule only and would be commented out if using the CLI on tallgrasstrain_model
's loss_function
argument to be a str
instead of function
and then we use a function like is currently in the train_model.py
to get the function that corresponds to the string. https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/river_dl/train_model.py#L6There may be some other alternative and if you think of one, feel free to suggest it.
@janetrbarclay - a lot of it depends on if you are using CPU or GPU. If you are using CPU, you would just define the loss function (with the new lambdas) in your Snakefile and pass that function to the train_model
as the loss_function
argument.
If you are using GPU's on the other hand, you will need to pass everything you will need to define your loss function through the CLI. This is a bit tedious because then you have to create a corresponding commandline variable for each of those. I see you've already done that with your new lambdas https://github.com/USGS-R/river-dl/blob/7a92369e57e27a34ca6ff99fc4b216b6bc179874/river_dl/train_model.py#L62. If we stick with passing the loss_function
as a function (not a string), you will also need to handle the temp_mean, temp_sd, gw_mean, gw_std
arguments too, since the loss function has to be defined before calling train_model
(the function - not to be confused with train_model.py
)
I guess I was thinking more conceptually - do we want to have the GW lambdas (when used) appended to the existing list of lambdas or kept separate? Maybe that is better answered once the question of where / how to define the loss function is answered?
Janet Barclay U.S. Geological Survey New England Water Science Center Connecticut Office 101 Pitkin St. East Hartford, CT 06108
Phone (office) 860 291-6763 Fax 860 291-6799 Email @.**@*.**@*.***> https://www.usgs.gov/staff-profiles/janet-barclay
From: Jeff Sadler @.> Sent: Tuesday, June 15, 2021 2:54 PM To: USGS-R/river-dl @.> Cc: Barclay, Janet R @.>; Mention @.> Subject: [EXTERNAL] Re: [USGS-R/river-dl] Lambdas parameter for RGCN on HPC GPU (#110)
This email has been received from outside of DOI - Use caution before clicking on links, opening attachments, or responding.
@janetrbarclayhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fjanetrbarclay&data=04%7C01%7Cjbarclay%40usgs.gov%7C90a3beab5ebb44088f1208d9302f16d1%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637593801086554897%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=v78lLOH8R3PRkklsEXqnuUdfwvQK7IraS%2BRjnEIXPuY%3D&reserved=0 - a lot of it depends on if you are using CPU or GPU. If you are using CPU, you would just define the loss function (with the new lambdas) in your Snakefile and pass that function to the train_model as the loss_function argument.
If you are using GPU's on the other hand, you will need to pass everything you will need to define your loss function through the CLI. This is a bit tedious because then you have to create a corresponding commandline variable for each of those. I see you've already done that with your new lambdas https://github.com/USGS-R/river-dl/blob/7a92369e57e27a34ca6ff99fc4b216b6bc179874/river_dl/train_model.py#L62https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FUSGS-R%2Friver-dl%2Fblob%2F7a92369e57e27a34ca6ff99fc4b216b6bc179874%2Friver_dl%2Ftrain_model.py%23L62&data=04%7C01%7Cjbarclay%40usgs.gov%7C90a3beab5ebb44088f1208d9302f16d1%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637593801086554897%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HB3%2FCkxDHsX6Ke%2B0Kaknd9LGH7TlE%2FznRZE%2B7G%2B2jU8%3D&reserved=0. If we stick with passing the loss_function as a function (not a string), you will also need to handle the temp_mean, temp_sd, gw_mean, gw_std arguments too, since the loss function has to be defined before calling train_model (the function - not to be confused with train_model.py)
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FUSGS-R%2Friver-dl%2Fissues%2F110%23issuecomment-861752429&data=04%7C01%7Cjbarclay%40usgs.gov%7C90a3beab5ebb44088f1208d9302f16d1%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637593801086564852%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Xa1nmXQiWNDsJs8vTYXq3%2Fk0k%2Fc1GeTjCasy%2BHsKJZ8%3D&reserved=0, or unsubscribehttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA5H7UASKPN56NQZ7UFTJG3TS6OYDANCNFSM46VNTZIA&data=04%7C01%7Cjbarclay%40usgs.gov%7C90a3beab5ebb44088f1208d9302f16d1%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637593801086564852%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=QXjL5Qd3BEvTaTunkmiPjklayDsxQ%2BGaBxemYWuVI%2Fs%3D&reserved=0.
@jsadler2 I think alternative 2, passing a str
to train.py, makes more sense as it avoids the need to comment out code between GPU and CPU, that might also make Janet's question more straightforward because in the CLI we could just pass the lambdas in as a list with nargs='+'
to be an arbitrary number of lambdas. Does that make sense?
@janetrbarclay -
Maybe that is better answered once the question of where / how to define the loss function is answered?
I think that's the key question. If we define it in the Snakefile then we can have less "project-specific" functionality in the centralriver-dl
code base (i.e., we won't have to handle all of the eventual variations in train.py
).
@SimonTopp
I think alternative 2, passing a str to train.py, makes more sense as it avoids the need to comment out code between GPU and CPU
Actually that str
approach does not change the need to switch between the two rules for gpu/cpu. If we want to make that move, we could and just use the CLI for both. That would be another way of addressing Janet's suggestion in #113
@jsadler2 Sorry, my comment wasn't very clear. The str
approach would make it so the loss function is only defined once in any given run, whether it's in train
or train_model_local_or_cpu
. That way it's at least defined in a similar place between the two rather than up top for one and in the rule for the other (in my head I comment out the first definition of loss function when using GPU even though you don't actually have to).
@jsadler2 @SimonTopp This is a tough one. It seems the more we can keep the project / run - specific modifications to the outer files (config.yml > snakefile > river_dl code base) the better. Passing in a function
makes this a little easier since some of the function inputs can be defined in the snakefile instead of in train.py
, but that's not an option for the HPC GPU's, right? So it seems like we either need to use the str
approach for both or have 2 "not-quite-duplicated" approaches (one with str
and one with function
), yes? Are there enough benefits to using the function
approach that it's worth having both?
I've noticed that on pangeo I don't need to install river-dl but on tallgrass I do, which makes debugging easier on pangeo (since there's no uninstall - reinstall steps needed for each change). Thinking about that now, I'm guessing that may be a result of using the CLI on tallgrass? (and a potential benefit of having the non-CLI option available?)
Closing this since it's outdated after removing train_model_cli
in #177. Kinda fun to look back at all these old issues though!
https://github.com/USGS-R/river-dl/blob/9d2d340dd1f95bb09eb00a04cad5e73327fdb65e/Snakefile#L58
Currently config.yml defines lambdas as a list, but when calling train_model in the snakefile for HPC GPU it tries to pull a single "lamb" argument from config.yml. There are potentially problems changing the "lamb" in train_model to "lambdas" because the RGCN isn't expecting a list. It's also unclear if the lambdas argument is used in defining the loss function in train_model.py.