USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal

GW loss function goes to NaN when running on GPU #130

Closed SimonTopp closed 2 years ago

SimonTopp commented 3 years ago

The groundwater loss function is currently pinned to the CPU regardless of GPU availability, because running it on a GPU causes the loss to go to NaN within roughly the first epoch. I'm not sure why this is, but based on various threads (e.g. keras-team/keras#1244), it could be either a hardware or a software issue.

Things we've tried without success: gradient clipping (by norm and by value), increasing epsilon in the optimizer, changing the optimizer, limiting GPU memory growth, replacing all NaNs in Delta Phi and AR with 0, and changing all our values from float64 to float32. The loss function handles pre-training just fine, which might suggest the problem has something to do with too many NaNs in our fine-tuning dataset. For now the loss function is stable regardless of where you run the model, since it always executes on the CPU, but solving this eventually would be good.

SimonTopp commented 2 years ago

Wanted to update this quickly. At this point I don't think this is explicitly an issue with the GW loss function. I was able to look at when it breaks, and it seems to do so when the predicted temperatures are all NaN. Why the predicted temperatures are sometimes all NaN, and why the plain multitask RMSE can handle this, is another question. These NaN predictions are seemingly random, and we've seen them everywhere from the third epoch to the 100th (although they do seem to occur more frequently later in training). I'm digging into it more, but I basically have to let it run and hope it breaks before I can step through the code and trace things back, so it's a slow process.
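For context on how a whole batch of predictions can turn into NaN at once, here is a minimal sketch in plain Python. It is not the river-dl model or a diagnosis of this issue; the toy linear model, the data, and the "one bad weight" scenario are all assumptions made for illustration. It only shows standard floating-point behavior: a single NaN weight makes every prediction NaN, even for inputs where the matching feature is 0, and an unmasked RMSE over those predictions is NaN as well.

```python
import math


def predict(weights, bias, features):
    """Toy linear model (hypothetical): dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def rmse(predictions, observations):
    """Plain RMSE over paired values, with no masking."""
    squared = [(p - o) ** 2 for p, o in zip(predictions, observations)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up inputs and observations, for illustration only.
inputs = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
observed = [1.0, 2.0, 0.5]

# Forward pass with finite weights.
healthy = [predict([1.0, 1.0], 0.5, x) for x in inputs]

# Same forward pass after one weight has become NaN (assumed scenario,
# e.g. left behind by a single bad update). NaN * 0.0 is still NaN, so
# every prediction is affected, not just some of them.
poisoned = [predict([float("nan"), 1.0], 0.5, x) for x in inputs]

healthy_loss = rmse(healthy, observed)
poisoned_loss = rmse(poisoned, observed)

print("healthy loss is finite:", math.isfinite(healthy_loss))
print("all poisoned predictions are NaN:", all(math.isnan(p) for p in poisoned))
print("poisoned loss is NaN:", math.isnan(poisoned_loss))
```

If something like this is happening, the all-NaN predictions would be a symptom of an earlier step rather than of the loss function that first reports them, which is consistent with having to trace backwards from the failure.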

aappling-usgs commented 2 years ago

NaN predictions??? Why would we have even one of those?

SimonTopp commented 2 years ago

Not at all sure. I got an SSH tunnel set up between PyCharm and TG, so I can run the training routine with more advanced debugging in the same environment where we're having the problems (CPU only, no GPU yet). I put in a couple of lines to break the training and attach a debugger if there are any NaN or Inf values in the loss function, and when I caught one, it all traced back to nothing but NaN predicted temperatures being passed to the loss function. I've been trying to replicate it, but since we don't know exactly what causes it, I can't force it to break. So I'm basically just re-running the model until it happens again so I can look deeper. It's a double whammy because the function also has to run with eager execution, which makes it slower.
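A minimal sketch of the kind of check described above, written in plain Python rather than as the actual lines added to river-dl. The names `NonFiniteError`, `count_non_finite`, and `check_step` are hypothetical, and `predictions` is assumed to be a flat list of floats already pulled out of the framework's tensors, which is why such a check needs concrete values (eager execution) to work.

```python
import math


class NonFiniteError(RuntimeError):
    """Raised when a monitored value contains NaN or Inf."""


def count_non_finite(values):
    """Return how many entries of a flat sequence of floats are NaN or Inf."""
    return sum(1 for v in values if not math.isfinite(v))


def check_step(epoch, batch, predictions, loss):
    """Stop at the first training step with non-finite predictions or loss.

    All argument names are hypothetical. The error message records where the
    failure happened and how many predictions were affected.
    """
    bad = count_non_finite(predictions)
    if bad or not math.isfinite(loss):
        raise NonFiniteError(
            f"epoch {epoch}, batch {batch}: loss={loss}, "
            f"{bad}/{len(predictions)} predictions non-finite"
        )


# Usage inside a (hypothetical) training loop:
# try:
#     check_step(epoch, batch, preds, loss)
# except NonFiniteError:
#     breakpoint()  # drop into the debugger with the failing batch in scope
#     raise
```

Reporting the count of non-finite predictions separates the "everything is NaN" case seen here from a handful of bad values, and checking the predictions as well as the loss catches the problem one step earlier than checking the loss alone.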

aappling-usgs commented 2 years ago

PyCharmPro + brute force for the win (I hope)! Thanks for digging - this sounds important for multiple projects.

SimonTopp commented 2 years ago

@janetrbarclay, are you comfortable closing this issue now that we've moved to PyTorch where we no longer have the issue?

janetrbarclay commented 2 years ago

Yes, that sounds good to me.



