USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal

"as-run" config to text file? #137

Closed janetrbarclay closed 2 years ago

janetrbarclay commented 2 years ago

I've added a rule to my project Snakefile that writes the config file, some git specs (branch and commit), and the run date to a text file in the output directory at run-time. @jsadler2 @SimonTopp @aappling-usgs @jzwart Is this something we'd like in the base river-dl Snakefile?
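
A minimal sketch of what such an "as-run" record could look like, written as a plain Python helper that a Snakefile rule's `run:` block might call. The function names (`git_output`, `write_asrun_config`) and the output file name `asrun_config.txt` are hypothetical and are not taken from the river-dl repository; the actual rule may be structured differently.

```python
import datetime
import subprocess
from pathlib import Path


def git_output(*args):
    """Return the stripped stdout of a git command, or 'unknown' if it fails."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a git repository
        return "unknown"


def write_asrun_config(config, out_dir, filename="asrun_config.txt"):
    """Write the run date, git branch and commit, and config to a text file."""
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    lines = [
        f"run_date: {datetime.datetime.now().isoformat(timespec='seconds')}",
        f"git_branch: {git_output('rev-parse', '--abbrev-ref', 'HEAD')}",
        f"git_commit: {git_output('rev-parse', 'HEAD')}",
        "config:",
    ]
    # One line per config entry, sorted so that two runs are easy to diff
    lines += [f"  {key}: {value}" for key, value in sorted(config.items())]
    out_path.write_text("\n".join(lines) + "\n")
    return out_path
```

Inside a Snakefile, the `config` dictionary that Snakemake provides could be passed straight to `write_asrun_config(config, out_dir)`, with the text file declared as the rule's output so that it is regenerated on each run.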

jzwart commented 2 years ago

That seems great. Is this for improving model run reproducibility? I also wonder if we could add dataset versions too.

SimonTopp commented 2 years ago

I think it's a great idea! I like the idea of dataset versioning too, but I'm not sure how detailed we want it to be. Would a tag of when the data was downloaded from ScienceBase be sufficient? We could add something to the data-prep pipeline that updates the data-accessed tag each time you run it.

jsadler2 commented 2 years ago

Great idea!

janetrbarclay commented 2 years ago

I agree it would be great to include dataset info. What kind of tracking info do we have for that? It would be easy to grab the file metadata for the input files, which gives some info. The data prep pipeline that Simon referenced already saves a txt file with summary stats for each data file (min / max / # of NAs / etc. for each variable). I could update that to include a date, etc. (but I'm not sure that anyone besides Simon and me is using it, so that may be problematic for others using river-dl).
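
As a rough illustration of "grabbing the file metadata for the input files", the sketch below collects the size and last-modified time of a file. The SHA-256 checksum is an addition not mentioned in the thread, included on the assumption that a content hash is useful for telling two versions of a file apart. The function name `describe_input_file` is hypothetical.

```python
import datetime
import hashlib
from pathlib import Path


def describe_input_file(path):
    """Return basic metadata for one input file as a dictionary."""
    path = Path(path)
    stat = path.stat()
    sha256 = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large data files are not loaded at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
    modified = datetime.datetime.fromtimestamp(stat.st_mtime)
    return {
        "file": str(path),
        "size_bytes": stat.st_size,
        "modified": modified.isoformat(timespec="seconds"),
        "sha256": sha256.hexdigest(),  # assumption: not part of the original proposal
    }
```

Note that the modification time reflects when the local copy was last written, which is not necessarily when the data was downloaded or published.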


jzwart commented 2 years ago

Good question about how detailed we want dataset versioning to be. The summary stats are great, and the config file should already have the input file names, so maybe that's the best we can do right now. I wonder if storing dataset versions in the data we produce is something we should start doing, which would make it easier to pull out that metadata in modeling pipelines like this one.

SimonTopp commented 2 years ago

One easy lift would just be a naming convention for our input files that includes the date they were downloaded from SB. Long term I like the idea of dataset versions with associated metadata though.
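
One possible shape for such a naming convention, sketched under the assumption that the download date is appended to the file stem in ISO format (`YYYY-MM-DD`). The helper names and the example file name `obs_temp.csv` are hypothetical, and the thread does not settle on a specific pattern.

```python
import datetime
import re
from pathlib import Path

# Matches a trailing "_YYYY-MM-DD" at the end of a file stem
DATE_SUFFIX = re.compile(r"_(\d{4}-\d{2}-\d{2})$")


def dated_filename(filename, download_date):
    """Insert the download date between the file stem and its extension."""
    path = Path(filename)
    return f"{path.stem}_{download_date.isoformat()}{path.suffix}"


def parse_download_date(filename):
    """Recover the download date from a dated file name, or None if absent."""
    match = DATE_SUFFIX.search(Path(filename).stem)
    return datetime.date.fromisoformat(match.group(1)) if match else None
```

For example, `dated_filename("obs_temp.csv", datetime.date(2021, 10, 25))` gives `"obs_temp_2021-10-25.csv"`, and `parse_download_date` on that name returns the same date, so a modeling pipeline could log it alongside the config.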

jzwart commented 2 years ago

@whiteellie mentioned Neptune for dataset versioning during our iterative model development meeting: https://docs.neptune.ai/how-to-guides/data-versioning . Something to keep in mind for later.