DOI-USGS / lake-temperature-model-prep

Pipeline #1
Other
6 stars 13 forks source link

Tiny edits (long explanation) to prevent mysterious rebuilding #322

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

While review @padilla410's pull request #321, I bumped into two targets that started to rebuild unexpectedly. After investigating, I figured out that there were two small changes I could make to fix those. The issues leading up to those discoveries are described below.

Rebuild issue 1: 7_config_merge/out/nml_meteo_fl_values.rds.ind started to rebuild ⛔

This was weird to me because I was really not expecting that target to start rebuilding. I had first set options(scipiper.dry_put = TRUE) then ran scmake('8_viz') to test that Julie's changes in #321 would work for me and this target started rebuilding. I stopped the build, then ran why_dirty('7_config_merge/out/nml_meteo_fl_values.rds.ind'). Based on the output, something was funky with the target NLDAS_grid.

> why_dirty('7_config_merge/out/nml_meteo_fl_values.rds.ind')
Since the last build of the target '7_config_merge/out/nml_meteo_fl_values.rds.ind':
* the dependency 'NLDAS_grid' has changed
# A tibble: 8 x 8
type name hash_old hash_new hash_mismatch dirty dirty_by_descent current
<chr> <chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl>
1 target 7_config_merge/out/nml_meteo_fl_values.rds.ind acf8296113~ acf829611~ FALSE TRUE FALSE FALSE
2 depends 2_crosswalk_munge/out/centroid_lakes_sf.rds.ind 7ffeb07809~ 7ffeb0780~ FALSE FALSE FALSE TRUE
3 depends 7_config_merge/out/nml_lake_depth_values.rds.ind d3054664c5~ d3054664c~ FALSE FALSE FALSE TRUE
4 depends NLDAS_grid c6120c6ee2~ ee53fdcd2~ TRUE FALSE FALSE TRUE
5 depends replete_time_range 475e6ae68a~ 475e6ae68~ FALSE FALSE FALSE TRUE
6 fixed NA 973e00ff13~ 973e00ff1~ FALSE FALSE FALSE TRUE
7 function create_meteo_filepath ad0e4ef38c~ ad0e4ef38~ FALSE FALSE FALSE TRUE
8 function munge_meteo_fl b2edbbbecc~ b2edbbbec~ FALSE FALSE FALSE TRUE

When I opened 6_drivers_fetch.yml to look at NLDAS_grid, I saw this (you can tell something is up with the formatting because the comment is not styled as a comment like it should be).

MicrosoftTeams-image (12)

Comments in makefiles need to be above target definitions, they cannot be on the same line. I think this was causing the issue and making NLDAS_grid get a new hash. When I moved the comment above the target (change in this PR), and then used why_dirty() as I did previously, the target was no longer dirty and therefore wasn't going to rebuild. YAY! 🎉

> why_dirty('7_config_merge/out/nml_meteo_fl_values.rds.ind')
Error in why_dirty("7_config_merge/out/nml_meteo_fl_values.rds.ind") :
target '7_config_merge/out/nml_meteo_fl_values.rds.ind' is not dirty

Rebuild issue 2: 8_viz/inout/lakes_summary.feather.ind started to rebuild ⛔

I got passed the first issue and then made sure to run options(scipiper.dry_put = TRUE) before scmake('8_viz') to continue testing Julie's PR changes. I got stuck on this next target and when I looked at the why_dirty() for it, I got the following:

> why_dirty('8_viz/inout/lakes_summary.feather.ind')
Since the last build of the target '8_viz/inout/lakes_summary.feather.ind':
* the dependency '7b_temp_merge/out/temp_data_with_sources.feather' has changed

When I went to look at where the 7b_temp_merge/out/temp_data_with_sources.feather file was used, I noticed that it was the only file passed into the function, merge_lake_data(), as the raw filepath and not the indicator file. I assumed that what was happening is my pipeline had no reason to pull down the most recent version of that file using sc_retrieve(), so it was different compared to what was used last time. After I ran sc_retrieve("7b_temp_merge/out/temp_data_with_sources.feather.ind") myself and then tried scmake('8_viz') again, it no longer tries to rebuild this target:

> why_dirty('8_viz/inout/lakes_summary.feather.ind')
Error in why_dirty("8_viz/inout/lakes_summary.feather.ind") :
target '8_viz/inout/lakes_summary.feather.ind' is not dirty

To avoid this in the future (and to match how all other data files are referenced in that function), I updated it to use the .ind file for the temperature data rather than the raw feather file. So, sc_retrieve() is always called for it, which avoids this issue of the local data file being different than what the pipeline expects and causing a rebuild. YAY! 🎉

Other build files that came along for the ride

Here are a few additional comments about what just rebuilt and is considered "changed" in this PR with an explanation (mostly so that when someone searches the repo, these will pop-up to provide answers!):

  1. 6_temp_coop_fetch/out/coop_all_files.rds.tind + coop_wants: the .tind indicator file ending stands for "time indicator" and it is because we have that file set up to always rebuild. The file build .tind ends up with the new hash and the freshened field, meaning the latest date that the file got rebuilt. Again, this builds every time scmake() is called and that file is in the dependency tree of your target of interest. coop_wants always rebuilds because it depends on that .tind file which is always different thanks to it always rebuilding and getting the new timestamp.
  2. 8_viz/out/lake_obs_counts.png: this rebuilt somewhat unexpectedly. I'm not able to see any differences using swipe or onion skin in the file changes tab. Nothing seems to be different from what I can tell! Mysterious ... 🕵️‍♂️
  3. 7a_temp_coop_munge/tmp/parser_files.yml + coop_parsers: again, this target is setup to always build. It does not have an indicator file associated, so there was nothing to commit but it did rebuild. coop_parsers depends on that .yml so it always rebuilds, too. I get nervous when things I don't expect start to rebuild, so I wanted to explicitly call these out in case someone (even future me) searches these targets looking for answers in this repo 😃
padilla410 commented 2 years ago

Thanks for tagging me in this, @lindsayplatt. I learned a lot!

jordansread commented 2 years ago

7a_temp_coop_munge/tmp/parser_files.yml + coop_parsers: again, this target is setup to always build. It does not have an indicator file associated, so there was nothing to commit but it did rebuild. coop_parsers depends on that .yml so it always rebuilds, too. I get nervous when things I don't expect start to rebuild, so I wanted to explicitly call these out in case someone (even future me) searches these targets looking for answers in this repo 😃

My understanding is that we want this target to rebuild every time because otherwise file changes in the parsers don't get tracked. Since it is a quick target to rebuild and if none of the file hashes change (or new/removed ones) it doesn't trigger downstream builds. This greedy build was added as part of the "brittle coop pipeline" fixes if I remember correctly.

When I went to look at where the 7b_temp_merge/out/temp_data_with_sources.feather file was used, I noticed that it was the only file passed into the function, merge_lake_data(), as the raw filepath and not the indicator file

I don't really know why this one would be a file and the others would be ind...I feel like we probably had a reason for doing that but don't have a memory of it. I also remember a few other instances of using the files as dependencies instead of the .ind, and we were using a function called "require_local" to handle this difference. I admit I never developed a good understanding of what patterns we were using that needed that and I think it has been phased out of this pipeline but perhaps this file is a remnant of that old pattern?

jordansread commented 2 years ago

More for the parser files yaml. Yes, this was connected to an "always stale" pattern for the reason I described above.

I moved the "forever stale" target from https://github.com/USGS-R/lake-temperature-model-prep/pull/242 to 1_fetch so it could be used by other steps easily

☝️ from this PR