More for #273. This uses the new grid from #297 and the new tile size from #298 to redownload the downscaled GCMs. First, this cut the download time down from 16 hours to 9 hours 🎉. Second, the work described in https://github.com/USGS-R/lake-temperature-model-prep/issues/273#issuecomment-1030234651 and https://github.com/USGS-R/lake-temperature-model-prep/issues/273#issuecomment-1031981599 makes me confident that we have the grid correct.
I also made adjustments to how we are using the `DateTime` and `time` (day of year) columns returned from GDP to create dates. The GCMs are built assuming that every year has 365 days, so if we just do `mutate(date = as.Date(DateTime) + time)`, we end up mismatching our dates because R assumes you want to include Feb 29 if it is a leap year. It took quite the workaround to get something that correctly converts `DateTime` and the day of year (ranging from `0:364`) into the appropriate day while skipping every Feb 29. It was not helped by the fact that `DateTime` is supposed to represent the first date of the current year but is quite messy when returned from GDP: sometimes you have values like `1984-12-31 23:15:03` representing the first day of 1985. For this reason, I had to implement some interesting code that figures out the actual first day for the data, then adds one more day to dates after Feb 28 in a leap year, shifting dates so that they never land on Feb 29. I confirmed that this works by comparing one lake's data from my build of this pipeline to that of the Winslow 2017 release GCMs in https://github.com/USGS-R/lake-temperature-model-prep/issues/273#issuecomment-1032031775.
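To make that workaround concrete, here is a minimal sketch of the conversion, assuming the `DateTime` and `time` columns arrive as described above and that the messy timestamp always sits within a few hours of the true year start; the function name and tidyverse-style implementation are illustrative, not the pipeline's exact code:

```r
library(dplyr)
library(lubridate)

convert_noleap_dates <- function(gcm_data) {
  gcm_data %>%
    mutate(
      # The messy timestamp is always near a year boundary, so rounding to
      # the nearest year recovers the true first day of the year, e.g.
      # "1984-12-31 23:15:03" becomes 1985-01-01
      start_date = as.Date(round_date(as_datetime(DateTime), unit = "year")),
      # Offsets 0:58 cover Jan 1 through Feb 28 in every calendar. In a leap
      # year, offsets >= 59 (Mar 1 onward in the 365-day GCM calendar) must
      # be shifted forward one day so no date ever lands on Feb 29.
      date = start_date + time + if_else(leap_year(start_date) & time >= 59, 1L, 0L)
    ) %>%
    select(-start_date)
}
```

With that shift, `time = 59` in a leap year like 1984 maps to 1984-03-01 instead of 1984-02-29, and `time = 364` still lands on 1984-12-31.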
Unfortunately, the target for cleaning all the GCMs went from taking 2.5 minutes to just over 5 minutes after these leap year changes (not surprising given the grouping, etc. that goes on). However, given that the download still takes 9 hours, I felt like 5 minutes wasn't worth fretting over for now. Using `data.table` functions (or a hybrid approach with `dtplyr`) could probably give us some gains, but I didn't want to spend too much time on optimizing now, especially with the looming deadline and without having it reviewed first.
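If we do revisit performance later, the hybrid could look something like this hypothetical sketch (the function name, grouping columns, and computation are placeholders, not the pipeline's actual cleaning logic):

```r
library(dplyr)
library(dtplyr)

clean_gcm_data <- function(gcm_data) {
  gcm_data %>%
    # Wrap the tibble in a lazy data.table so the same dplyr verbs are
    # translated to (typically faster) data.table operations
    lazy_dt() %>%
    group_by(gcm, cell_no) %>%                  # placeholder grouping columns
    mutate(day_offset = time - min(time)) %>%   # placeholder grouped computation
    ungroup() %>%
    as_tibble()                                 # collect results back to a tibble
}
```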