OCHA-DAP / hdx-signals

HDX Signals

https://un-ocha-centre-for-humanitarian.gitbook.io/hdx-signals/

GNU General Public License v3.0

global-monitoring GH Action failure (`{renv}` + caching = problem) #12

Closed by zackarno 1 year ago

zackarno commented 1 year ago

Several CRON-scheduled GH Action runs have failed (runs 50-74). The issue appears to arise from the interaction between package caching and the {renv} package. It can be fixed manually by deleting the cache in GitHub and running the workflow again. After the cache is deleted, the CRON job runs successfully several times before failing again, at which point the same fix has to be repeated.

Rather than having to continually check and delete the cache on the GitHub repo, I think an easy fix is just to delete the caching step of the action. I’d rather have a 10 minute fully automated run than have to check and manually trigger a faster 2 minute run.

I removed the caching step in a branch with 713bc0f and had a successful manual run (#75). I will do a PR, merge, and monitor the CRON jobs over the coming week. I will also post some leads on the {renv} + caching issue so we can hopefully troubleshoot it and integrate caching back in eventually.
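For illustration, a workflow without the caching step might look roughly like the sketch below. This is not the repo's actual YAML: the workflow and job names, the runner, the action versions, and the script path are all assumptions. It also assumes the project's renv autoloader is in place so that renv::restore() works from a fresh checkout.

```yaml
# Sketch only: names, runner, action versions, and script path are assumptions.
name: global-monitoring
on:
  workflow_dispatch:          # allows manual runs from the Actions tab
jobs:
  monitor:                    # hypothetical job name
    runs-on: macos-latest     # assumed runner
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      # No actions/cache step here, so every run reinstalls all packages.
      - name: Restore packages from renv.lock
        run: Rscript -e 'renv::restore()'
      - name: Run monitoring script
        run: Rscript run.R    # assumed path to the main script
```

The trade-off is the one described above: a slower run on every trigger in exchange for not depending on whatever state the cache is in.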

zackarno commented 1 year ago

Scheduled runs were successful with the cache action removed. I therefore temporarily increased the CRON frequency (30e2ab4ae9e0a8fd21df4ae62dbd5bc500089407) to check more quickly that removing the caching step remains a viable stopgap over multiple runs (the GH action with caching would run successfully multiple times and then fail).

If there are no issues in the GH action runs, I will change the frequency back to just M-F at 11 am.
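As a rough sketch, the M-F at 11 am schedule corresponds to a trigger like the one below. GitHub evaluates cron expressions in UTC, so this assumes 11 am means 11:00 UTC; the thread does not say which timezone is intended.

```yaml
on:
  schedule:
    # Fields: minute hour day-of-month month day-of-week (evaluated in UTC)
    - cron: "0 11 * * 1-5"    # 11:00, Monday to Friday
  workflow_dispatch:          # keeps manual runs available
```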

zackarno commented 1 year ago

Run #78 failed even though the caching step was removed. The issue therefore appears not to be directly related to the GHA caching step, but rather to renv::restore() and Ripc. I re-ran #78 and it failed again. I then deleted the cache and ran it as a new workflow (#79), and it succeeded. I don't know whether deleting the cache or running as a new workflow was the key to success in this case, as the caching step now seems unrelated.

I've noticed a pattern of 3 successful runs -> failure.

I am going to add the caching action back in and reset to normal schedule soon.

No idea if this is related, but I wonder whether the use of {memoise} in Ripc could be doing something funky, and whether adding memoise::forget(Ripc:::ipc_get) could help. I might try adding that to the top of the main run.R script.

zackarno commented 1 year ago

Some runs are still failing and the issue is not resolved, so I reverted the YAML back to how it was:

- Since the caching action was not the issue, I added the step back in (e847339918a25cbe3fe6966a481d5f9b86894804)
- Switched the CRON job back to the intended schedule (6b16ed92b950629d43d274a3b2b9f613663f503c)

Manually re-running the workflow seems to work eventually.

caldwellst commented 1 year ago

The issue with the global monitoring seemed to stem from rhdx and sf, at different points in time, when running on macOS. In particular, with sf you have to worry about whether it is installed from source or from a binary. I am not sure what the issue with rhdx was. I have finally gotten the system running again on ubuntu-latest, and I removed the already deprecated functions from Ripc that relied on rhdx, so we no longer need that package.
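A minimal sketch of what the move to ubuntu-latest could look like is below. The job name and action versions are assumptions, not the repo's actual configuration. The system libraries listed are the ones sf typically needs when it is built from source on Ubuntu; they may be unnecessary if a prebuilt binary is used.

```yaml
jobs:
  monitor:                      # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true # prefer prebuilt Linux binaries where available
      # Only needed if sf ends up being compiled from source.
      - name: Install system libraries for sf
        run: |
          sudo apt-get update
          sudo apt-get install -y libgdal-dev libgeos-dev libproj-dev libudunits2-dev
```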

Will re-open if we run into issues again, but hopefully this has solved everything!