xylar opened 4 days ago
@mahf708, do you have suggestions (e.g. containers) that we could use for e3sm_diags instead of downloading directly from the LCRC server?
We are seeing time-outs in https://github.com/conda-forge/e3sm_diags-feedstock/pull/38, which are likely to cause ongoing trouble building conda packages.
I was able to get CI to pass on https://github.com/conda-forge/e3sm_diags-feedstock/pull/38 after restarting 3 times.
After 8 attempts, I was finally able to get the conda package to build. Needless to say, this is not sustainable.
I will write a more detailed comment, but I think we should use a container. I can make one for e3sm diags like the others I made for testing in https://github.com/E3SM-Project/containers
It's been on my list of todos to get a generic conda container that has some of our data from the servers...
I disabled two workflows (scream defaults and mkatmsrf...) because of this very reason
@xylar yes, this has now become an outstanding issue, and we should find alternatives for hosting the data needed for CI. Does MPAS-Analysis have a similar need, or is it handled differently?
@mahf708 it looks like in the container repo you already have code for pulling data from the input data directory; it sounds like we can mimic that to add the diagnostics data.
@chengzhuzhang, this issue doesn't affect MPAS-Analysis because we don't try to do anything so sophisticated in CI. I still run tests manually on Chrysalis as needed.
I had an inclination that running the GH Actions build with Python 3.9-3.12 while simultaneously downloading the same data for each run would throttle LCRC. If a short-term solution is needed, we can make GH Actions run only when a PR is marked as ready for review. A possible alternative mentioned before is to cache the diagnostic data on GitHub Actions, then update the cache whenever updated diags data is detected on LCRC.
It looks like we still need a general solution for azure pipelines though.
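For the caching idea, a minimal sketch of what a workflow step could look like with `actions/cache` is below. The data path and cache key are assumptions for illustration, not the actual e3sm_diags layout:

```yaml
# Hypothetical step sketch; path and key are placeholders, not the
# real e3sm_diags test-data layout.
- name: Cache diagnostics test data
  uses: actions/cache@v4
  with:
    path: tests/integration/integration_test_data
    key: diags-data-${{ hashFiles('tests/integration/download_data.py') }}
    restore-keys: |
      diags-data-
```

On a cache hit, the download step could then be skipped entirely, so LCRC is only touched when the key changes.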
I think we really need to make it forbidden to download files from LCRC in CI. It's badly affecting our ability to do other work.
So I think even if we allow it in fewer circumstances, it's still not good enough.
I would like to make a container based on the official conda-forge miniconda container, then add the needed inputdata to it. I will put up a prototype on https://github.com/E3SM-Project/containers in the next few days (I need to collect info about the data needed)
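A rough sketch of such a container, assuming the data is staged into the build context first (the base image choice, paths, and environment variable name are all placeholders):

```dockerfile
# Sketch only: paths and the env var are illustrative, not an agreed layout.
FROM condaforge/miniforge3:latest

# Bake the diagnostics input data into the image so CI never has to
# download it from the LCRC server at test time.
RUN mkdir -p /opt/e3sm_diags_data
COPY e3sm_diags_test_data/ /opt/e3sm_diags_data/
ENV E3SM_DIAGS_TEST_DATA=/opt/e3sm_diags_data
```

CI jobs would then run inside this image and point the tests at the baked-in data directory instead of the server.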
@xylar Makes sense to me.
@mahf708 Let me know if you'd like me to test it out with e3sm_diags when it is ready.
Can we get @rljacob to weigh in just in case he prefers something else?
Rob, should we institute a policy that none of our testing should touch the inputdata server? I doubt it is the sole reason we are seeing issues, but who knows...
I am happy to streamline a few containers with everything we need, so that we have no reason to download stuff from the server
A resolution is offered in #901
We have been getting downloads throttled by LCRC on https://web.lcrc.anl.gov/ because we use too much bandwidth. This is affecting research.
We should seek alternatives to https://web.lcrc.anl.gov/ in our CI, e.g.: https://github.com/E3SM-Project/e3sm_diags/blob/ca41b0e5d913610c88410928951f1ed11c75663f/tests/integration/download_data.py#L89
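As a stopgap while the container work lands (and not a substitute for it), the download helper could at least retry with exponential backoff so transient throttling doesn't fail the whole CI run. This is a hedged sketch, not the project's actual API; the function name, defaults, and `opener` hook are all illustrative:

```python
import time
import urllib.request


def download_with_retry(url, dest, retries=4, base_delay=2.0,
                        opener=urllib.request.urlretrieve):
    """Fetch url into dest, backing off between failed attempts.

    Sketch only: names and defaults are illustrative. The opener hook
    exists so the retry logic can be exercised without the network.
    """
    for attempt in range(retries):
        try:
            return opener(url, dest)
        except OSError:
            if attempt == retries - 1:
                raise
            # Sleep base_delay * 2**attempt seconds (2, 4, 8, ...)
            # before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
```

This only smooths over brief throttling windows; as noted above, repeated restarts are not sustainable, so removing LCRC from CI entirely is still the real fix.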