MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.
https://messerlab.org/slim/
GNU General Public License v3.0

Fix CI in macos & move out to mamba #457

Closed · currocam closed this 1 month ago

currocam commented 1 month ago

Hi there!

As you proposed in #455, I have given fixing the CI on macOS a try. After some attempts (35, apparently, according to the git log) I think I solved it. It is a pretty dense commit, though. In summary:

  1. I tried to homogenize the style
  2. I moved the CI from conda to mamba. That should give better runtimes, but to be honest I did it so I could copy-paste the example of how to cache the environment from the documentation (see the sketch just after this list).
  3. I removed pip dependencies. Instead, a conda environment is created.
  4. The cached conda environment is recreated every day (if necessary). I actually found that I could not recreate the cached packages of the SLiM repo because the conda-forge channel was missing, so this might be a good idea.
  5. I have updated the GitHub Actions from the marketplace to the latest available versions.
  6. I think apt update takes long enough that it is worth having a separate case that adds the toolchain PPA only when we require gcc-11.
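
For concreteness, here is a minimal sketch of the mamba-plus-environment pattern described in items 2-4. It is illustrative only, not the actual tests.yml from this PR; the job name, environment name, environment.yml path, and the final check step are assumptions:

```yaml
# Hypothetical sketch, not the actual tests.yml: create the conda environment
# with mamba via setup-miniconda. Job/environment names, the environment.yml
# path, and the final check step are illustrative assumptions.
name: tests
on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-20.04
    defaults:
      run:
        shell: bash -el {0}   # login shell so the activated conda env is used
    steps:
      - uses: actions/checkout@v4

      - uses: conda-incubator/setup-miniconda@v3
        with:
          miniforge-variant: Mambaforge   # ships mamba alongside conda
          use-mamba: true
          activate-environment: slim-ci

      - name: Create/update the environment with mamba
        run: mamba env update -n slim-ci -f environment.yml

      - name: Sanity-check the Python dependencies
        run: python -c "import tskit, msprime, pyslim"
```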

The runtime has not changed a lot. I ran both at the same time; my version takes around 21-22 minutes, while the previous version takes around 24 minutes. If you take into account that the previous version does not run the macOS jobs, I think it's fair to say that these changes did not make CI slower. That said, I was a bit surprised to find separate jobs for building the GUI and CLI. I would combine them, as that would probably make CI substantially faster.

bhaller commented 1 month ago

Hi there!

Hi @currocam!

As you proposed in #455, I have given fixing the CI on macOS a try. After some attempts (35, apparently, according to the git log) I think I solved it. It is a pretty dense commit, though. In summary:

  1. I tried to homogenize the style

OK. That's hard to review with diffs, of course, but probably the changes are great. I think with this PR I will probably just have to take a flyer and trust that you know what you're doing. :->

  2. I moved the CI from conda to mamba. That should give better runtimes, but to be honest I did it so I could copy-paste the example of how to cache the environment from the documentation.

OK. I know very little about conda, and I don't think I've even heard of mamba before; so again I think I'll just trust that you know what you're doing.

  3. Rather than installing pip dependencies and conda dependencies, it recreates the conda environment.
  4. The cached conda environment is recreated every day. I actually found that I could not recreate the cached packages of the SLiM repo because the conda-forge channel was missing, so this might be a good idea.

Not sure exactly what you're saying here, sorry. I'm really clueless about this stuff. What are you saying you couldn't do, and what might be a good idea?

I also wonder: you say the cached environment gets recreated "every day", meaning once per day I guess? What about if tskit, msprime, pyslim, etc. have actually changed, within a day? Does it notice the version mismatch and do a recache? Or do we need to wait until the next day, when the recache will occur automatically, to get the new version?

  5. I have updated the GitHub Actions from the marketplace to the latest available versions.

Huh, OK. Good to stay current, I imagine. Does this have any practical consequences for us that you're aware of?

  6. I think apt update takes long enough that it is worth having a separate case that adds the toolchain PPA only when we require gcc-11.

Not sure what you mean here, sorry.

The runtime has not changed a lot. I ran both at the same time; my version takes around 21-22 minutes, while the previous version takes around 24 minutes. If you take into account that the previous version does not run the macOS jobs, I think it's fair to say that these changes did not make CI slower. That said, I was a bit surprised to find separate jobs for building the GUI and CLI. I would combine them, as that would probably make CI substantially faster.

If I understand what you're suggesting here, I think it does not work. Building the GUI is not just an additional target on top of the eidos, core, etc. targets; in fact, the eidos, core, etc. targets build differently when they are being built in the context of the GUI. (More precisely, there are places in their code where conditional compilation occurs, depending on whether they are being built for the CLI or for the GUI.) So reuse of the built libraries/targets between the command-line tools and the GUI is not possible. That's the reason the GUI and CLI are done independently in the CI (and when building targets in CMake, too, in fact).

So. Given my limited understanding of all this, I need to ask:

  • is this ready to merge, or do you have more you intend to do?
  • are there any drawbacks or problems with what you have done here, compared to the way it was before?
  • are you entirely happy with where this ended up? should I hesitate to merge it, for any reason, in your view?

Thanks very much for this work; it looks like a major cleanup, which was really needed; the way it was before was kind of accreted over time, with lots of changes by me that were made with little understanding of how GH CI really works. :->

currocam commented 1 month ago

Hi @bhaller! Sorry for not explaining myself properly; I will try again now :)

  1. I tried to homogenize the style

OK. That's hard to review with diffs, of course, but probably the changes are great. I think with this PR I will probably just have to take a flyer and trust that you know what you're doing. :->

I have installed the GitHub Actions linter from Microsoft in my editor, so I think there should not be many format/style issues with this PR.

  2. I moved the CI from conda to mamba. That should give better runtimes, but to be honest I did it so I could copy-paste the example of how to cache the environment from the documentation.

OK. I know very little about conda, and I don't think I've even heard of mamba before; so again I think I'll just trust that you know what you're doing.

Sorry for not giving more context. mamba is a drop-in replacement for conda written in C++ (rather than pure Python). It is faster, and they claim it resolves dependencies better (and it does in my experience), but I don't think choosing it over conda is a big deal (although someone might correct me).

I actually found that I could not recreate the cached packages of the SLiM repo because it was missing the conda-forge channel

My bad here; that is not true. There's nothing wrong with how the cache works on Ubuntu in the previous version. The caches used in master were created 3 days ago; I must have misspecified something.

I also wonder: you say the cached environment gets recreated "every day", meaning once per day I guess? What about if tskit, msprime, pyslim, etc. have actually changed, within a day? Does it notice the version mismatch and do a recache? Or do we need to wait until the next day, when the recache will occur automatically, to get the new version?

Yes, you are right. The cached conda environment gets recreated once per day. You could reach an inconsistent state between CI and a user if the CI in SLiM is triggered on the same day that a dependency (pyslim, for example) releases a breaking change on conda, but earlier than that release. In that scenario, the cache would be invalid until the next day. If you notice it, it is possible to reset the cache by (1) as you said, waiting until the next day, when the recache will occur automatically, or (2) increasing the CACHE_NUMBER value in the tests.yml file (see the comment in the file).
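
To make the reset mechanism concrete, here is a hedged sketch of the kind of cache key involved; the cached path and the exact key layout are assumptions, not necessarily what tests.yml actually uses:

```yaml
# Illustrative only: a date component and CACHE_NUMBER both enter the cache
# key, so the cache expires once per day and can also be invalidated
# immediately by bumping CACHE_NUMBER. The cached path is an assumption.
name: cache-key-example
on: [push]

env:
  CACHE_NUMBER: 0   # bump this to force an immediate rebuild of the cache

jobs:
  example:
    runs-on: ubuntu-20.04
    steps:
      - name: Get the current date
        id: date
        run: echo "today=$(date +'%Y-%m-%d')" >> "$GITHUB_OUTPUT"

      - name: Cache the conda environment
        uses: actions/cache@v4
        with:
          path: ~/conda_pkgs_dir
          key: conda-${{ runner.os }}-${{ steps.date.outputs.today }}-${{ env.CACHE_NUMBER }}
```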

Alternatively, we may just omit the cache, as recreating the environment is quite fast. The last CI run took 15 seconds to create the environment on Ubuntu 20.04, 44 seconds on macos-12, and 40 seconds on Windows. I guess that caching the conda environment decreases the runtime by 5 minutes at most (although most jobs run in parallel).

Also, if the goal is to find out as soon as possible that there has been a breaking change in a dependency, then I think we could consider adding a test that is triggered by a cron schedule (every night or something).
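
A minimal sketch of what such a scheduled trigger could look like; the exact schedule shown is a placeholder, not what was eventually committed:

```yaml
# Hypothetical example: run CI on pushes and pull requests, and also once a
# week (Mondays at 03:00 UTC) to catch breaking changes in dependencies even
# when no commits are being made.
name: tests
on:
  push:
  pull_request:
  schedule:
    - cron: "0 3 * * 1"   # every Monday at 03:00 UTC

jobs:
  tests:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v4
      - name: Placeholder for the normal build-and-test steps
        run: echo "build and test as usual"
```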

  5. I have updated the GitHub Actions from the marketplace to the latest available versions.

Huh, OK. Good to stay current, I imagine. Does this have any practical consequences for us that you're aware of?

I don't think so. I checked the CHANGELOG before doing so, and it did not seem to affect any of the flags we were using.

  6. I think apt update takes long enough that it is worth having a separate case that adds the toolchain PPA only when we require gcc-11.

Not sure what you mean here, sorry.

Sorry again for not providing enough context. The commit 4bcc36a you made is only strictly needed when gcc-11 is going to be installed. In my preliminary tests, I found that adding the ppa:ubuntu-toolchain-r/test repository makes the next step, installing the required gcc version, slower. So what I do is add the repository only if gcc-11 is actually going to be installed. It is a micro-optimization, though; I don't think it improves the runtime by more than a minute.
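
A hedged sketch of the idea (the job name and matrix version list are illustrative; only ppa:ubuntu-toolchain-r/test and gcc-11 come from the discussion above):

```yaml
# Illustrative sketch: add the toolchain PPA (and pay for the extra apt
# update it triggers) only for the matrix entry that actually needs gcc-11.
name: build-cli-example
on: [push]

jobs:
  build-cli:
    runs-on: ubuntu-20.04
    strategy:
      matrix:
        gcc: ["9", "11"]   # illustrative version list, not the real matrix
    steps:
      - uses: actions/checkout@v4

      - name: Add the toolchain PPA (only required for gcc-11)
        if: matrix.gcc == '11'
        run: sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test

      - name: Install the requested gcc
        run: |
          sudo apt-get update
          sudo apt-get install -y gcc-${{ matrix.gcc }} g++-${{ matrix.gcc }}
```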

If I understand what you're suggesting here, I think it does not work. Building the GUI is not just an additional target on top of the eidos, core, etc. targets; in fact, the eidos, core, etc. targets build differently when they are being built in the context of the GUI.

Thank you for clarifying this. Then, I think it is fine as it is (as separate jobs for GUI and CLI).

So. Given my limited understanding of all this, I need to ask:

  • is this ready to merge, or do you have more you intend to do?

  • are there any drawbacks or problems with what you have done here, compared to the way it was before?

  • are you entirely happy with where this ended up? should I hesitate to merge it, for any reason, in your view?

There are no drawbacks I am aware of, and I think it is ready to merge. I am happy with how it is right now. I considered maintaining the old CI for a while, to be extra sure there is nothing wrong; I don't think that is a silly idea. It would just require having both files in the workflow directory.

I'm not sure how the pricing of GitHub Actions works. I think public repos don't have to pay, but I guess there must be an upper limit, and I wonder how far away you are from it. If the number of hours is not an issue, I would also suggest triggering the CI every night/week automatically, not only with every commit. That would require a minimal change in the YML file.

I hope I was clearer this time!

bhaller commented 1 month ago

I have installed the GitHub Actions linter from Microsoft in my editor, so I think there should not be many format/style issues with this PR.

OK, great.

Sorry for not giving more context. mamba is a drop-in replacement for conda written in C++ (rather than pure Python). It is faster, and they claim it resolves dependencies better (and it does in my experience), but I don't think choosing it over conda is a big deal (although someone might correct me).

I see. Should be fine, we're not using it for anything fancy.

...The cached conda environment gets recreated once per day. You could reach an inconsistent state between CI and a user if the CI in SLiM is triggered on the same day that a dependency (pyslim, for example) releases a breaking change on conda, but earlier than that release. In that scenario, the cache would be invalid until the next day. If you notice it, it is possible to reset the cache by (1) as you said, waiting until the next day, when the recache will occur automatically, or (2) increasing the CACHE_NUMBER value in the tests.yml file (see the comment in the file).

OK, makes sense.

Alternatively, we may just omit the cache, as recreating the environment is quite fast. The last CI run took 15 seconds to create the environment on Ubuntu 20.04, 44 seconds on macos-12, and 40 seconds on Windows. I guess that caching the conda environment decreases the runtime by 5 minutes at most (although most jobs run in parallel).

Let's keep the cache for the time being. It might not make a huge difference, but it's still nice to keep our runtime down and decrease the load on their machines.

Also, if the goal is to find out as soon as possible that there has been a breaking change in a dependency, then I think we could consider adding a test that is triggered by a cron schedule (every night or something).

Huh. It's probably fine to just find out the next time that a build happens. But if this is actually easy to do, then perhaps once weekly would be a useful thing, for times when I'm not actively working in the code? Once per day seems excessive.

...I checked the CHANGELOG before doing so, and it did not seem to affect any of the flags we were using.

Thanks for your diligence!

Sorry again for not providing enough context. The commit 4bcc36a you made is only strictly needed when gcc-11 is going to be installed. In my preliminary tests, I found that adding the ppa:ubuntu-toolchain-r/test repository makes the next step, installing the required gcc version, slower. So what I do is add the repository only if gcc-11 is actually going to be installed. It is a micro-optimization, though; I don't think it improves the runtime by more than a minute.

Ah, I see. Makes sense.

...Thank you for clarifying this. Then, I think it is fine as it is (as separate jobs for GUI and CLI).

OK, good.

So. Given my limited understanding of all this, I need to ask:

  • is this ready to merge, or do you have more you intend to do?
  • are there any drawbacks or problems with what you have done here, compared to the way it was before?
  • are you entirely happy with where this ended up? should I hesitate to merge it, for any reason, in your view?

There are no drawbacks I am aware of, and I think it is ready to merge. I am happy with how it is right now. I considered maintaining the old CI for a while, to be extra sure there is nothing wrong; I don't think that is a silly idea. It would just require having both files in the workflow directory.

We can always revert if it turns out there's a problem. I'm not particularly worried about temporary breakage; we're not such a big fish that a day of downtime is a major problem. :->

I'm not sure how the pricing of GitHub Actions works. I think public repos don't have to pay, but I guess there must be an upper limit, and I wonder how far away you are from it. If the number of hours is not an issue, I would also suggest triggering the CI every night/week automatically, not only with every commit. That would require a minimal change in the YML file.

I'd say the number of hours is an issue; I think there is a max (although I don't think I've ever hit it), but it's also just a matter of being a good neighbor. It is really awesome that they give us free CI, and I don't want to abuse that.

I hope I was clearer this time!

Yes indeed, thanks!

So I'll just wait to merge for you to give me a final thumbs-up (because maybe you want to add that weekly cron task), and then I'll merge. :->

It's really great to have all this cleanup and improvement, in addition to fixing the focal issue. I really appreciate it. If CI issues come up in future, is it OK if I ping you again? (The CI for SLiM really doesn't change very much, beyond moving to new platforms as time passes.)

currocam commented 1 month ago

Hi again, I added the cron schedule trigger. I guess I will see next Monday whether it actually runs automatically or not.

Sure, ping me again if it fails :) I agree it is quite nice that they offer free CI, and on different machines. It seems like they offer 2,000 free minutes per month.

Edit: Forget about the estimate I gave. I am actually quite confused about how they calculate the number of minutes and whether all platforms count the same.

I'm glad you find what I did useful!

Cheers, Curro

petrelharp commented 1 month ago

Thank you SO MUCH!