conda-forge / status

https://conda-forge.org/status/

Travis CI linux-ppc64le jobs failing #185

Open mfansler opened 1 month ago

mfansler commented 1 month ago

Travis CI linux-ppc64le jobs have been failing on all conda-forge R feedstocks. The last successfully passing build was on Mon Jul 30 15:42:45 UTC 2024. Logs from the failed jobs show the Worker Information section (the first output) and then proceed no further.

mfansler commented 1 month ago

I have been in touch with Travis support via email but no resolution yet.

h-vetinari commented 1 month ago

I've seen the issue on the gtest feedstock as well, independently of R. In any case, I'm not 100% sure this qualifies as "major". According to the status page we build around 100-150x more on azure than on travis, so <0.5% of our builds are affected[^1], and it's possible (at least in principle) to switch them to azure (either emulated or cross-compiled).

I know this is splitting hairs a bit, so no need to change anything per se (I was thinking along the lines of avoiding a "boy who cried wolf" situation where people eventually don't take our status seriously, but one time isn't going to do that).

That said, thanks a lot for trying to get to the bottom of this @mfansler! 🙏

[^1]: halved from 1% because aarch builds are still working

jakirkham commented 1 month ago

Was debating between "degraded" and "major outage". Ok with using "degraded" instead

That said, this appears to be affecting all(?) native linux_ppc64le builds and the R migration. So it seemed worthy of it in this case

jakirkham commented 1 month ago

Mervin, have you heard anything from Travis CI?

FWIW it seems Travis users outside conda-forge have the same issue. So it is not just us

mfansler commented 1 month ago

No word since I created this. I just sent a ping to see if they have any updates.

h-vetinari commented 1 month ago

... https://github.com/conda-forge/conda-forge.github.io/issues/1521 ... 🙄

h-vetinari commented 1 month ago

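It's been more than a week. Any affected feedstocks should consider either of the following changes in conda-forge.yml:

  • moving to cross-compilation (might need recipe changes)
    build_platform:
      linux_ppc64le: linux_64
  • or emulation (much slower, but shouldn't need changes)
    provider:
      linux_ppc64le: azure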

dhirschfeld commented 1 month ago

Do you know of any example PR where a recipe was moved to using cross-compilation for linux_ppc64le?

h-vetinari commented 1 month ago

You mean for R or in general?

hmaarrfk commented 1 month ago

xref: https://github.com/conda/conda-build/issues/5349 (just linking here since I tried to move a package out of PPC64le and hit this)

h-vetinari commented 1 month ago

> xref: conda/conda-build#5349 (just linking here since I tried to move a package out of PPC64le and hit this)

That should be a very rare case though. Cross-compilation and noarch: python aren't often mixed, because if an output is actually noarch, it suffices to build it just once (e.g. on linux-64).

dhirschfeld commented 1 month ago

> You mean for R or in general?

In general. The actual feedstock where I'm hitting this is a go recipe.

mfansler commented 1 month ago

Don't know how much of this will be helpful for other contexts, but here's an example for conversion to cross-compilation on an R feedstock: https://github.com/conda-forge/r-phylobase-feedstock/pull/10

Our recipe (meta.yaml) updates include these changes to build:

For conda-forge.yml, we use (as already mentioned):

build_platform:
  linux_ppc64le: linux_64
test: native_and_emulated

NB: I usually switch linux_aarch64 to cross-compile as well. If one works they usually both work and the cross-compilation has negligible time difference.
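
Applied to both architectures, the conda-forge.yml would look roughly like the sketch below. The linux_aarch64 entry is assumed by analogy with the linux_ppc64le one above, and the feedstock needs to be rerendered after the change.

```yaml
# conda-forge.yml (sketch): build both non-x86 Linux targets on linux_64
build_platform:
  linux_ppc64le: linux_64
  # assumed by analogy with the ppc64le entry above
  linux_aarch64: linux_64
# also run the tests of the cross-compiled targets under emulation
test: native_and_emulated
```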

It is not infrequent that we also need to patch the source's build scripts. Since CRAN builds everything natively, our upstreams do not always consider cross-compilation, e.g., they use autoconf scripts that include run tests (checks that execute a just-compiled program). Often it is easiest to simply skip such configure scripts and directly provide pre-determined compilation flags.

mfansler commented 1 month ago

> It's been more than a week. Any affected feedstocks should consider either of the following changes in conda-forge.yml:
>
>   • moving to cross-compilation (might need recipe changes)
>     build_platform:
>       linux_ppc64le: linux_64
>   • or emulation (much slower, but shouldn't need changes)
>     provider:
>       linux_ppc64le: azure

Just want to clarify the explicit combinations here:

| build_platform | provider | CI - Build Mode |
|---|---|---|
| linux_ppc64le | default | Travis CI - native |
| linux_ppc64le | azure | Azure - emulate (slow!) |
| linux_64 | default | Azure - cross-compile (+ emulated tests) |
| linux_64 | azure | Azure - cross-compile (+ emulated tests) |

minrk commented 1 month ago

I can't find any real competitors to Travis for IBM architectures. But I did find that OSU's Open Source Lab hosts (IBM sponsored) Jenkins instances for ppc and s390x for open source. I'm guessing they are not really prepared to handle conda-forge's scale, but it might be worth a contact in any case.

beckermr commented 1 month ago

Thanks Min. We had access to those for a long time now. Agreed they are not really for our scale.

jakirkham commented 4 weeks ago

Has there been any word from Travis CI on this issue?

mfansler commented 4 weeks ago

Nothing through my email. I am also unable to view the ticket they created (it always asks for "Sign-in", then dumps me on the Dashboard). Maybe someone from Core should take over.

jakirkham commented 4 weeks ago

Thanks Mervin! 🙏

Have we seen any Travis CI builds run on linux-ppc64le (including non-R ones)?

mfansler commented 4 weeks ago

Just pinged Support again and they replied promptly that the issue is still active/visible on their end and that they are working on it.

"We are actively investigating this issue. It seems we are getting network related timeouts and we're still troubleshooting on our side about this."

hmaarrfk commented 4 weeks ago

Maybe we can default to cross-compilation + Azure and reduce our system usage there to be nice?

jakirkham commented 4 weeks ago

Yes I made this suggestion here: https://github.com/regro/cf-scripts/issues/2930

Edit: On the R side, think Mervin has been using emulation on Azure. Though think there is some work to look at cross-compilation

h-vetinari commented 4 weeks ago

> On the R side, think Mervin has been using emulation on Azure. Though think there is some work to look at cross-compilation

He commented further upthread on how to cross-compile R recipes.

h-vetinari commented 4 weeks ago

> > You mean for R or in general?
>
> In general. The actual feedstock where I'm hitting this is a go recipe.

For C/C++ recipes, there's not much to do except change conda-forge.yml and rerender. The compilers populated by {{ compiler("c") }} etc. will automatically get the right target, with the right activation. For CMake builds, it's good to pass $CMAKE_ARGS to the first CMake call, because that contains a bunch of relevant configuration.
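
For a CMake project, that could look like the meta.yaml fragment below. It is only a sketch: the build directory name and the inline script (instead of a build.sh) are illustrative choices, the explicit install prefix is a precaution rather than something known to be required, and the variable expansion assumes a Unix shell.

```yaml
# meta.yaml fragment (sketch, Unix only)
build:
  script:
    # pass CMAKE_ARGS to the first (configure) call so the
    # cross-compilation settings from the compiler activation are used
    - cmake ${CMAKE_ARGS} -DCMAKE_INSTALL_PREFIX=${PREFIX} -S . -B build
    - cmake --build build
    - cmake --install build

requirements:
  build:
    - {{ compiler("c") }}
    - {{ compiler("cxx") }}
    - cmake
    - make
```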

The main problem in cross-compilation is that you cannot simply run things (e.g. just-built utilities) during the build process, because the architecture you're compiling for doesn't match what you're running on. That's also why you need the respective dependencies in the build: environment (with # [build_platform != target_platform]). That's where the cross-python and cross-r stuff come in. For python builds, often numpy/cython etc. are necessary, but basically anything else that's needed for stuff to run at build time.
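
As a rough illustration of that pattern for a Python package with a compiled extension, the requirements could be laid out as below. The package list is hypothetical (cython and numpy are just the examples named above). For R, the analogous entry should be cross-r-base, though that is not shown here.

```yaml
# meta.yaml fragment (sketch): requirements when cross-compiling
requirements:
  build:
    # these have to run on the build machine, so they are only added
    # when build_platform differs from target_platform
    - python                                 # [build_platform != target_platform]
    - cross-python_{{ target_platform }}     # [build_platform != target_platform]
    - cython                                 # [build_platform != target_platform]
    - numpy                                  # [build_platform != target_platform]
    - {{ compiler("c") }}
  host:
    # the same packages again, this time for the target architecture
    - python
    - pip
    - cython
    - numpy
  run:
    - python
```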

Rust recipes seem to cross-compile without many complications (from a few I've looked at recently), but I'm not familiar with what's necessary for go recipes. The org-wide github-search is very useful for finding this sort of thing though. First impression is that you'll have to pay attention to GOARCH.

mfansler commented 4 weeks ago

Yes, I've mostly been avoiding emulation except in a few edge cases that would require heavier patching. The issue in R packages is that the R build process can sometimes involve loading the built library (e.g., to render help). In such cases I'll emulate.

h-vetinari commented 3 weeks ago

Seeing aarch builds fail as well now: https://github.com/conda-forge/povray-feedstock/pull/19

jaimergp commented 3 weeks ago

Should we edit the title and issue description to reflect this new information?

mfansler commented 2 weeks ago

@jaimergp there are still other linux-aarch64 jobs passing - I'm not convinced that wasn't a sporadic failure. But if non-R feedstocks are seeing consistent failures, the issue description could be generalized.

mfansler commented 2 weeks ago

Travis CI reports to have resolved the issue and I have confirmed with several jobs that linux-ppc64le runs are indeed running normally again.

jaimergp commented 2 weeks ago

Sounds like we can close this soon, then? Let's keep it open for a few more hours just in case, but will close by EOD if we can confirm it's working.

jaimergp commented 2 weeks ago

Checked https://app.travis-ci.com/github/conda-forge and there are several feedstocks with passing builds for both PPC and ARM from a few hours ago (e.g. https://app.travis-ci.com/github/conda-forge/databricks-cli-feedstock/builds/272058545?serverType=git). I'll close. Thanks for keeping an eye on this @mfansler!

jakirkham commented 2 weeks ago

Glad this is improving! 🥳

That said, did just see a new instance of this

So doesn't seem like this is fully resolved yet

h-vetinari commented 2 weeks ago

Looking at the travis dashboard, this still seems to be happening to ~50% of PPC jobs (which just get cancelled).

h-vetinari commented 2 weeks ago

At least that aspect can be cured by restarting the job though.

mfansler commented 2 weeks ago

Yeah, looks like it's back to the previous baseline with something like 10%-25% sporadic failure.