ECP-copa / Cabana

Performance-portable library for particle-based simulations

Disable HDF5 in nightly build for now #626

Closed streeve closed 1 year ago

streeve commented 1 year ago

Issues with consistent hang & timeout: https://github.com/ECP-copa/Cabana/actions/runs/4975748365

streeve commented 1 year ago

@brtnfld any ideas about the Fedora builds hanging in the run linked above? It only happens with the HDF5 test and more than one MPI rank.

brtnfld commented 1 year ago

I'm not sure. Is there a way to get more verbose output?

streeve commented 1 year ago

> I'm not sure. Is there a way to get more verbose output?

I can try once we merge #628

streeve commented 1 year ago

@brtnfld I got more verbose output here: https://github.com/ECP-copa/Cabana/actions/runs/5002925248/jobs/8963496148

brtnfld commented 1 year ago

It is stuck in building Cabana with no output. I don't see an issue in the completed openmpi run.

streeve commented 1 year ago

@junghans @sslattery any ideas here? This may eventually go away with a fix in openmpi-devel.

e54771bc9272:rank0.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)e54771bc9272:rank0.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: 
24: e54771bc9272:rank0.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank1.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank1: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank1.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank1: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank1.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank1: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank0.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)
24: --------------------------------------------------------------------------
24: Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
24: unusual; your job may behave unpredictably (and/or abort) after this.
24: 
24:   Local host: e54771bc9272
24:   Location: mtl_ofi_component.c:509
24:   Error: Invalid argument (22)
24: --------------------------------------------------------------------------
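The log above is noisy because output from both ranks is interleaved. A minimal sketch for summarizing it, assuming the log has been saved as text: count the PSM3 "can't open nic" errors per rank and list the libfabric calls that Open MPI reports as failed. The names `summarize`, `PSM3_ERROR`, `OFI_ERROR`, and `SAMPLE_LOG` are hypothetical helpers for illustration, not part of Cabana or Open MPI; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# A few lines copied from the CI log above; ctest prefixes output with "24: ".
SAMPLE_LOG = """\
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank1.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank1: PSM3 can't open nic unit: 0 (err=23)
24: Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
"""

# Patterns are written against the message text seen in this log only.
PSM3_ERROR = re.compile(r"rank(\d+): PSM3 can't open nic unit: (\d+) \(err=(\d+)\)")
OFI_ERROR = re.compile(r"Open MPI failed an OFI Libfabric library call \((\w+)\)")


def summarize(log_text):
    """Return (PSM3 error count per rank, list of failed libfabric calls)."""
    psm3_by_rank = Counter()
    # Search the whole text rather than line by line, so that two messages
    # fused onto one line by interleaved rank output are both counted.
    for match in PSM3_ERROR.finditer(log_text):
        psm3_by_rank[int(match.group(1))] += 1
    ofi_calls = [match.group(1) for match in OFI_ERROR.finditer(log_text)]
    return dict(psm3_by_rank), ofi_calls


if __name__ == "__main__":
    per_rank, failed_calls = summarize(SAMPLE_LOG)
    print("PSM3 errors per rank:", per_rank)
    print("Failed libfabric calls:", failed_calls)
```

This only summarizes what the log already says. It does not explain why the PSM3 provider fails to open the NIC inside the container.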

streeve commented 1 year ago

@junghans after looking one more time and finding no fix, I think we may as well disable this for the moment. I'm not sure why HDF5 is being built in anyway, since it should have to be explicitly enabled.
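One generic way to keep an optional dependency out of a CMake build is the standard `CMAKE_DISABLE_FIND_PACKAGE_<PackageName>` variable. A minimal sketch of assembling such a configure command follows. It assumes HDF5 is located through a non-`REQUIRED` `find_package(HDF5)` call, since the variable does not disable `REQUIRED` lookups. The helper `configure_args` and the directory names are hypothetical, and the thread does not say which mechanism the nightly build actually uses.

```python
def configure_args(source_dir, build_dir, disable_hdf5=True):
    """Build a cmake configure command line as an argv list.

    Assumption: HDF5 is found via a non-REQUIRED find_package(HDF5) call.
    """
    args = ["cmake", "-S", source_dir, "-B", build_dir]
    if disable_hdf5:
        # Standard CMake variable: makes find_package(HDF5) report "not found".
        args.append("-DCMAKE_DISABLE_FIND_PACKAGE_HDF5=ON")
    return args


if __name__ == "__main__":
    # Hypothetical source and build directories, for illustration only.
    print(" ".join(configure_args(".", "build")))
```

The resulting list can be passed to `subprocess.run` or joined into a single line for a CI script.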

streeve commented 1 year ago

The missing Jenkins runs are unrelated.