streeve closed this 1 year ago
@brtnfld, any ideas about the builds hanging on Fedora linked above? It only happens with the HDF5 test and MPI>1 (more than one rank).
I'm not sure. Is there a way to get more verbose output?
I can try once we merge #628
@brtnfld, I got more verbose output here: https://github.com/ECP-copa/Cabana/actions/runs/5002925248/jobs/8963496148
It is stuck in building Cabana with no output. I don't see an issue in the completed openmpi run.
@junghans @sslattery, any ideas here? It may eventually go away with a fix in openmpi-devel.
e54771bc9272:rank0.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)e54771bc9272:rank0.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24:
24: e54771bc9272:rank0.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank1.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank1: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank1.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank1: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank1.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank1: PSM3 can't open nic unit: 0 (err=23)
24: e54771bc9272:rank0.Cabana_HDF5ParticleOutput_MPI_test_SERIAL: Failed to get eth0 (unit 0) cpu set
24: e54771bc9272:rank0: PSM3 can't open nic unit: 0 (err=23)
24: --------------------------------------------------------------------------
24: Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
24: unusual; your job may behave unpredictably (and/or abort) after this.
24:
24: Local host: e54771bc9272
24: Location: mtl_ofi_component.c:509
24: Error: Invalid argument (22)
24: --------------------------------------------------------------------------
@junghans, after looking one more time without finding a fix, I think we may as well disable this for the moment. I'm also not sure why HDF5 is being built in at all, since it should have to be enabled explicitly.
The missing Jenkins runs are unrelated.
Run showing the consistent hang and timeout: https://github.com/ECP-copa/Cabana/actions/runs/4975748365