HDFGroup / hdf5

Official HDF5® Library Repository
https://www.hdfgroup.org/

Fix broken Julia CI #4539

Closed derobins closed 1 month ago

derobins commented 4 months ago

The Julia GitHub CI actions have been broken for the past week or two, both in Autotools and CMake. There were no obvious changes that could have caused these failures. They will often pass when re-run.

We will need to investigate why they are failing. Since it's random, it may be a memory issue, either in the Julia wrappers or the HDF5 library.

derobins commented 4 months ago

Sample test failure output:

Test Summary:                      | Pass  Fail  Broken  Total
HDF5.jl                            | 1497     2       3   1502
  plain                            |  151             1    152
  complex                          |   13                   13
  undefined and null               |    4                    4
  abstract arrays                  |    2                    2
  empty and 0-size arrays          |   39                   39
  generic read of native types     |   17                   17
  show                             |   44                   44
  split1                           |   13                   13
  haskey                           |   18                   18
  AbstractString                   |   51                   51
  opaque data                      |    7                    7
  FixedStrings and FixedArrays     |   18                   18
  Object Exists                    |    8                    8
  HDF5 existance                   |    4                    4
  bounds                           |    2                    2
  create_dataset                   |  264                  264
  Strings                          |    8                    8
  h5a_iterate                      |    7     1              8
  h5l_iterate                      |    7     1              8
  h5dchunk_iter                    |    3                    3
  compound                         |   10                   10
  create_dataset (compound)        |    4                    4
  write_compound                   |   27                   27
  custom                           |    6                    6
  reference                        |    6                    6
  null dataspace                   |   13                   13
  scalar dataspace                 |   15                   15
  simple dataspaces                |   98                   98
  BlockRange                       |   42                   42
  hyperslab                        |    6                    6
  Datatypes                        |   15                   15
  hyperslab                        |    5                    5
  read 0-length arrays: issue #859 |                     No tests
  attrs interface                  |   92                   92
  variable length strings          |    1                    1
  readremote                       |   23                   23
  extend                           |   29                   29
  gc                               |  101                  101
  external                         |    6                    6
  swmr                             |    4                    4
  mmap                             |    9                    9
  properties                       |   46             1     47
  filter                           |   80                   80
  Raw Chunk I/O                    |   80                   80
  fileio                           |    6                    6
  track order                      |   18                   18
  h5f_get_dset_no_attrs_hint       |    6                    6
  non-allocating methods           |   11             1     12
  Compression Filter Unit Tests    |    6                    6
  Object API                       |   38                   38
  virtual dataset                  |    5                    5
  mpio                             |    1                    1
ERROR: LoadError: Some tests did not pass: 1497 passed, 2 failed, 0 errored, 3 broken.
in expression starting at /home/runner/work/hdf5/hdf5/test/runtests.jl:34
ERROR: LoadError: Package HDF5 errored during testing
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Types.jl:55
 [2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing)
   @ Pkg.Operations /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:1712
 [3] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Vector{String}, test_args::Cmd, kwargs::Base.Iterators.Pairs{Symbol, IOContext{Base.PipeEndpoint}, Tuple{Symbol}, NamedTuple{(:io,), Tuple{IOContext{Base.PipeEndpoint}}}})
   @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:343
 [4] test(pkgs::Vector{Pkg.Types.PackageSpec}; io::IOContext{Base.PipeEndpoint}, kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:coverage, :julia_args), Tuple{Bool, Vector{String}}}})
   @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:80
 [5] test(; name::Nothing, uuid::Nothing, version::Nothing, url::Nothing, rev::Nothing, path::Nothing, mode::Pkg.Types.PackageMode, subdir::Nothing, kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:coverage, :julia_args), Tuple{Bool, Vector{String}}}})
   @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:96
 [6] top-level scope
   @ ~/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:15
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:444
 [8] top-level scope
   @ none:1
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: Process completed with exit code 1.

derobins commented 4 months ago

@mkitti - Any ideas?

mkitti commented 4 months ago

Could you point me to the CI output?

These both point to issues with the callback mechanism for the iteration functions. I'm not sure which exact test is failing yet though.

mkitti commented 4 months ago

Incidentally, we also seem to be having some issues with Windows builds lately: https://github.com/JuliaPackaging/Yggdrasil/pull/8588

derobins commented 4 months ago

Error output (Autotools) here:

https://github.com/HDFGroup/hdf5/actions/runs/9333687611/job/25707113092

Any recent test failure in HDF5 will likely be a Julia failure.

derobins commented 4 months ago

> Could you point me to the CI output?
>
> These both point to issues with the callback mechanism for the iteration functions. I'm not sure which exact test is failing yet though.

Yeah, with the randomness of the error, my guess is that there is some uninitialized memory usage someplace. Maybe -fsanitize=memory on clang would help.

mkitti commented 4 months ago

Yes, I'm noticing the randomness as well. The issue appears to involve an error being thrown within the Julia callback function. The error gets caught by a Julia try-catch and the callback returns -1.

The problem is that after iteration stops, we are not receiving the error code upon return of H5Aiterate2.

The CI test that is failing checks to see that an error is received when the callback throws an error. The test fails because the error is not detected.

The Julia error reference itself is returned via opdata.
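
To make the contract described above concrete, here is a minimal Python sketch of it. It is not HDF5.jl code and it does not call HDF5. `mock_iterate` is a hypothetical stand-in for an HDF5-style iterator such as H5Aiterate2, and `make_callback`, `iterate_or_raise`, and `fail_on_b` are illustrative names. It assumes the documented HDF5 convention that a callback returning a non-zero value stops iteration and that value becomes the iterator's return value, with negative meaning failure.

```python
from typing import Callable, Dict, Iterable, Optional

OpData = Dict[str, Optional[BaseException]]
Callback = Callable[[str, OpData], int]


def mock_iterate(names: Iterable[str], op: Callback, op_data: OpData) -> int:
    """Hypothetical stand-in for an HDF5-style iterator (not the real API).

    Calls `op` once per name. A non-zero status stops iteration and is
    returned to the caller; 0 means every callback asked to continue.
    """
    for name in names:
        status = op(name, op_data)
        if status != 0:
            return status
    return 0


def make_callback(user_fn: Callable[[str], None]) -> Callback:
    """Wrap a function that may raise into a status-code callback."""
    def op(name: str, op_data: OpData) -> int:
        try:
            user_fn(name)
        except Exception as exc:
            # Stash the exception in op_data and signal failure with -1,
            # mirroring the try-catch in the wrapper described above.
            op_data["error"] = exc
            return -1
        return 0
    return op


def iterate_or_raise(names: Iterable[str], user_fn: Callable[[str], None]) -> None:
    """Re-raise on the caller's side when the iterator reports failure."""
    op_data: OpData = {"error": None}
    status = mock_iterate(names, make_callback(user_fn), op_data)
    if status < 0:
        raise RuntimeError(f"iteration failed with status {status}") from op_data["error"]


def fail_on_b(name: str) -> None:
    """Example user function that throws on one particular name."""
    if name == "b":
        raise ValueError(f"bad attribute: {name}")
```

In this model, `iterate_or_raise(["a", "b", "c"], fail_on_b)` raises, because the -1 from the callback reaches the caller. The failing CI test checks for the equivalent behavior. The symptom reported here corresponds to the iterator's return value not carrying that -1 back, so the caller never sees the error.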

mkitti commented 4 months ago

I'm preparing to disable the affected tests here: https://github.com/JuliaIO/HDF5.jl/pull/1155

I will merge shortly.

mkitti commented 4 months ago

I have a successful CI run here: https://github.com/JuliaIO/HDF5.jl/actions/runs/9341332687/attempts/1

I'm running it one more time before I merge to make sure that there are no stochastic errors now.