firedrakeproject / firedrake

Firedrake is an automated system for the portable solution of partial differential equations using the finite element method (FEM)
https://firedrakeproject.org

BUG: CheckpointFile mesh saving crash #3632

Open Ainlina opened 1 week ago

Ainlina commented 1 week ago

Describe the bug

Saving a mesh to a CheckpointFile fails with PETSc error code 76.

Steps to Reproduce

The following code produces the error:

import firedrake as fd

mesh = fd.UnitSquareMesh(10, 10)
with fd.CheckpointFile("test.h5", "w") as h5_handle:
    h5_handle.save_mesh(mesh)

Expected behavior

This code worked in older versions of Firedrake. The MWE is based on code I have used successfully in the past and that has since broken, although I do not remember exactly which version it last worked with.

Error message

Traceback (most recent call last):
  File "/root/phd/mwes/checkpoint_file_crash.py", line 5, in <module>
    h5_handle.save_mesh(mesh)
  File "petsc4py/PETSc/Log.pyx", line 188, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "petsc4py/PETSc/Log.pyx", line 189, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/firedrake/firedrake/src/firedrake/firedrake/checkpointing.py", line 606, in save_mesh
    self._save_mesh_topology(tmesh)
  File "petsc4py/PETSc/Log.pyx", line 188, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "petsc4py/PETSc/Log.pyx", line 189, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/firedrake/firedrake/src/firedrake/firedrake/checkpointing.py", line 698, in _save_mesh_topology
    topology_dm.topologyView(viewer=self.viewer)
  File "petsc4py/PETSc/DMPlex.pyx", line 3016, in petsc4py.PETSc.DMPlex.topologyView
petsc4py.PETSc.Error: error code 76
[0] PetscViewerDestroy() at /home/firedrake/petsc/src/sys/classes/viewer/interface/view.c:101
[0] PetscViewerDestroy_HDF5() at /home/firedrake/petsc/src/sys/classes/viewer/impls/hdf5/hdf5v.c:126
[0] PetscViewerFileClose_HDF5() at /home/firedrake/petsc/src/sys/classes/viewer/impls/hdf5/hdf5v.c:107
[0] Error in external library
[0] Error in HDF5 call H5Fclose() Status -1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/phd/mwes/checkpoint_file_crash.py", line 4, in <module>
    with fd.CheckpointFile("test.h5", "w") as h5_handle:
  File "/home/firedrake/firedrake/src/firedrake/firedrake/checkpointing.py", line 535, in __exit__
    self.close()
  File "/home/firedrake/firedrake/src/firedrake/firedrake/checkpointing.py", line 1516, in close
    self.viewer.destroy()
  File "petsc4py/PETSc/Viewer.pyx", line 172, in petsc4py.PETSc.Viewer.destroy
petsc4py.PETSc.Error: error code 76
[0] PetscViewerDestroy() at /home/firedrake/petsc/src/sys/classes/viewer/interface/view.c:101
[0] PetscViewerDestroy_HDF5() at /home/firedrake/petsc/src/sys/classes/viewer/impls/hdf5/hdf5v.c:126
[0] PetscViewerFileClose_HDF5() at /home/firedrake/petsc/src/sys/classes/viewer/impls/hdf5/hdf5v.c:107
[0] Error in external library
[0] Error in HDF5 call H5Fclose() Status -1

Environment:

Output of firedrake-status:

 /home/firedrake/firedrake/bin/firedrake-status:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').require('firedrake==0.13.0+6018.g8597ec960')
Firedrake Configuration:
    package_manager: False
    minimal_petsc: False
    mpicc: /home/firedrake/petsc/packages/bin/mpicc
    mpicxx: /home/firedrake/petsc/packages/bin/mpicxx
    mpif90: /home/firedrake/petsc/packages/bin/mpif90
    mpiexec: /home/firedrake/petsc/packages/bin/mpiexec
    disable_ssh: True
    honour_petsc_dir: True
    with_parmetis: False
    slepc: True
    packages: ['git+ssh://bitbucket.org/pefarrell/fascd.git@master#egg=fascd', 'git+ssh://github.com/firedrakeproject/gusto.git@main#egg=gusto', 'git+ssh://github.com/FEMlium/FEMlium.git@main#egg=FEMlium', 'git+ssh://bitbucket.org/pefarrell/fascd.git@master#egg=fascd', 'git+ssh://github.com/thetisproject/thetis#egg=thetis', 'git+ssh://github.com/FEMlium/FEMlium.git@main#egg=FEMlium', 'git+ssh://github.com/firedrakeproject/Irksome.git#egg=Irksome', 'git+ssh://github.com/icepack/icepack.git#egg=icepack', 'git+ssh://github.com/thetisproject/thetis#egg=thetis', 'git+ssh://github.com/firedrakeproject/Irksome.git#egg=Irksome', 'git+ssh://github.com/firedrakeproject/gusto.git@main#egg=gusto', 'git+ssh://github.com/icepack/icepack.git#egg=icepack']
    honour_pythonpath: False
    opencascade: False
    tinyasm: True
    torch: cpu
    petsc_int_type: int32
    cache_dir: /home/firedrake/firedrake/.cache
    complex: False
    remove_build_files: False
    with_blas: None
    netgen: True
Additions:
    None
Environment:
    PYTHONPATH: None
    PETSC_ARCH: default
    PETSC_DIR: /home/firedrake/petsc
Status of components:
---------------------------------------------------------------------------
|Package             |Branch                        |Revision  |Modified  |
---------------------------------------------------------------------------
|FEMlium             |main                          |0a5e69c   |True      |
|FInAT               |master                        |e2805c4   |True      |
|Irksome             |master                        |8cf521e   |True      |
|PyOP2               |master                        |e0a4d3a9  |False     |
|TinyASM             |master                        |015a89a   |True      |
|fascd               |master                        |956cc98   |True      |
|fiat                |master                        |e7b2909   |True      |
|firedrake           |master                        |8597ec960 |False     |
|gusto               |main                          |255dc6f3  |True      |
|h5py                |firedrake                     |4c01efa9  |True      |
|icepack             |master                        |28eed36   |True      |
|libsupermesh        |master                        |84becef   |True      |
|loopy               |main                          |8158afdb  |True      |
|ngsPETSc            |main                          |533574f   |True      |
|pyadjoint           |master                        |2c6614d   |True      |
|pytest-mpi          |main                          |a478bc8   |True      |
|thetis              |master                        |70aa071d  |True      |
|tsfc                |master                        |799191d   |True      |
|ufl                 |master                        |054b0617  |True      |
---------------------------------------------------------------------------

Additional Info

The issue also occurs in my older Firedrake Apptainer images on both devices. These images are sandboxes and have had other things installed into them, such as an IDE.

ksagiyam commented 1 week ago

As suggested on Slack, this might be an issue with HDF5. I would first try using an older HDF5 on the HPC and see if that fixes it.

cd /your/path
wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.2/src/hdf5-1.12.2.tar.bz2
export PETSC_CONFIGURE_OPTIONS="--download-hdf5=/your/path/hdf5-1.12.2.tar.bz2"
python3 firedrake-install

Note that, due to issue https://github.com/firedrakeproject/firedrake/issues/3514, you might need to add petsc_options.discard("--download-hdf5") to the script at https://github.com/firedrakeproject/firedrake/blob/cb77d32ac8559d8d57fe0ef0efe251d9cbaf2309/scripts/firedrake-install#L823 before running firedrake-install.

Ainlina commented 1 week ago

I am using the Docker image to avoid any potential install weirdness; is a correct HDF5 supposed to be bundled in there? I don't have root on these machines, so I can only run firedrake-install inside an Apptainer sandbox, which tends to be quite janky and often fails.

ksagiyam commented 1 week ago

The Docker image contains the latest HDF5 (1.14). 1.14 seems to work mostly fine, but some users have had to downgrade to 1.12 in their setups, for reasons that are not yet clear.

Ainlina commented 1 week ago

This appears to be related to network file systems: saving to /tmp seems to work, so as a workaround I'll save there and then move the file. I don't have access to any local disk storage on my machine because of how my organisation has set it up.
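For anyone hitting the same problem, the workaround described above could be wrapped up roughly as follows. This is only a sketch for serial runs: the helper name write_locally_then_move is hypothetical (it is not part of Firedrake), and it assumes the system temporary directory (usually /tmp) is on local storage. In a multi-node MPI run, /tmp is typically node-local, so this would need more care.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def write_locally_then_move(final_path, local_dir=None):
    """Yield a path on local storage, then move the file to final_path.

    Hypothetical helper: local_dir defaults to the system temporary
    directory, which is assumed not to be a network file system.
    """
    local_dir = local_dir or tempfile.gettempdir()
    # A private scratch directory avoids name clashes with other jobs.
    scratch_dir = tempfile.mkdtemp(dir=local_dir)
    local_path = os.path.join(scratch_dir, os.path.basename(final_path))
    try:
        yield local_path
        # Only reached if the body of the with-block did not raise.
        shutil.move(local_path, final_path)
    finally:
        shutil.rmtree(scratch_dir, ignore_errors=True)


# Intended use with the MWE from this issue (requires Firedrake):
#
#     import firedrake as fd
#     mesh = fd.UnitSquareMesh(10, 10)
#     with write_locally_then_move("test.h5") as local_path:
#         with fd.CheckpointFile(local_path, "w") as h5_handle:
#             h5_handle.save_mesh(mesh)
```

The CheckpointFile is closed before the move happens, so the HDF5 file is fully written on local disk and only a plain file copy touches the network file system.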