Open · FrankFrank9 opened this issue 2 months ago
@FrankFrank9 Can you share what actions you are using?
Are you passing the MPI comm handle id as an option during ascent.open()?
Hi @cyrush, thanks for reaching out.
I'm using PyFR, where ascent_mpi.so is wrapped with ctypes. The Ascent wrappers are in the plugin directory, in ascent.py.
The simulation hangs only when multiple ranks/nodes are required to perform a render operation. I believe there is something wrong with the way I compile and link against my MPI implementation on the cluster.
The actions I use are qcriterion, iso-values and pseudocolor.
Are you passing the mpi comm handle id as an option during ascent.open()?
In PyFR this is how the MPI comm is passed to Ascent:
self.ascent_config['mpi_comm'] = comm.py2f()
where self.ascent_config is a ctypes wrapper around conduit.so.
Any idea what could be wrong?
Also, does it matter which VTK-m backend is specified? In PyFR there is currently:
self.ascent_config['runtime/vtkm/backend'] = vtkm_backend
with a 'serial' default value.
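For reference, here is roughly how those two options look when passed through Ascent's own Python module (a minimal sketch assuming mpi4py plus the conduit and ascent.mpi modules; it should mirror what the ctypes wrapper ends up doing):

from mpi4py import MPI
import conduit
import ascent.mpi

a = ascent.mpi.Ascent()
opts = conduit.Node()
# Fortran handle of the communicator, same idea as comm.py2f() in PyFR
opts["mpi_comm"] = MPI.COMM_WORLD.py2f()
# optional; "serial" is the safe default
opts["runtime/vtkm/backend"] = "serial"
a.open(opts)
# ... publish / execute / close as usual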
Best regards
@FrankFrank9
We do provide and test Python modules for Ascent and Conduit. There are some extra checks in there with respect to MPI vs. non-MPI; however, since you are directly using ascent_mpi, there should not be any confusion there.
The backend should not matter.
It is possible you have an error on an MPI task and we aren't seeing it.
Can you try the following:
self.ascent_config['exceptions'] = "forward"
That will allow exceptions to flow up, and will likely crash instead of hang.
If that happens, we know we have an error case rather than an algorithmic hang.
When compiling Ascent with enable_mpi=ON, it's important that the same modules you will use with PyFR are loaded.
@cyrush
self.ascent_config['exceptions'] = "forward"
I tried it, but I don't get any exceptions. Is that the correct syntax?
When compiling Ascent with enable_mpi=ON, it's important that the same modules you will use with PyFR are loaded.
I did that, but the hang persists. I also tried installing with enable_find_mpi="${enable_find_mpi:=ON}", but nothing changed.
An update on this: the hang happens in the execute call, specifically self.lib.ascent_execute(self.ascent_ptr, self.actions).
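To narrow down which ranks are involved, one option (a debugging sketch, not PyFR code; it assumes comm is the mpi4py communicator the plugin already holds) is to bracket that call with a barrier and per-rank prints, so you can see whether every rank reaches the call and which ranks never return:

comm.barrier()  # make sure every rank reaches this point
print(f"rank {comm.rank}: calling ascent_execute", flush=True)
self.lib.ascent_execute(self.ascent_ptr, self.actions)
print(f"rank {comm.rank}: ascent_execute returned", flush=True)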
Sorry that did not help us. Can you share your actions?
Also, can you try running a very simple action:
-
  action: "save_info"
This will create a YAML file (if successful) that might help us.
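If it is easier to drive this from Python than to drop an ascent_actions.yaml file next to the run, the same action can be built as a conduit actions node (a sketch, assuming the conduit Python module):

import conduit

actions = conduit.Node()
add_act = actions.append()
add_act["action"] = "save_info"
# pass `actions` to Ascent's execute as usual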
@cyrush
Thanks a lot for the help you're providing.
After adding save_info, the last .yaml produced by Ascent before the hang is attached:
out_ascent.txt
Hope this is useful. Even with exceptions forwarding active, nothing happens and no errors are thrown. (On 1 rank locally, exceptions are correctly forwarded.)
@cyrush An update on this: the hang manifests only in multi-node runs. Any idea where to look? Unfortunately I don't have the opportunity to test elsewhere.
@FrankFrank9
Sorry this mystery continues. I see some NaNs in some of our camera info outputs, but I don't think that would be the source of a hang.
When it hangs, do you get any of the three images you are trying to render? (Trying to narrow down where to look.) The contour of the q-criterion is the most complex pipeline.
Can you share how many MPI tasks + job nodes?
Can we coach you through writing a set of HDF5 files out via an Ascent extract, to see if we can reproduce?
@cyrush
Can you share how many MPI tasks + job nodes?
It is 80 MPI tasks over 10 nodes. But this happens whenever I use more than 1 node.
When it hangs, do you get any of the three images you are trying to render? (Trying to narrow down where to look.) The contour of the q-criterion is the most complex pipeline.
To be honest, it seems more or less random, but I noticed it is less frequent for scenes where only one render is called. Is there any blocking MPI operation when multiple renders are triggered on a scene?
Can we coach you through writing a set of HDF5 files out via an Ascent extract, to see if we can reproduce?
Yes, sure. Let me know.
@FrankFrank9
Here is an example ascent_actions.yaml for generating an extract of the data.
-
  action: "add_extracts"
  extracts:
    e1:
      type: "relay"
      params:
        path: "your_name_for_extract"
        protocol: "blueprint/mesh/hdf5"
This should generate a root file and a folder of HDF5 files (or just the root file if the data is small enough). Then we can hopefully use this extract to replicate your error.
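The same extract can also be added from the Python side via a conduit actions node, following the same pattern as the save_info snippet above (a sketch; "your_name_for_extract" is just a placeholder):

import conduit

actions = conduit.Node()
add_ext = actions.append()
add_ext["action"] = "add_extracts"
add_ext["extracts/e1/type"] = "relay"
add_ext["extracts/e1/params/path"] = "your_name_for_extract"
add_ext["extracts/e1/params/protocol"] = "blueprint/mesh/hdf5"
# then pass `actions` to Ascent's execute as usual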
Hello,
I have an issue where Ascent hangs my simulation when running with MPI on multiple cluster nodes.
I compile with:
env enable_mpi=ON ./build_ascent.sh
Has this ever happened before? Do you have any recommendations?
Best