Open · FrankFrank9 opened this issue 2 months ago
@FrankFrank9 Can you share what actions you are using?
Are you passing the MPI comm handle id as an option during ascent.open()?
Hi @cyrush, thanks for reaching out.
I'm using PyFR, where ascent_mpi.so is wrapped with ctypes. The Ascent wrappers are in the plugin directory, in ascent.py.
The simulation hangs only when multiple ranks/nodes are required to perform a render operation. I believe there is something wrong with the way I compile and link against my MPI implementation on the cluster.
The actions I use are qcriterion, iso-values and pseudocolor.
Are you passing the mpi comm handle id as an option during ascent.open()?
In PyFR this is how the MPI comm is passed to Ascent:
self.ascent_config['mpi_comm'] = comm.py2f()
where self.ascent_config is a ctypes wrapper around conduit.so.
Any idea what could be wrong?
Also, does it matter which VTK-m backend is specified? In PyFR there is currently:
self.ascent_config['runtime/vtkm/backend'] = vtkm_backend
with a 'serial' default value.
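For reference, here is roughly how those two options look when passed through Ascent's own Python module (a minimal sketch assuming mpi4py plus the conduit and ascent.mpi modules; it should mirror what the ctypes wrapper ends up doing):

from mpi4py import MPI
import conduit
import ascent.mpi

a = ascent.mpi.Ascent()
opts = conduit.Node()
# Fortran handle of the communicator, same idea as comm.py2f() in PyFR
opts["mpi_comm"] = MPI.COMM_WORLD.py2f()
# optional; "serial" is the safe default
opts["runtime/vtkm/backend"] = "serial"
a.open(opts)
# ... publish / execute / close as usual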
Best regards
@FrankFrank9
We do provide and test Python modules for Ascent and Conduit. There are some extra checks in there with respect to MPI vs. non-MPI; however, since you are directly using ascent_mpi, there should not be any confusion there.
The backend should not matter.
It is possible you have an error on an MPI task and we aren't seeing it.
Can you try the following:
self.ascent_config['exceptions'] = "forward"
That will allow exceptions to flow up, and will likely crash instead of hang.
If that happens, we know we have an error case rather than an algorithmic hang.
When compiling Ascent with enable_mpi=ON, it's important that the same modules you will use with PyFR are loaded.
@cyrush
self.ascent_config['exceptions'] = "forward"
I tried it, but I don't get any exceptions. Is that the correct syntax?
When compiling Ascent with enable_mpi=ON, it's important that the same modules you will use with PyFR are loaded.
I did that, but the hang persists. I also tried installing with enable_find_mpi="${enable_find_mpi:=ON}", but nothing changed.
An update on this: the hang happens in the execute call, specifically self.lib.ascent_execute(self.ascent_ptr, self.actions).
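To narrow down which ranks are involved, one option (a debugging sketch, not PyFR code; it assumes comm is the mpi4py communicator the plugin already holds) is to bracket that call with a barrier and per-rank prints, so you can see whether every rank reaches the call and which ranks never return:

comm.barrier()  # make sure every rank reaches this point
print(f"rank {comm.rank}: calling ascent_execute", flush=True)
self.lib.ascent_execute(self.ascent_ptr, self.actions)
print(f"rank {comm.rank}: ascent_execute returned", flush=True)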
Sorry that did not help us. Can you share your actions?
Also, can you try running a very simple action:
-
  action: "save_info"
This will create a YAML file (if successful) that might help us.
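If it is easier to drive this from Python than to drop an ascent_actions.yaml file next to the run, the same action can be built as a conduit actions node (a sketch, assuming the conduit Python module):

import conduit

actions = conduit.Node()
add_act = actions.append()
add_act["action"] = "save_info"
# pass `actions` to Ascent's execute as usual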
@cyrush
Thanks a lot for the help you're providing.
After adding save_info, the last .yaml produced by Ascent before the hang is attached:
out_ascent.txt
Hope this is useful. Even with exceptions forwarding active, nothing happens and no errors are thrown. (On 1 rank locally, exceptions are correctly forwarded.)
@cyrush An update on this: the hang manifests only in multi-node runs. Any idea where to look? Unfortunately I don't have the opportunity to test elsewhere.
@FrankFrank9
Sorry this mystery continues. I see some NaNs in some of our camera info outputs, but I don't think that would be the source of a hang.
When it hangs, do you get any of the three images you are trying to render? (Trying to narrow down where to look.) The contour of the q-criterion is the most complex pipeline.
Can you share how many MPI tasks + job nodes?
Can we coach you through writing a set of HDF5 files out via an Ascent extract, to see if we can reproduce?
@cyrush
Can you share how many MPI tasks + job nodes?
It is 80 MPI tasks over 10 nodes. But this happens whenever I use more than 1 node.
When it hangs, do you get any of the three images you are trying to render? (Trying to narrow down where to look.) The contour of the q-criterion is the most complex pipeline.
To be honest, it seems more or less random, but I noticed it is less frequent for scenes where only one render is called. Is there any blocking MPI operation when multiple renders are triggered on a scene?
Can we coach you through writing a set of HDF5 files out via an Ascent extract, to see if we can reproduce?
Yes, sure. Let me know.
@FrankFrank9
Here is an example ascent_actions.yaml for generating an extract of the data.
-
  action: "add_extracts"
  extracts:
    e1:
      type: "relay"
      params:
        path: "your_name_for_extract"
        protocol: "blueprint/mesh/hdf5"
This should generate a root file and a folder of HDF5 files (or just the root file if the data is small enough). Then we can hopefully use this extract to replicate your error.
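The same extract can also be added from the Python side via a conduit actions node, following the same pattern as the save_info snippet above (a sketch; "your_name_for_extract" is just a placeholder):

import conduit

actions = conduit.Node()
add_ext = actions.append()
add_ext["action"] = "add_extracts"
add_ext["extracts/e1/type"] = "relay"
add_ext["extracts/e1/params/path"] = "your_name_for_extract"
add_ext["extracts/e1/params/protocol"] = "blueprint/mesh/hdf5"
# then pass `actions` to Ascent's execute as usual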
Hello,
I have an issue where Ascent hangs my simulation when running with MPI on multiple cluster nodes.
I compile with:
env enable_mpi=ON ./build_ascent.sh
Has this ever happened before? Do you have any recommendations?
Best