dnarayanan / powderday

powderday dust radiative transfer
BSD 3-Clause "New" or "Revised" License

Enzo segfaults with Hyperion #98

Open anchwr opened 3 years ago

anchwr commented 3 years ago

I'm a new user and hoping to use powderday on simulations run with Enzo. I was able to run powderday on the example gadget Milky Way zoom and to reproduce the plots from the documentation, but the code segfaults whenever I try to run it on EnzoDisk. This seems to be happening not long after hyperion starts up - I noticed there's already an open issue regarding some incompatibility between Enzo, hyperion, and yt4.x, so perhaps this is part of that? This is the output from pd_front_end.py after hyperion starts up:

` Hyperion v0.9.1
Started on 21 September 2020 at 11:04:31
Input: /Users/anna/Documents/FOGGIEstuff/powderday/outputs/example.0030.rtin.sed
Output: /Users/anna/Documents/FOGGIEstuff/powderday/outputs/example.0030.rtout.sed

[main] using random seed = -5515
[dust] reading dust_001
[setup_grid_geometry] Reading AMR cartesian grid
[grid_physics] reading density grid
[grid_physics] applying mask to density grid
[grid_physics] reading minimum_specific_energy
[grid_physics] checking energy_abs
WARNING: specific_energy below minimum allowed in some cells - resetting [update_energy_abs]
[grid_physics] updating energy_abs_tot
[sources] setting up sources
[main] starting Lucy iteration 1
[grid_physics] pre-computing jnu_var

    # Photons    CPU time (sec)    Photons/sec  
  ----------------------------------------------

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

#0  0x106881302
#1  0x106881ace
#2  0x7fff6bbd4b5c
#3  0x106096fdc
#4  0x106096d63
#5  0x106096e25
#6  0x106096b26
#7  0x1060acdc1
#8  0x1060de04e
#9  0x1060e1722
#10  0x1060e231a

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 2747 RUNNING AT Raisa.local
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Run did not complete successfully: output file appears to be corrupt
An error occurred, and the run did not complete `

The hash I'm using is 03e32b134ed6631c5dd495c7b5d4ec91eb4d3323. Here are my parameter files: http://paste.yt-project.org/show/277/ & http://paste.yt-project.org/show/278/ (I started from the example parameter files in the gadget/mw_zoom folder and changed only a few parameters, so the parameter files are definitely another potential source of the problem).

Thank you for your help!

dnarayanan commented 3 years ago

oh no - thanks for posting the issue! our enzo development is the newest for sure (and i'll loop @jwise77 and @snigdaa into this conversation so they can follow along, and chime in if they've seen any of this before), so it's totally possible that there are issues we haven't run into yet.

the first thing i'll ask is - have you had a chance to check to see if you can get the example in tests/SKIRT/enzo_disk to run?

https://github.com/dnarayanan/powderday/tree/master/tests/SKIRT/enzo_disk

this may help. if this also segfaults, then the next step to try would be to revert your yt to the last stable yt3.x hash by doing (in your yt source directory):

git checkout last_yt3

and recompile and try again. unfortunately i can't remember what the issue was with yt4.x and enzo (and my description in that issue is lacking detail...sigh), but i'll try to understand that as well.

anyway, please do post an update on whether either of these steps moves the issue forward at all.

anchwr commented 3 years ago

Thank you so much! I hadn't thought to check the SKIRT directory for example Enzo parameter files. The SKIRT test ran fine! It looks like grid positions are specified in code units for the Enzo model file and physical units for the gizmo model file - is that right? If so, that would probably explain why my earlier runs were segfaulting!

dnarayanan commented 3 years ago

this is a good reason for me to nuke the examples directory. at best it doesn't provide any new information, and at worst it sends users down the wrong path! it's an artifact of an older version of the code and i'll remove it.

in the meantime though - all positions are, in principle, supposed to be in code units! please do alert me if you find a case in which this doesn't appear to be true.

i'll also leave this issue open until you can confirm that the code runs okay on your FOGGIE simulations, as i'm eager to ensure that everything behaves as it should!

anchwr commented 3 years ago

That makes sense. Thank you - I really appreciate the help! I tried running with one of the FOGGIE galaxies today and encountered a different error when Hyperion started up:

Hyperion v0.9.1
Started on 22 September 2020 at 22:05:17
Input: /Users/anna/Documents/FOGGIEstuff/powderday/outputs/Tempest.0042.rtin.sed
Output: /Users/anna/Documents/FOGGIEstuff/powderday/outputs/Tempest.0042.rtout.sed

[main] using random seed = -6419
[dust] reading dust_001
[setup_grid_geometry] Reading AMR cartesian grid

ERROR : Grids 1 and 47 in level 12 have edges that are not separated by an integer number of cells in the z direction
WHERE : setup_grid_geometry

Execution aborted on 22 September 2020 at 22:06:41

Run did not complete successfully: output file appears to be corrupt
An error occurred, and the run did not complete

The hash I'm using is 028051e5859b112961346cfddf675cacdfd0664b. Here are my parameter files: http://paste.yt-project.org/show/280/ & http://paste.yt-project.org/show/281/ (this time based on the enzo parameter files in tests/SKIRT/enzo_disk).

dnarayanan commented 3 years ago

Thanks for sending this - I've never seen quite this error before. I was wondering if you could please share two things:

[1] the entire output (i.e. not just the last bit)
[2] would it be possible at all to share the actual HDF5 simulation somewhere? one possibility is to upload it via:

yt upload filename

if you don't want to share publicly, you could email me a link at desika.narayanan at gmail . One thing I'm wondering is if there's some issue with the box size and the center of the simulation.

jwise77 commented 3 years ago

@snigdaa was seeing the same error message in a cosmological simulation, but only for a particular halo and not the rest. My best guess is that there's a round-off issue with the grid boundaries, but I don't know where that'd happen.

snigdaa commented 3 years ago

Yes, I did! I can send you that output file as well if you would like. I haven't been able to resolve the grid separation issue.

dnarayanan commented 3 years ago

Thanks @jwise77 and @snigdaa ! Yes @snigdaa if you can please send an output that would be awesome. At the least, it would be good to have a record here to see if this issue crops up very often, and if the errors are always the same.

anchwr commented 3 years ago

It's a relief that someone else has seen this error! I'm in the process of uploading the snapshot, but, in the meantime, here's the full output from powderday: http://paste.yt-project.org/show/282/ . I'll see if I can try another halo.

anchwr commented 3 years ago

I ran powderday on a slightly earlier snapshot of the same halo and encountered the grid separation error again (although the grid numbers listed were not the same). It ran just fine on a snapshot from a lower resolution run of the halo, though!

dnarayanan commented 3 years ago

oh interesting - that is a clue for sure. i'm having trouble getting all the files in a single download from the drive link you shared - i hate to ask, but do you mind zipping or tarring all the files into a single file? that might make download/transfer easier.

anchwr commented 3 years ago

Sure! Sorry - I should have done that to begin with. I just sent you a link.

anchwr commented 3 years ago

I came across this bug again when running some earlier timesteps of a simulation for which the z=0 timestep works, so I decided to take a closer look at it. As the error message suggests, it's happening when hyperion sanity checks the grid that's been read in and I agree with @jwise77 that it's a round-off error. The tolerance level for a pass is <1e-8 and the simulation that I originally saw the bug in fails with 1.08e-8. However, all of the variables and constants that are being used in this calculation are already double precision. We could potentially fix this by switching the grid coordinates over to quad precision, but, while the sanity checker itself is in Fortran, the conversion of the grid coordinates from code units to cm (which is what the sanity checker uses) happens in python and I'm unsure of how to get that level of precision in python without installing something like bigfloat.
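One way to probe where the round-off enters, without installing anything beyond the standard library, is to repeat the alignment arithmetic with exact rationals. The following is only a sketch under stated assumptions: the function names, the form of the check (edge separation divided by cell width), and the sample conversion factor are all hypothetical and are not taken from hyperion or powderday.

```python
from fractions import Fraction


def to_cm_exact(code_value, cm_per_code_unit):
    """Convert a code-unit value to cm with no rounding.

    Fraction(float) is exact, because every finite double is a rational number.
    """
    return Fraction(code_value) * Fraction(cm_per_code_unit)


def cell_offset_exact(edge_a, edge_b, cell_width, cm_per_code_unit):
    """Distance (in cells) of the edge separation from a whole number of cells."""
    a = to_cm_exact(edge_a, cm_per_code_unit)
    b = to_cm_exact(edge_b, cm_per_code_unit)
    w = to_cm_exact(cell_width, cm_per_code_unit)
    n_cells = (b - a) / w
    return abs(n_cells - round(n_cells))


def cell_offset_float(edge_a, edge_b, cell_width, cm_per_code_unit):
    """Same quantity, computed in ordinary double precision for comparison."""
    a = edge_a * cm_per_code_unit
    b = edge_b * cm_per_code_unit
    w = cell_width * cm_per_code_unit
    n_cells = (b - a) / w
    return abs(n_cells - round(n_cells))


if __name__ == "__main__":
    cm_per_code_unit = 3.0857e24  # hypothetical conversion factor
    width = 2.0 ** -12            # hypothetical cell width in code units
    left = 0.25
    right = left + 47 * width     # exactly 47 cells away in code units
    print("exact :", cell_offset_exact(left, right, width, cm_per_code_unit))
    print("double:", float(cell_offset_float(left, right, width, cm_per_code_unit)))
```

If the exact version reports zero while the double-precision version does not, the mismatch comes from the unit conversion rather than from the simulation grid itself. Note that this only helps with diagnosis: the values written to the hyperion input file would presumably still be stored as doubles.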

On an individual user level, this bug can be worked around by increasing the tolerance. The round-off error seems to get worse with progressively more refined grids, so, for now, I've changed the tolerance (which appears on line 190 of grid_geometry_amr.f90) in my local copy of hyperion to 1e-5.
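To make the effect of the tolerance concrete, here is a minimal Python sketch of an "integer number of cells" test. It is an illustration only, not a transcription of the Fortran in grid_geometry_amr.f90: the function names are hypothetical, and it assumes the mismatch is measured as a fraction of a cell.

```python
def cell_offset_error(edge_a, edge_b, cell_width):
    """Return how far the edge separation is from a whole number of cells.

    The result is in units of cells; 0.0 means the edges line up perfectly.
    """
    n_cells = (edge_b - edge_a) / cell_width
    return abs(n_cells - round(n_cells))


def edges_aligned(edge_a, edge_b, cell_width, tolerance=1e-8):
    """True if the two edges are separated by an integer number of cells,
    to within the given tolerance (assumed here to be a fraction of a cell)."""
    return cell_offset_error(edge_a, edge_b, cell_width) < tolerance


if __name__ == "__main__":
    # A separation that is off by about 1.08e-8 of a cell, the size of
    # mismatch reported above.
    left, right, width = 0.0, 47.0 + 1.08e-8, 1.0
    print(edges_aligned(left, right, width))                  # strict tolerance
    print(edges_aligned(left, right, width, tolerance=1e-5))  # relaxed tolerance
```

Under these assumptions, a mismatch of about 1.08e-8 is rejected at a tolerance of 1e-8 and accepted at 1e-5, which matches the behaviour described above.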

However, for a grid that's being read in from a simulation that's already been run, this (and the other grid sanity checks) is unnecessary. If the grids didn't line up correctly, the simulation wouldn't have run. Accordingly, it seems like a more appropriate fix might be to add some kind of flag to the call to run the model that causes these tests to be skipped if a grid has been read in from a simulation.

In any case, I don't think it's possible to fix this entirely from the powderday side. Do you have any thoughts on what the best course of action would be?

Side note: in order to do some debugging and change the tolerance, I had to install hyperion from source so that I could access the Fortran code and I noticed that the hyperion github that's linked to in the powderday installation documentation is an old version of the repo (https://github.com/astrofrog/hyperion.git rather than the newer https://github.com/hyperion-rt/hyperion.git) that lacks a number of the functions powderday uses, like to/from_yt.