dnarayanan / powderday

powderday dust radiative transfer
BSD 3-Clause "New" or "Revised" License

Interpolating Issue #146

Open ACDylan opened 3 years ago

ACDylan commented 3 years ago

Hi - I have my high-resolution "mother simulation"; however, when I ran a snapshot with powderday, it was still running after 3 days. After canceling it, the job script gives me:

[two screenshots of the job script output attached]

The second image shows where the simulation stopped.

Is it because of a parameter?

dnarayanan commented 3 years ago

can you run powderday on this snapshot interactively? does it hang at some point if you do?

ACDylan commented 3 years ago

By 'interactively', do you mean running from the terminal console rather than as a job? If so, it blocked my terminal after the first line Interpolating (scatter) SPH field PartType0: 0it [00:00, ?it/s], which kept running indefinitely.

dnarayanan commented 3 years ago

hmm interesting. how many particles are in the snapshot? this seems to be hanging in yt (though I've never seen it take 3 days to deposit the octree before).

In a terminal, how long does this take to finish running (i.e., does it ever finish)?

import yt
# snapshotname is a placeholder for the path to the HDF5 snapshot;
# listing the derived fields forces yt to build its particle index
ds = yt.load(snapshotname)
ad = ds.derived_field_list

ACDylan commented 3 years ago

PartType0: 13,870,234
PartType1: 10,000,000
PartType2: 10,000,000
PartType3: 1,250,000
PartType4: 1,584,425

>>> ad = ds.derived_field_list
yt : [INFO     ] 2021-09-27 22:18:46,988 Allocating for 3.670e+07 particles
yt : [INFO     ] 2021-09-27 22:18:46,988 Bounding box cannot be inferred from metadata, reading particle positions to infer bounding box
yt : [INFO     ] 2021-09-27 22:18:50,997 Load this dataset with bounding_box=[[-610.44433594 -612.21533203 -614.03771973], [616.07244873 612.08428955 614.15777588]] to avoid I/O overhead from inferring bounding_box.
Loading particle index: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:00<00:00, 371.52it/s]

It takes around a second to load. I can try running a simulation in a terminal again.

Edit: Maybe this is coming from yt : [INFO ] 2021-09-20 22:32:40,241 Octree bound 31193650 particles

I don't know why there are so many particles. At least, the gizmo snapshot simulations have around 1 million octree particles, whereas here it is 31M.
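
Two quick yt-side checks are possible here (a sketch, not from the thread: the snapshot path is a placeholder, and the bounding-box corners are simply copied from yt's INFO line above):

import yt

# placeholder path for the snapshot under discussion
ds = yt.load("snapshot.hdf5")

# per-particle-type counts, to compare against the numbers listed above
print(ds.particle_type_counts)

# reloading with the bounding box from the INFO line above skips the extra
# pass over the particle positions that yt makes to infer the domain extent
bbox = [[-610.44433594, -612.21533203, -614.03771973],
        [616.07244873, 612.08428955, 614.15777588]]
ds = yt.load("snapshot.hdf5", bounding_box=bbox)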

ACDylan commented 3 years ago

My lab gave me a zoom-in simulation (the previous simulation is still running; I have increased the number of cores), and as you can see, the interpolation also takes a lot of time.

[screenshot of the interpolation progress attached]

I'll keep you informed!

dnarayanan commented 2 years ago

are there any updates for this, or shall I close the issue?

aussing commented 2 years ago

Hi @ACDylan and @dnarayanan, I'm trying to run Powderday on Gadget-4 HDF5 snapshots and I've run into the same issue. Was there a solution for this?

dnarayanan commented 2 years ago

Hi - hmmm, no, I never heard from @ACDylan again, so I'm not sure what the issue is.

@aussing do you have a snapshot that you can easily share so that I can play with it and see if I can get to the bottom of this? also please let me know what powderday and yt hash you're on.

thanks!

aussing commented 2 years ago

Here is a dropbox link to the snapshot file: https://www.dropbox.com/s/54d8hlu54ojf16d/snapshot_026.hdf5?dl=0 It's 5.7GB, but I can find a smaller snapshot file if need be.

The Powderday hash is 2395ae703e9952111bc99542f0cd14a18590fd50. I installed yt through conda; I'm using version 4.0.5 and the build number is py38h47df419_0. To get a hash I used conda list --explicit --md5, which returned df416a6d0cabb9cc483212f16467e516.
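
For reference, a minimal sketch of one way to report the same version information locally (the powderday checkout path is a placeholder):

import subprocess
import yt

# yt version, e.g. 4.0.5
print("yt", yt.__version__)

# git commit hash of the local powderday checkout (placeholder path)
print("powderday", subprocess.check_output(
    ["git", "rev-parse", "HEAD"], cwd="/path/to/powderday", text=True).strip())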

aussing commented 2 years ago

Hi @dnarayanan, I've discovered something that may or may not be related: running Powderday on our HPC system with Slurm only uses 1 CPU, even when I requested 16 and specified 16 in the parameters_master file.

dnarayanan commented 2 years ago

Hi - I'm guessing that this actually has to do with how this is being called on your specific system.

are you setting 16 as n_processes or n_MPI_processes? it looks like it's getting stuck in a pool.map stage, which would correspond to the former.
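
For reference, a minimal sketch of how those two settings look in parameters_master.py, using the value 16 reported in this thread (the comments paraphrase the distinction drawn above and are not powderday documentation):

# parameters_master.py (sketch; 16 is the value reported in this thread)
n_processes = 16       # size of the multiprocessing pool used in the
                       # pool.map grid-construction/interpolation stage
n_MPI_processes = 16   # MPI processes handed to the Hyperion radiative
                       # transfer stage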

aussing commented 2 years ago

Both were set to 16

dnarayanan commented 2 years ago

Hi,

I wonder if the issue is actually how you're calling the slurm job. Here's an example for a job where I'm calling 32 pool, 32 MPI:

#! /bin/bash
#SBATCH --account narayanan
#SBATCH --qos narayanan-b
#SBATCH --job-name=smc
#SBATCH --output=pd.o
#SBATCH --error=pd.e
#SBATCH --mail-type=ALL
#SBATCH --mail-user=desika.narayanan@gmail.com
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=32
#SBATCH --mem-per-cpu=7500
#SBATCH --partition=hpg-default

You may want to contact your sysadmin to find out the best Slurm configuration and see whether this can be resolved on your HPC's side.

aussing commented 2 years ago

Hi @dnarayanan, I'm still not sure why the code is only running on one CPU, but as for the original interpolating issue, I solved it by setting n_ref to 256 instead of the default 32.

I ran into a separate issue where I got 'WARNING: photon exceeded maximum number of interactions - killing [do_lucy]' in the pd.o file, but I'm able to get around that by setting SED = False.

Edit: the photon-interaction warning still seems to come up with several different parameters turned on while keeping SED = False; I'm trying to track that down at the moment. Also setting Imaging = False works around it.
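
Summarizing the workarounds reported above as a sketch of the corresponding parameters_master.py entries (parameter spellings are as written in this thread; check them against your own parameters file):

# parameters_master.py (sketch of the workarounds reported in this thread)
n_ref = 256      # raised from the default 32; resolved the interpolation hang
SED = False      # avoids the "photon exceeded maximum number of
                 # interactions - killing [do_lucy]" warning for this user
Imaging = False  # also needed with some parameter combinations, per the edit above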