Open · ACDylan opened this issue 3 years ago
can you run powderday on this snapshot interactively? does it hang at some point if you do?
By 'interactively', do you mean running in the terminal console rather than as a job?
If so, it blocked my terminal after the first line
Interpolating (scatter) SPH field PartType0: 0it [00:00, ?it/s]
which then kept running indefinitely.
hmm interesting. how many particles are in the snapshot? this seems to be hanging in yt (though I've never seen it take 3 days to deposit the octree before).
in a terminal, how long does this take to finish running (i.e. does it ever finish)?
import yt
ds = yt.load(snapshotname)  # snapshotname = path to the snapshot file
ad = ds.derived_field_list
PartType0: 13,870,234
PartType1: 10,000,000
PartType2: 10,000,000
PartType3: 1,250,000
PartType4: 1,584,425
>>> ad = ds.derived_field_list
yt : [INFO ] 2021-09-27 22:18:46,988 Allocating for 3.670e+07 particles
yt : [INFO ] 2021-09-27 22:18:46,988 Bounding box cannot be inferred from metadata, reading particle positions to infer bounding box
yt : [INFO ] 2021-09-27 22:18:50,997 Load this dataset with bounding_box=[[-610.44433594 -612.21533203 -614.03771973], [616.07244873 612.08428955 614.15777588]] to avoid I/O overhead from inferring bounding_box.
Loading particle index: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:00<00:00, 371.52it/s]
It takes around a second to load. I can try running the simulation in a terminal again.
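Side note: following the INFO hint above, a minimal sketch of reloading with an explicit bounding box (the numbers are copied straight from the log message, and snapshotname is the same placeholder as in the snippet above):
import yt
# bounding box suggested by yt's INFO message (min corner, max corner)
bbox = [[-610.44433594, -612.21533203, -614.03771973],
        [616.07244873, 612.08428955, 614.15777588]]
ds = yt.load(snapshotname, bounding_box=bbox)  # avoids re-reading particle positions to infer the box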
Edit: Maybe this is coming from
yt : [INFO ] 2021-09-20 22:32:40,241 Octree bound 31193650 particles
I don't know why there are so many particles. The GIZMO snapshot simulations bind around 1 million octree particles, whereas here it is 31M.
My lab gave me a zoom-in simulation (the previous simulation is still running; I have increased the number of cores), and as you can see, the interpolation also takes a long time.
I'll keep you informed!
are there any updates for this, or shall I close the issue?
Hi @ACDylan and @dnarayanan, I'm trying to run Powderday on Gadget-4 HDF5 snapshots and I've got the same issue, was there a solution for this?
Hi - hmmm no I never heard from @ACDylan again so I'm not sure what the issue is.
@aussing do you have a snapshot that you can easily share so that I can play with it and see if I can get to the bottom of this? also please let me know what powderday and yt hash you're on.
thanks!
Here is a dropbox link to the snapshot file: https://www.dropbox.com/s/54d8hlu54ojf16d/snapshot_026.hdf5?dl=0 It's 5.7GB, but I can find a smaller snapshot file if need be.
The Powderday hash is 2395ae703e9952111bc99542f0cd14a18590fd50.
I installed yt through conda; I'm using version 4.0.5, build py38h47df419_0. To get a hash I ran conda list --explicit --md5,
which returned df416a6d0cabb9cc483212f16467e516
Hi @dnarayanan, I've discovered something that may or may not be related: running Powderday on our HPC system with Slurm only uses 1 CPU, even when I requested 16 and specified 16 in the parameters_master file.
Hi - I'm guessing that this actually has to do with how this is being called on your specific system.
Are you setting 16 as n_processes or n_MPI_processes? It looks like it's getting stuck in a pool.map stage, which would correspond to the former.
Both were set to 16
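For reference, a minimal sketch of what that looks like in the parameters_master file (only the two names discussed in this thread; the surrounding layout is an assumption):
n_processes = 16       # size of the pool used in powderday's pool.map stage
n_MPI_processes = 16   # number of MPI processes (the other setting asked about above)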
Hi,
I wonder if the issue is actually how you're calling the slurm job. Here's an example of a job where I'm calling 32 pool, 32 MPI:
#! /bin/bash
#SBATCH --account narayanan
#SBATCH --qos narayanan-b
#SBATCH --job-name=smc
#SBATCH --output=pd.o
#SBATCH --error=pd.e
#SBATCH --mail-type=ALL
#SBATCH --mail-user=desika.narayanan@gmail.com
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=32
#SBATCH --mem-per-cpu=7500
#SBATCH --partition=hpg-default
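(Only the SBATCH header is shown above; the actual powderday launch line would follow it. A hedged sketch of the sort of command that typically comes next, with the run directory and parameter-file names as placeholders:)
cd $SLURM_SUBMIT_DIR
# placeholder launch line -- substitute your own run directory and parameter files
python pd_front_end.py /path/to/run_dir parameters_master parameters_model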
you may want to contact your sysadmin to find out the best slurm configuration to see if this can be resolved on the side of your HPC.
Hi @dnarayanan, I'm still not sure why the code is only running on one CPU, but as for the original interpolating issue, I solved it by setting n_ref to 256 instead of the default 32.
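For anyone hitting the same hang, that change is a one-liner; a sketch, assuming n_ref lives in the parameters_master file like the other settings discussed above:
n_ref = 256   # yt refines an oct only once it holds more than n_ref particles, so a larger value builds a coarser octree much faster (default 32)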
I ran into a separate issue where I got 'WARNING: photon exceeded maximum number of interactions - killing [do_lucy]' in the pd.o file, but I'm able to get around that by setting SED = False.
Edit: the photon-interaction warning seems to come up with several different parameters turned on even while keeping SED = False; I'm trying to track that down at the moment -> also setting Imaging = False.
Hi - I have my high-resolution "mother simulation"; however, when I ran a snapshot with powderday, it was still running after 3 days. When I canceled it, the job script gave me the output shown in the attached images; the second image shows where the simulation stopped.
Is it because of a parameter?