abacusorg / abacusutils

Python code to interface with halo catalogs and other Abacus N-body data products
https://abacusutils.readthedocs.io
GNU General Public License v3.0

Problem while trying to run the short example of AbacusHOD #144

Open MinaEnceladus opened 3 months ago

MinaEnceladus commented 3 months ago

Hi,

I'm running AbacusHOD through the new BinderHub.

First, I tried to run the first step of the process: running the prepare_sim code for z = 0.500.

The first time, it took a few hours to reach slab number 33, producing two output files: halos_xcom_32_seed600_abacushod_oldfenv_new.h5 and particles_xcom_32_seed600_abacushod_oldfenv_new.h5

The next time, it reached slab 31, producing: halos_xcom_30_seed600_abacushod_oldfenv_new.h5 and particles_xcom_30_seed600_abacushod_oldfenv_new.h5

I also repeated for z = 0.200 and 0.100.

Now, when I run the short example, I receive this error:

FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '.../output/subsamples/AbacusSummit_base_c000_ph000/z0.100/halos_xcom_0_seed600_abacushod_oldfenv_new.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

It also creates empty folders for galaxies in the output directory: .../output/galaxies/AbacusSummit_base_c000_ph000/z0.500

lgarrison commented 3 months ago

Does the halos_xcom_0_seed600_abacushod_oldfenv_new.h5 file exist somewhere? @SandyYuan can confirm, but I think that file should be produced by prepare_sim. There should probably be files named halos_xcom_0_... through halos_xcom_33_.... If not, it may mean that prepare_sim didn't run correctly or ran out of memory.

epaillas commented 3 months ago

If it is of any help, I had similar issues when trying to run prepare_sim for z = 0.5 periodic boxes a few weeks ago. The problem was that the script was configured to load 3 slabs in parallel, which required too much memory, so it did not correctly generate the output files (as Lehman says, it should generate 34 files, halos_xcom_i..., with i running from 0 to 33).

I set

prepare_sim:
    Nparallel_load: 2

in the yaml configuration file, which brought the memory consumption down to something manageable on NERSC and solved the problem. I'm not sure what number will be adequate for the cluster you're using.

(I'm having similar issues with the lightcone mocks as we are discussing in the other thread, but in that case even Nparallel_load: 1 won't do the trick. However, for periodic boxes I found tweaking this parameter was enough).

MinaEnceladus commented 3 months ago

Thanks, @lgarrison and @epaillas.

You're right. It appears that the system ran out of memory. I don't have access to NERSC or any other cluster. I used the binder, and I only have 128 GB of memory.

I also checked z = 0.100 once with Nparallel_load: 2 and again with Nparallel_load: 1. In the second attempt, after more than 4 hours, only a few files were produced (slabs 0, 9, 18, and 27).

lgarrison commented 3 months ago

I wonder if there could be a CPU problem, too. Binder is a bit strange in that applications see 96 cores, but they're really sharing 4 (via cgroups). You might want to set nthreads = 4 here: https://github.com/abacusorg/abacusutils/blob/6f8098cdfd7f9eb558eac13e13beba06c8696e65/abacusnbody/hod/prepare_sim.py#L1098 (We should make this a parameter; I'll open an issue.)
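The cgroups mismatch above can be checked directly: os.cpu_count() reports the hardware cores, while the container's actual CPU quota lives in the cgroup filesystem. A minimal sketch, assuming cgroup v2 (the cpu.max path is the standard v2 location; the fallback covers other setups):

```python
import os

def effective_cpu_count():
    """Best-effort CPU count that respects a cgroup v2 CPU quota.

    Falls back to os.cpu_count() when no quota is set or readable.
    """
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        if quota != "max":
            # e.g. "400000 100000" means a quota of 4 CPUs
            return max(1, int(quota) // int(period))
    except (OSError, ValueError):
        pass
    return os.cpu_count()
```

On a Binder node as described above, os.cpu_count() would report 96 while the quota works out to 4, which is the value to use for nthreads.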

If memory is the problem, though, then this might not help. The base simulations are big, unfortunately! You might want to try a smaller simulation if your application allows. hugebase is often a good place to start, because it's the same volume but lower mass resolution.

MinaEnceladus commented 3 months ago

Thanks @lgarrison. I've managed to run prepare_sim and the short example successfully with the following settings:

sim_name: 'AbacusSummit_hugebase_c000_ph000'
z_mock: 0.100
Nparallel_load: 1