ClusterMonteCarlo / CMC-COSMIC


Running CMC-COSMIC on Ubuntu #40

Open Jay13inspace opened 2 years ago

Jay13inspace commented 2 years ago

Hi,

I am working with Jan (@JJEldridge) on getting the CMC-COSMIC code running. I was originally receiving the same error as Jan (https://github.com/COSMIC-PopSynth/COSMIC/issues/552), but managed to fix it by moving the body of the dynamics_apply() function into the main file where it is called. This fixed the original segmentation fault I was receiving while running the Plummer Sphere simulation. Unfortunately, I have not found a way to keep the function outside the main file, and have been running all the current simulations with it in there.

I am able to run the King Profile simulation with 60,000 stars but I get the segmentation fault again when attempting to run with 70,000 stars.

I am running this on Ubuntu 20.04 with 8GB memory, 4GB swap memory, and 8 cores.

Is there any way of getting this to run on this system without inserting the dynamics_apply() function into the main file? Or does it require a more powerful computer?

I am running the program with `mpirun -np 4 ../CMC/bin/cmc KingProfile.ini king`.

Thanks, Jason Sampson

carlrodriguez commented 2 years ago

Hi Jason,

The compilation working with dynamics_apply in the main file but not in cmc_dynamics.c is extremely odd to me. This sounds like some sort of linker issue. We haven't tested the code on an Ubuntu Linux distribution (just the Red Hat variants that many HPC systems use, and macOS).

What C compiler are you using with CMake?

As for the segfault, this could be a memory issue. Can you try reducing the number of cores (to -np 2) and see if you can run with 70,000 stars?

Carl

Jay13inspace commented 2 years ago

Hi Carl,

I realize I didn't explain myself very well earlier. I had to put the body of the dynamics_apply() function directly into the main function, at the point where it was being called:

```c
if (PERTURB > 0) {
        long j, si, p=AVEKERNEL, N_LIMIT, k, kp, ksin, kbin;
        ...
        break_wide_binaries(curr_st);
}
timeEndSimple(tmpTimeStart, &t_dyn);
```

When I instead moved it into the main file, below the main function, I had the same issue as before. The segfault only went away once the function's code was inserted directly into the main function.

I am using GCC version 4:9.3.0-1ubuntu2 as the compiler with CMake.

I have tried reducing the number of cores for the 70,000-star run, and it still segfaults immediately.

Thanks, Jason Sampson

carlrodriguez commented 2 years ago

Hi Jason,

OK, this is an unusual problem, which to me sounds like a linker issue. What version of CMake are you using, and if it's not a recent one, can you try updating it?

Carl

zhang2023-byte commented 9 months ago

I ran into the same problem when trying to run CMC on Ubuntu. The errors from my program are as follows:

```
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.

mpirun noticed that process rank 2 with PID 0 on node Savior
exited on signal 11 (Segmentation fault).
```

I don't understand why the original poster's change fixed the problem. Is this an issue with the Ubuntu system? How can it be fixed?

giulianoiorio commented 2 months ago

Hello, I am experiencing a similar problem using the Docker image of CMC directly. When I try to run the Plummer and King examples, I trigger a segfault with both the serial version of CMC and the parallel one (with the number of processes ranging from 2 to 4).

@carlrodriguez I wonder if you can replicate this error using the Docker images in one of your machines

zhang2023-byte commented 2 months ago

Hi, I solved my issue this way: https://github.com/ClusterMonteCarlo/CMC-COSMIC/issues/54#issuecomment-2025220333. I suspect it might help you too.

carlrodriguez commented 2 months ago

Thank you @zhang2023-byte for providing that! Yes, right now most large runs need to be launched with `ulimit -s unlimited`, along with the Linux command mentioned in that comment. This is because the memory for the HDF5 files is currently allocated on the stack instead of the heap (where it would not run into the kernel's stack-size limit).
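For anyone landing here, the workaround can be sketched as a shell session like this (the `mpirun` line simply reuses the command quoted earlier in this thread; adjust paths and process count to your own setup):

```shell
ulimit -s                # show the current soft stack limit, often 8192 (KB)
ulimit -s unlimited      # lift the limit for this shell and its child processes
ulimit -s                # should now print: unlimited

# then launch the run from the same shell, e.g.:
# mpirun -np 4 ../CMC/bin/cmc KingProfile.ini king
```

Note that `ulimit` only affects the current shell session, so it has to be run (or put in your shell profile or job script) before every run.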

I'm hoping to take a look at that in the next week and issue a PR to fix it.

FYI, the Docker image has not been updated in quite some time, and we may drop support for it soon. However I will also look at that and see if we can update it for the time being.

giulianoiorio commented 2 months ago

Thanks @zhang2023-byte for the suggestion; this fixes the issue.

@carlrodriguez this confirms that it was a general issue, not something specific to the Docker image. By the way, I think the Docker image could be very useful, and it does not require frequent updates: the current one already includes everything needed to install and use CMC, and the most up-to-date version of the code can easily be installed in the container by pulling it from the git repository. Thank you!