cnr-ibf-pa / hbp-bsp-issues

Ticketing system for developers/testers and power users of the Brain Simulation Platform of the Human Brain Project

Jureca - no space left on device #531

Closed: antonelepfl closed this issue 4 years ago

antonelepfl commented 4 years ago

While running some simulations on Jureca using the cvsk25 account and the booster partition, I get:

<PSP:r0000395:shmget(0, sizeof(shm_com_t), IPC_CREAT | 0777) : No space left on device>

on stdout.

Job ID: ed2bd5a9-5181-43fd-868c-a81e7d328426 Stderr: stderr.txt
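
For context: when `shmget` fails with "No space left on device" (ENOSPC), it means a System V shared-memory limit was hit (all segment IDs in use, or the system-wide shared-memory total exceeded), not that a disk is full. A minimal sketch for checking those limits on a compute node, assuming a Linux system that exposes the usual `/proc/sys/kernel` files; the function name `show_shm_limits` is hypothetical:

```shell
#!/bin/sh
# Print the System V shared-memory limits that shmget() is checked against.
# Assumes a Linux node exposing /proc/sys/kernel; prints "unknown" otherwise.
show_shm_limits() {
    for name in shmmni shmall shmmax; do
        file="/proc/sys/kernel/$name"
        if [ -r "$file" ]; then
            echo "$name=$(cat "$file")"
        else
            echo "$name=unknown"
        fi
    done
}

show_shm_limits
# Where util-linux is installed, `ipcs -m` additionally lists the
# shared-memory segments that are currently allocated on the node.
```

Comparing the number of allocated segments with `shmmni` while a job is running would show whether the tasks on a node are exhausting the available segment IDs.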

BerndSchuller commented 4 years ago

I can't help you with this. Please contact SC support at JSC (Tel. +49 2461 - 61 2828, e-mail: sc@fz-juelich.de): https://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/support_node.html

ElisabettaGiacalone commented 4 years ago

Hi @antonelepfl, can you try again? I removed all the "coreneuron_input" folders and unnecessary files from my simulation folders (both in $SCRATCH_cvsk25 and $PROJECT_cvsk25). @pramodk and Jorge, can you do the same with your test simulations? Thank you.

antonelepfl commented 4 years ago

Here is the reply we got from JSC:

Dear Mr Dietz,
someone of the admins told me:
there were similar issues in the past. The problem occurs because the job spawns too many tasks per node.

As a first workaround the user could try to set `PSP_ONDEMAND=1` and/or `PSP_SHM=0`

Please try the suggested workarounds.

Regards
Inge Gutheil

I'm trying that workaround.
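
The suggested workaround amounts to exporting the two variables before the MPI launch. A minimal sketch, assuming they are set in the batch script ahead of the launch command; the executable name in the final comment is a placeholder and does not come from this thread:

```shell
#!/bin/sh
# Workaround variables quoted in the JSC reply (ParaStation MPI settings).
# Assumption: exporting them in the job script is enough for the launched
# MPI tasks to inherit them.
export PSP_ONDEMAND=1
export PSP_SHM=0

# Show which PSP_* settings the launched processes will inherit.
env | grep '^PSP_' | sort

# The simulation would then be started as usual, e.g. (placeholder command):
# srun ./my_simulation
```

Trying the variables one at a time, as the "and/or" in the reply suggests, would show which of the two is actually needed.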

antonelepfl commented 4 years ago

So the flags apparently get rid of the "No space left on device" error, but now we are getting:

<PSP:r0000644:precon(0x2879b60): write(7, 0x2879a40, 8) : Broken pipe>

I attach an output file: stderr

antonelepfl commented 4 years ago

A job directory where I get this error is /p/scratch/cvsk25/unicore-jobs/d5a8378b-108e-4295-bc9e-378c25626202. The job runs the input.sh file under my user ID, antonel1.

pramodk commented 4 years ago

@antonelepfl: can you change the read permissions on those unicore-jobs directories? I get this every time:

[kumbhar1@jrl04 ~]$ cd /p/scratch/cvsk25/unicore-jobs/d5a8378b-108e-4295-bc9e-378c25626202
-bash: cd: /p/scratch/cvsk25/unicore-jobs/d5a8378b-108e-4295-bc9e-378c25626202: Permission denied
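
One way the directory owner could grant such access, sketched under the assumption that both users belong to the Unix group that owns the directory; `share_with_group` is a hypothetical helper name:

```shell
#!/bin/sh
# Give the owning group read access to a job directory, recursively.
# g+r: read files and list directories.
# g+X: traverse directories (execute is added to regular files only if
#      they are already executable for someone).
share_with_group() {
    chmod -R g+rX "$1"
}

# Example, using the path from the comment above:
# share_with_group /p/scratch/cvsk25/unicore-jobs/d5a8378b-108e-4295-bc9e-378c25626202
```

The parent directories also need to be traversable by the group for `cd` to succeed.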

BerndSchuller commented 4 years ago

Can you give us (or rather, the SC team) a bit more info about what exactly you are running? I can't access the working directory. SC Support mentioned some pip-install script?!

Does the issue also occur if you run the job from an ssh console?

cd /p/scratch/cvsk25/unicore-jobs/d5a8378b-108e-4295-bc9e-378c25626202
sbatch bss_submit_*

pramodk commented 4 years ago

@stefanonardo :

simulation finished.  Gather spikes then clean up.
Final report flush starting!
[DEBUG] Memusage [MB]: Max=65.41, Min=19.52, Mean(Stdev)=21.03(6.36)
Final report flush done!
[DEBUG] Memusage [MB]: Max=49.38, Min=19.52, Mean(Stdev)=20.79(4.81)
[DEBUG] Memusage [MB]: Max=49.38, Min=19.52, Mean(Stdev)=20.79(4.81)
setpvec 11                   finished Run 67.66
Using simulation_launch script version 0.19.0

Aren't those just warnings, with the simulation finishing without any error?

By the way, we are moving away from ParastationMPI because of the recent performance issues that we have seen (and discussed with SC support).

antonelepfl commented 4 years ago

@pramodk sometimes the simulation finishes successfully but I would like not to have the broken pipe anymore.

For the comment about the packages, I was just following the latest stack https://github.com/BlueBrain/spack/wiki/JURECA-Deployment#instructions-for-using-software-stack

pramodk commented 4 years ago

> @pramodk sometimes the simulation finishes successfully but I would like not to have the broken pipe anymore.

Are you seeing that only recently?

> For the comment about the packages, I was just following the latest stack https://github.com/BlueBrain/spack/wiki/JURECA-Deployment#instructions-for-using-software-stack

Yup, that's still up to date. We are updating the modules together with Jorge, where we are replacing ParastationMPI library with IntelMPI. If that works, we don't have to worry about this error.

antonelepfl commented 4 years ago

Thank you @pramodk,

> Are you seeing that only recently?

Yes, I would say since a week ago or so.

> we are replacing ParastationMPI library with IntelMPI

Great! Please keep me updated about this deployment so I can try it out.

antonelepfl commented 4 years ago

@BerndSchuller I ran a simulation with sbatch bss_submit_... as you suggested, and I don't get the errors. The manual run is under /p/scratch/cvsk25/unicore-jobs/327507b8-9bec-41e1-af50-c0f77edcfb2d

When launched with Unicore (with both variables set) I still get the broken pipe, although the simulation continues; see /p/scratch/cvsk25/unicore-jobs/483fb2ea-976f-49c7-8f0c-e4802d85c1e9

BerndSchuller commented 4 years ago

Sorry, I have no idea. The launch via UNICORE might(!) have different settings (e.g. library path), but without additional info and expertise from people who know what the code does, it's really not possible to say why these errors/warnings occur.
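
One way to test the "different settings" hypothesis would be to record the environment in both launch modes and compare the two. A rough sketch; `dump_env` and the output file names are hypothetical:

```shell
#!/bin/sh
# Write the current environment, sorted by variable name, to the given file.
dump_env() {
    env | sort > "$1"
}

# Call `dump_env env_unicore.txt` at the top of the script launched via
# UNICORE and `dump_env env_manual.txt` at the top of the manual sbatch run,
# then compare the two files:
#   diff env_manual.txt env_unicore.txt
# Differences in variables such as LD_LIBRARY_PATH or PSP_* would be the
# first things to look at.
```

Sorting first keeps the diff limited to variables that actually differ rather than ones that merely appear in a different order.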

antonelepfl commented 4 years ago

Using the latest modules for the simulation fixes this issue. Closing.