ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Wexac Cluster at Weizmann #3496

Closed ax3l closed 1 year ago

ax3l commented 3 years ago

Hi,

I got contacted by Dan Levy @danlevy100 about help with setting up PIConGPU on the Wexac cluster at Weizmann Institute of Science (wexac-wis). The cluster has 12 nodes with 8x V100 each (plus some nodes with 4x V100).

The cluster uses LSF as its batch system but does not seem to use jsrun (so probably plain mpiexec). He already got PIConGPU installed via Spack.

This is an interactive startup command for FBPIC:

bsub -J sim_fbpic -o out.%J -e err.%J -q gpu-short -gpu "num=1:mode=shared:j_exclusive=no" -R "rusage[mem=16000]" 'python lwfa_script.py'

Someone please needs to finalize the .tpl template for tbg with him, as well as the picongpu.profile instructions for our manual.

Resources:

cc @PrometheusPi (recently published PIConGPU sims with Dan, maybe you can finalize this?)
cc @hightower8083 (not with Weizmann anymore but might have some hints)

danlevy100 commented 3 years ago

Hi guys, and welcome to my first GitHub comment!

Here's the .tpl file Axel helped me to create: gpu_batch.tpl.txt

Sadly, things are not working yet: I can't get tbg to submit to the user-given queue at the moment.

Thanks in advance for your help!

PrometheusPi commented 3 years ago

@danlevy100 I would be glad to help you set up the configuration for Wexac. Since I am busy until Tuesday evening, I could start looking into this on Wednesday. Would that be fine with you?

ax3l commented 3 years ago

Thank you for taking care of this, @PrometheusPi :+1:

danlevy100 commented 3 years ago

That would be great, @PrometheusPi. Thanks! I'll try to make some progress on my own in the meantime.

ax3l commented 3 years ago

@danlevy100 can you please document the current error message about the memory here?

danlevy100 commented 3 years ago

After submitting the LaserWakefield example with

tbg -s bsub -c etc/picongpu/1.cfg -t etc/picongpu/wexac-wis/gpu_batch.tpl ~/picOutput/LaserWakefield -f

I get:

Memory reservation is (MB): 8192
Memory Limit is (MB): 8192
femalka: No such queue. Job not submitted.

"femalka" is Victor's username in fact... I have no idea why it appears here.

sbastrakov commented 3 years ago

In order to figure it out, it would be helpful to see the resulting submission command after tbg has applied your .tpl file. For the provided tbg command line, there should be a file ~/picOutput/LaserWakefield/tbg/submit.start. If it is there, could you attach it? Among other things, it should contain the plain bsub command, so we can compare it to what @ax3l wrote for FBPIC.

danlevy100 commented 3 years ago

Sure, here it is: submit.start.txt

sbastrakov commented 3 years ago

Thank you, @danlevy100.

So far I see one issue in the gpu_batch.tpl file attached earlier in this thread. On line 30 there is a spurious space in #B SUB (it should be #BSUB). I believe this causes that line and the following #BSUB lines to have no effect, leading to an improper set of parameters. I don't know whether it is the only issue, and I do not have access to a similar machine to check.
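
A quick way to spot such malformed directives is to list every line that actually starts with the expected prefix and compare against the template (a sketch; the path is the one from the tbg call above):

grep -n "^#BSUB" ~/picOutput/LaserWakefield/tbg/submit.start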

sbastrakov commented 3 years ago

In case that does not fix the problem, the relevant information may not be in submit.start alone, as otherwise it looks fine to me. According to the documentation linked by Axel, it is probably worth looking at the output of bjobs -l JOBID to see whether the partition and other settings are applied correctly.

danlevy100 commented 3 years ago

Thanks @sbastrakov for having a look. I saw this but thought that maybe it was just meant to comment out the line. I removed the space anyway and resubmitted, but it is still a no-go. As for bjobs -l, since the job is never actually submitted, there is no information to display.

PrometheusPi commented 3 years ago

Since the error message was about memory, could you please decrease the requested memory from #BSUB -M 8192 to half of it? Also, why do the FBPIC runs use so much more memory (16000), or is this defined differently?

EDIT: This value is given in kB - thus 8192 kB is definitely very low. Please adjust the requested memory accordingly and use the same value as with FBPIC:

-R "rusage[mem=16000]"

PrometheusPi commented 3 years ago

Furthermore, you seem to not define a project in #BSUB -P. I am not sure how this is handled; since your FBPIC run does not define a project, I assume you have a default one or none is used. Please try to remove that line - perhaps setting an empty project creates an error, while setting none just uses the default.

PrometheusPi commented 3 years ago

If this does not work, we could schedule a video meeting to try things out live.

ax3l commented 3 years ago

Yep, we tried those already. I guess you will be most efficient with a VC :)

danlevy100 commented 3 years ago

Something that should be mentioned: the way things are set up, I have installed PIConGPU at the node level (in an "interactive session", like getNode on hemera). Submitting a job is thus only possible from the node level. Perhaps this was a mistake, but I could not get things to work otherwise.

When submitting a job, it appears that the memory is limited by the memory requested for the interactive session. Strange, but I think that is the case.

Also, as far as I understand it, the error is not a memory error but a "femalka: No such queue" error.

A VC would be great. I'm available throughout most of the day tomorrow and on Friday, if that works for you.

PrometheusPi commented 3 years ago

@danlevy100 Okay then let's do a VC tomorrow. @sbastrakov Do you want to join as well?

sbastrakov commented 3 years ago

I can

danlevy100 commented 3 years ago

Does 14:00 Dresden time work for you?

PrometheusPi commented 3 years ago

@danlevy100 That would be fine with me. How about you, @sbastrakov? To work together on the submit file more effectively (rather than just suggesting changes to a submit file we see via screen sharing), I would recommend the Atom editor together with the Teletype package, so that we can all type together. Would that be fine with you two?

PrometheusPi commented 3 years ago

@danlevy100 Is the following submit script queued/executed by LSF?

#!/usr/bin/env bash
#BSUB -J test 
#BSUB -o test.out 
#BSUB -e test.err
#BSUB -q gpu-short 
#BSUB -gpu "num=1:mode=shared:j_exclusive=no" 
#BSUB -R "rusage[mem=16000]" 

hostname
nvidia-smi

and then just submitted via bsub without extra arguments?

sbastrakov commented 3 years ago

That is fine with me as well

danlevy100 commented 3 years ago

@danlevy100 Is the following submit script queued/executed by LSF?

#!/usr/bin/env bash
#BSUB -J test 
#BSUB -o test.out 
#BSUB -e test.err
#BSUB -q gpu-short 
#BSUB -gpu "num=1:mode=shared:j_exclusive=no" 
#BSUB -R "rusage[mem=16000]" 

hostname
nvidia-smi

and then just submitted via bsub without extra arguments?

bsub in fact fails with the same error.

danlevy100 commented 3 years ago

I could also try to get a cluster admin to join our meeting - do you think that could prove useful?

danlevy100 commented 3 years ago

UPDATE: I got this script to work. The secret is to execute bsub < test_script.sh and not bsub test_script.sh.
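
For reference, the difference between the two invocations (as far as we understand this LSF setup) is:

# script passed as an argument: treated as the command to run;
# on this installation the #BSUB lines in its header are not parsed
bsub test_script.sh

# script fed via stdin: bsub parses the embedded #BSUB directives
bsub < test_script.sh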

PrometheusPi commented 3 years ago

Yes, if a cluster admin could join the meeting, that would be great :+1:

PrometheusPi commented 3 years ago

@danlevy100 Yes, you are right, this seems to be different in bsub - see https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_admin/job_scripts_writing.html

PrometheusPi commented 3 years ago

@danlevy100 According to the above reference, the LSB_BSUB_PARSE_SCRIPT parameter in lsf.conf should be set to Y. Could you please check whether this variable can be overridden at the shell level:

echo $LSB_BSUB_PARSE_SCRIPT
export LSB_BSUB_PARSE_SCRIPT="Y"
bsub test_file_from_above.sh

danlevy100 commented 3 years ago

@danlevy100 According to the above reference, the LSB_BSUB_PARSE_SCRIPT parameter in lsf.conf should be set to Y. Could you please check whether this variable can be overridden at the shell level:

echo $LSB_BSUB_PARSE_SCRIPT
export LSB_BSUB_PARSE_SCRIPT="Y"
bsub test_file_from_above.sh

Setting the variable does work (i.e., echoing it after export gives Y) but bsub still doesn't work without the <.

PrometheusPi commented 3 years ago

This explains why your job is not submitted with tbg: bsub tbg/submit.start cannot work if bsub < tbg/submit.start is what is needed.

PrometheusPi commented 3 years ago

@danlevy100 Could you please try:

bsub -Zs test_file_from_above.sh

sbastrakov commented 3 years ago

I guess that is the explanation. To work around it, one could add this < by manually modifying that piece of tbg to read $submit_command < tbg/submit.start. Or we can do it together in a VC.
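
An alternative to patching tbg would be a tiny wrapper passed to tbg via -s (a sketch only; it assumes tbg appends the path of the generated submit.start as the last argument of the submit command, and the name bsub_stdin is made up):

#!/usr/bin/env bash
# bsub_stdin: forward the job script to bsub on stdin so LSF parses its #BSUB lines
bsub < "${@: -1}"

It would then be used as tbg -s bsub_stdin -c etc/picongpu/1.cfg -t etc/picongpu/wexac-wis/gpu_batch.tpl ~/picOutput/LaserWakefield -f.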

danlevy100 commented 3 years ago

@danlevy100 Could you please try:

bsub -Zs test_file_from_above.sh

Still no go

sbastrakov commented 3 years ago

Wait, but tbg/submit.start is already a shell script. So perhaps just its first line has to be changed? For example to #!/bin/bash, as given in the quick guide.

danlevy100 commented 3 years ago

This explains why your job is not submitted with tbg: bsub tbg/submit.start cannot work if bsub < tbg/submit.start is what is needed.

I have actually tried bsub < tbg/submit.start and got a new error:

When LSB_CSM_JOBS is not set to Y and -csm is not set then other csm options are not allowed. Job not submitted.

danlevy100 commented 3 years ago

Submitting again after setting export LSB_CSM_JOBS="Y" gives:

You cannot specify -R/-M/-n/LSB_DEFAULT_RESREQ when CSM Easy Mode job submission is enabled. Job not submitted.

PrometheusPi commented 3 years ago

This definitely looks as if the cluster does not allow scripted job files. Thus, if LSB_CSM_JOBS != Y, other options that were used in the *.cfg are disabled. How they would like to handle job scripts on their cluster is something a cluster admin has to answer.

danlevy100 commented 3 years ago

The admin will be there. Hopefully we can solve it together with him.

https://weizmann.zoom.us/j/99360687398?pwd=dkhTeDJQaHltYWlnelM5cnNaR2o4UT09

PrometheusPi commented 3 years ago

We could get PIConGPU to run, but it only worked if all tasks were on the same node. mpiexec -n seems to schedule tasks only on the MPI-rank-0 node. However, LSF does schedule multiple nodes, as can be seen by checking the variable $LSB_HOSTS while running. Thus there seems to be some misconfiguration in how mpiexec finds the available machines (it seems to use only the first one in that list). To get to multiple nodes, we manually defined a machinefile and used it via mpiexec --machinefile as follows:

echo $LSB_HOSTS | sed -e 's/ /\n/g' > machinefile.txt
mpiexec -n 16 --machinefile machinefile.txt hostname

This apparently told mpiexec to use the nodes scheduled by LSF, but when mpiexec tried to connect to these nodes via ssh, it failed with an authentication error.

psychocoderHPC commented 3 years ago

Is it possible to ask the admin how to start MPI jobs on multiple nodes? I would say MPI is not compiled with support for the batch system, and therefore MPI is not using the information stored in $LSB_HOSTS.

sbastrakov commented 3 years ago

Yes, that's the plan. We were told that not many users run multi-node jobs there and that it may require a certain MPI version to work. That shouldn't be a problem once we know which version it is.

PrometheusPi commented 3 years ago

Update: There is no password-less ssh into the GPU nodes. That is only available on non-GPU nodes or for admins. The next test will be to run multi-node GPU jobs together with an admin in admin mode.

danlevy100 commented 3 years ago

Another update: We eventually gave up on the Spack approach and went with modules.

Using openmpi/2.0.1, we finally got MPI to work today. We successfully ran a "bare-bones" version of PIConGPU. Now we need to install the remaining modules, which I will do with the help of the cluster admin next week.

Here is the simple.profile file that we used:

module load gcc/6.3.0
module load cmake/3.18.4
module load openmpi/2.0.1
module load cuda/9.2
module load boost/1.69.0

export CXX=$(which g++)
export CC=$(which gcc)

export PICSRC=$HOME/src/picongpu
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PIC_BACKEND="cuda:70"

export PATH=$PATH:$PICSRC
export PATH=$PATH:$PICSRC/bin
export PATH=$PATH:$PICSRC/src/tools/bin
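
For context, after sourcing such a profile the usual PIConGPU workflow would look roughly like this (a sketch; the input/output paths are placeholders and the .tpl is the one from this thread):

source simple.profile
pic-create $PIC_EXAMPLES/LaserWakefield $HOME/picInputs/lwfa
cd $HOME/picInputs/lwfa
pic-build
tbg -s bsub -c etc/picongpu/1.cfg -t etc/picongpu/wexac-wis/gpu_batch.tpl $HOME/picOutput/lwfa
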
PrometheusPi commented 3 years ago

@danlevy100 As promised on Monday, you can find a setup script here: https://gist.github.com/PrometheusPi/3b873c754fbb0f0a2684480d0969410f

Please be aware of the comments that state which lines should be copied to your picongpu.profile as well.

I have not yet tested that script, so there might still be some bugs in it. If any install step fails, please let me know.

After you have installed all dependencies, you should be able to run PIConGPU as on hemera. If that is the case, I would be very happy if you could share a submit.start file here, so that we can develop a general *.tpl file for the Wexac cluster.
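
Until then, based on the interactive test script that already worked above, the LSF header of such a .tpl would presumably start roughly like this (untested sketch; !TBG_jobName and !TBG_queue are placeholder tbg variables, and the memory request is simply copied from the FBPIC example):

#!/usr/bin/env bash
#BSUB -J !TBG_jobName
#BSUB -o stdout.%J
#BSUB -e stderr.%J
#BSUB -q !TBG_queue
#BSUB -gpu "num=1:mode=shared:j_exclusive=no"
#BSUB -R "rusage[mem=16000]"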

danlevy100 commented 3 years ago

@PrometheusPi That is really wonderful. Thanks!!

I gave it a shot and ran into a couple of issues:

  1. The curl command for zlib should include the filename as well. Not a problem, installed correctly.
  2. openPMD failed to install. I tried but could not solve it. The error log is attached.

openPMD_install_fail.txt

psychocoderHPC commented 3 years ago

@PrometheusPi That is really wonderful. Thanks!!

I gave it a shot and ran into a couple of issues:

  1. The curl command for zlib should include the filename as well. Not a problem, installed correctly.
  2. openPMD failed to install. I tried but could not solve it. The error log is attached.

openPMD_install_fail.txt

The linker error says you should compile ADIOS with -fPIC enabled. You should use:

./configure CFLAGS=-fPIC CXXFLAGS=-fPIC --enable-static --enable-shared --prefix=$LIB/adios --with-mpi=$MPI_ROOT --with-zlib=$LIB/zlib --with-blosc=$LIB/c-blosc 

@PrometheusPi Could you please update your gist.

For testing, ADIOS1 is fine, but I would suggest switching to ADIOS2: there is no real support for ADIOS1 anymore, and openPMD-api also works much better with ADIOS2.

PrometheusPi commented 3 years ago

@danlevy100 Thanks - yes, I quickly changed my initial wget command to curl but forgot that curl requires an explicit output file. 😓 I have now changed it back to wget.

@psychocoderHPC I fixed the gist. Thanks for taking a look at it. Is the readthedocs documentation correct, or is the order wrong or the CXX flag missing there?

PrometheusPi commented 3 years ago

@danlevy100 It might be that you have to rebuild libpng as well. It might have linked against the system zlib rather than the one you installed. I fixed the gist accordingly.

danlevy100 commented 3 years ago

Alright, it seems like everything installed fine. However, running the simulation fails with an openPMD error:

../LWF/input/bin/picongpu: error while loading shared libraries: libopenPMD.so: cannot open shared object file: No such file or directory

I reinstalled everything with the new script and rebuilt the simulation, but it is still a no-go.

P.S. There's a small typo in the gist: wegt -> wget.

PrometheusPi commented 3 years ago

@danlevy100 Sorry for the typo 😓 - I fixed the gist.

I have an idea: could you please check whether there is a lib directory in $LIB/openPMD-api/, or only a lib64 directory? If there is only lib64, please change the LD_LIBRARY_PATH extension to:

-export LD_LIBRARY_PATH="$LIB/openPMD-api/lib:$LD_LIBRARY_PATH"
+export LD_LIBRARY_PATH="$LIB/openPMD-api/lib64:$LD_LIBRARY_PATH"
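
A quick way to check which of the two exists (assuming $LIB points at the install prefix used by the setup script):

ls -d $LIB/openPMD-api/lib*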