Closed boazmohar closed 1 year ago
Hi,
Interesting idea. The main reason for supporting Slurm only is that I have never had access to a distributed system using some other scheduler than Slurm. Python libraries such as Dask and Ray won't work as pi2 is not based on Python. We have a Python interface, but the core is not Python and the functionality is available through other interfaces, too.
Adding support for another scheduler should not be too hard, however. For that I need shell commands to
In summary: if you can get me temporary access to an LSF cluster, I can probably add support for LSF relatively easily.
Best, Arttu
Hi @arttumiettinen,
Thanks for the detailed reply! And for this very impressive library. It worked amazingly well on my downsampled data on the first try.
I can certainly get you the commands and the options to do that on an LSF cluster. Also, I am meeting with our IT department today, so I can ask about giving you remote access.
Otherwise, I can try to debug if you would be willing to help me set up the development environment.
I have a few other questions regarding the best use parameters for the library on my data when it is not downsampled, any chance we could schedule a quick chat?
Thanks again! Boaz
Here is a start to the description for an LSF cluster. You can also use this to compare commands.
Submit a job with bsub: bsub [options] command [arguments]
Common options:
-q  Queue name ("short" for <1h, "local" for >1h); I would like to have the ability to set this.
-n  Submits a parallel job and specifies the number of tasks in the job (how many CPUs).
-e  Appends the standard error output of the job to the specified file path.
-o  Appends the standard output of the job to the specified file path.
-J  Assigns the specified name to the job.
-M  Sets a memory limit for all the processes that belong to the job.
-W  Sets the runtime limit of the job (e.g. 3:00 for 3 h).
-E  Runs the specified job-based pre-execution command on the execution host before actually running the job.
-P  Assigns the job to the specified project.
-r  Reruns a job if the execution host or the system fails; it does not rerun a job if the job itself fails.
Example:
bsub -q short -n 32 -J example_name -e ~/error-%J.txt -o ~/output-%J.txt pi2 pi2options
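For anyone assembling such a command line from Python, a minimal sketch could look like the following. The helper name build_bsub_command and its parameters are hypothetical and not part of pi2; only the bsub flags are taken from the option list above.

```python
import shlex


def build_bsub_command(queue, n_tasks, job_name, command, out_file=None, err_file=None):
    """Assemble a bsub command line (hypothetical helper, not part of pi2)."""
    args = ["bsub", "-q", queue, "-n", str(n_tasks), "-J", job_name]
    if err_file is not None:
        args += ["-e", err_file]
    if out_file is not None:
        args += ["-o", out_file]
    # The program to run and its own arguments come last.
    args += list(command)
    # Quote each argument so that spaces or shell metacharacters survive.
    return " ".join(shlex.quote(a) for a in args)


print(build_bsub_command("short", 32, "example_name", ["pi2", "pi2options"]))
```

Note that shlex.quote would also quote a leading "~", which stops the shell from expanding it, so absolute or relative paths are the safer choice for the output files here.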
Job info: bjobs [jobid]
% bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1266 user1 RUN normal hosta hostb sleep 60 Jun 5 17:39:58
From Python we would query the state using:
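import subprocess

def get_status(jobids):
    """Return job id, state and execution host for the given LSF job id(s)."""
    jobs = []
    # Discard only stderr ("2>"); redirecting stdout would leave nothing to parse.
    command = "bjobs -X -noheader -o \"JOBID STAT EXEC_HOST\" {jobid} 2> /dev/null".format(jobid=jobids)
    # check_output returns bytes in Python 3, so decode before splitting.
    bjobsout = subprocess.check_output(command, shell=True).decode()
    for outline in bjobsout.splitlines():
        outline = outline.split()
        # EXEC_HOST can look like "16*hostb"; keep only the host name after the slot count.
        job = {'jobid': outline[0], 'status': outline[1], 'host': outline[2].split("*")[-1]}
        jobs.append(job)
    return jobs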
Stop a job: bkill jobid
Get memory (would need a host name): lshosts
(base) [login1 - moharb@e05u15]~>lshosts -o "maxmem" e10u01
maxmem
772215M
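A value like the one above could be turned into a number of bytes roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes that lshosts prints a header line followed by a value with a one-letter binary unit suffix (K, M, G or T), as in the output shown.

```python
# Assumption: suffixes are binary multiples (M = MiB and so on).
UNIT_FACTORS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_memory(text):
    """Convert a string such as '772215M' to bytes."""
    text = text.strip()
    if not text or text[-1].upper() not in UNIT_FACTORS:
        raise ValueError("Unrecognised memory value: {!r}".format(text))
    return int(float(text[:-1]) * UNIT_FACTORS[text[-1].upper()])


def maxmem_from_lshosts(output):
    """Take the last non-empty line of the lshosts output and parse it."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("Empty lshosts output")
    return parse_memory(lines[-1])


print(maxmem_from_lshosts("maxmem\n772215M\n"))
```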
Hi @boazmohar,
Here is a start to the description for an LSF cluster
Thanks, the information was very useful. I added an initial draft of LSF cluster support. It is found in the experimental branch, so in order to compile it you need to do
git checkout experimental
make -j16 NO_OPENCL=1
This assumes that the LSF system uses the gcc compiler. You might need to load modules etc. before the compilation works.
The LSF cluster settings are found in file lsf_config.txt. There you can specify
In principle, the system should be able to submit jobs using bsub. Arguments -J, -o, -e, and -E are automatically set by the system, and the rest you can specify using the config file. I assume that bsub outputs a line like
Job <930> is submitted to default queue <something>.
and from that I parse the job id that is 930 in the above example.
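The parsing step described above can be illustrated with a short Python sketch. This is not the pi2 implementation (the core is not Python); the function name and the regular expression are assumptions for illustration. It scans every line of the output instead of assuming a fixed line count, since a site may print additional lines around the "Job <...>" line.

```python
import re

# Matches e.g. "Job <930> is submitted to default queue <something>."
JOB_LINE = re.compile(r"^Job <(\d+)> is submitted to .*queue <[^>]*>\.?\s*$")


def parse_bsub_job_id(output):
    """Return the job id found in bsub output, or raise ValueError."""
    for line in output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            return int(match.group(1))
    raise ValueError("Unexpected bsub output: {!r}".format(output))


print(parse_bsub_job_id("Job <930> is submitted to default queue <something>."))
```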
The job state is queried using
bjobs -X -noheader -o "STAT" jobid
I assume that the command outputs one line, and
If necessary, the system cancels jobs using
bkill jobid
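How the STAT value returned by bjobs might be mapped to a coarse job state is sketched below. The category names ("active", "finished", "failed", "unknown") and the function name are hypothetical; the STAT values are the standard LSF job states, and treating suspended jobs as still active is an assumption made here.

```python
# LSF states that mean the job is still queued, running or suspended.
ACTIVE_STATES = {"PEND", "RUN", "PSUSP", "USUSP", "SSUSP", "WAIT"}


def classify_lsf_state(bjobs_output):
    """Map the output of: bjobs -X -noheader -o "STAT" jobid to a coarse state."""
    lines = [line.strip() for line in bjobs_output.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    stat = lines[-1].upper()
    if stat == "DONE":
        return "finished"
    if stat == "EXIT":
        return "failed"
    if stat in ACTIVE_STATES:
        return "active"
    return "unknown"


for sample in ("PEND", "RUN", "DONE", "EXIT"):
    print(sample, "->", classify_lsf_state(sample))
```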
I have not yet figured out how to find out if a job is cancelled due to execution time limit, and how to list the amount of RAM in each node. The latter can probably be done using the lshosts somehow.
Perhaps you could try the current version and point out what goes wrong? Most probably everything is broken in this phase, so don't expect too much yet. For testing you can use Python commands like this:
from pi2py2 import *
pi2 = Pi2()
pi2.distribute(Distributor.LSF)
pi2.maxmemory(1)
img = pi2.newimage(ImageDataType.UINT16, 100, 100, 100)
pi2.add(img, 10)
pi2.writeraw(img, "lsf_result")
In order to fix problems, I will need the output printed by commands similar to the ones above.
I have a few other questions regarding the best use parameters for the library on my data when it is not downsampled, any chance we could schedule a quick chat?
Please contact me at my email arttu dot i dot miettinen at jyu dot fi to find out a suitable slot in our calendars.
Hi @arttumiettinen, wow, that was a quick implementation. THANKS! I ran this:
In [4]:
...: from pi2py2 import *
In [5]: pi2 = Pi2()
...: pi2.distribute(Distributor.LSF)
...: pi2.maxmemory(1)
...: img = pi2.newimage(ImageDataType.UINT16, 100, 100, 100)
...: pi2.add(img, 10)
...: pi2.writeraw(img, "lsf_result")
Here is the first error:
Enabling distributed computing mode using LSF workload manager.
Memory per node in the LSF cluster: 175.78 GiB
Memory per node in the LSF cluster: 1 MiB
Job skipping is not allowed as there are in-place processed images that are not saved to the disk yet.
Submitting 2 jobs, each estimated to require at most 0.95 MiB of RAM...
bsub arguments: -J pi2-0-9830 -o ./lsf-io-files/pi2-0-9830-out.txt -e ./lsf-io-files/pi2-0-9830-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-0-9830-in.txt
bsub output: This job will be billed to svoboda
Job <111861622> is submitted to queue <short>.
Command
bsub -J pi2-0-9830 -o ./lsf-io-files/pi2-0-9830-out.txt -e ./lsf-io-files/pi2-0-9830-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-0-9830-in.txt
returned
This job will be billed to svoboda
Job <111861622> is submitted to queue <short>.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-5-95fee4ab1837> in <module>
4 img = pi2.newimage(ImageDataType.UINT16, 100, 100, 100)
5 pi2.add(img, 10)
----> 6 pi2.writeraw(img, "lsf_result")
~/pi2/bin-linux64/release-nocl/pi2py2.py in <lambda>(*args)
491 doc = self.pilib.help(self.piobj, f"{cmd_name}".encode('UTF-8')).decode('UTF-8')
492
--> 493 func = lambda *args: self.run_command(cmd_name, args)
494 func.__doc__ = doc
495 setattr(self, cmd_name, func)
~/pi2/bin-linux64/release-nocl/pi2py2.py in run_command(self, cmd_name, args)
640 cmd_line = f"{cmd_name}({arg_line})"
641
--> 642 self.run_script(cmd_line)
643
644 # Temporary images are automatically cleared when they go out of scope.
~/pi2/bin-linux64/release-nocl/pi2py2.py in run_script(self, script)
596
597 if not self.pilib.run(self.piobj, script.encode('UTF-8')):
--> 598 self.raise_last_error()
599
600
~/pi2/bin-linux64/release-nocl/pi2py2.py in raise_last_error(self)
586
587 err = self.pilib.lastErrorMessage(self.piobj).decode('UTF-8')
--> 588 raise RuntimeError(err)
589
590
RuntimeError: Unxpected bsub output.
In [6]:
Here is the lsf-io-files.zip. And there is an image with value 10 and size 100x100x100 in tmp_images.
Looks pretty close to me :)
I will email you about setting a time soon.
Boaz
There was an easy fix for that:
There were 2 lines and the second line was the one starting with "Job <":
vector<string> lines = split(result);
try
{
    if (lines.size() == 2)
    {
        string line = lines[1];
        if (startsWith(line, "Job <"))
No, they are checked, but I don't know why 2 jobs are submitted. Here is a sample of the output:
Job skipping is not allowed as there are in-place processed images that are not saved to the disk yet.
Submitting 2 jobs, each estimated to require at most 0.95 MiB of RAM...
bsub arguments: -J pi2-0-3191 -o ./lsf-io-files/pi2-0-3191-out.txt -e ./lsf-io-files/pi2-0-3191-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-0-3191-in.txt
bsub output: This job will be billed to svoboda
Job <111862112> is submitted to queue <short>.
bsub arguments: -J pi2-1-3191 -o ./lsf-io-files/pi2-1-3191-out.txt -e ./lsf-io-files/pi2-1-3191-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-1-3191-in.txt
bsub output: This job will be billed to svoboda
Job <111862113> is submitted to queue <short>.
Waiting for jobs to finish...
bjobs arguments: -X -noheader -o "STAT" 111862112
bjobs output: PEND
bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: PEND
bjobs arguments: -X -noheader -o "STAT" 111862112
bjobs output: PEND
bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: PEND
bjobs arguments: -X -noheader -o "STAT" 111862112
bjobs output: RUN
bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: PEND
This repeats a lot more within a few seconds. Then this happens:
bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: RUN
bjobs arguments: -X -noheader -o "STAT" 111862114
bjobs output: RUN
bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: DONE
Re-submitting failed job 1. (No error message available.)
bsub arguments: -J pi2-1-3191 -o ./lsf-io-files/pi2-1-3191-out.txt -e ./lsf-io-files/pi2-1-3191-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-1-3191-in.txt
bsub output: This job will be billed to svoboda
Job <111862115> is submitted to queue <short>.
bjobs arguments: -X -noheader -o "STAT" 111862114
bjobs output: RUN
bjobs arguments: -X -noheader -o "STAT" 111862115
bjobs output: PEND
It seems to be an issue with getting the last line of the output.
Got it. I need to add the -Ne flag to bsub so it won't dirty the .out file:
string bsubArgs = string("") + "-J " + jobName + " -o " + outputName + " -e " + errorName + " -Ne " + initStr + extraArgs(jobType) + " " + jobCmdLine;
@arttumiettinen I hit another roadblock. Could you help me figure out what I need to change in base.py, in the function run_pi2(pi_script, output_prefix)? It still uses sbatch and not LSF.
Thanks!
Sorry that it took some time to reply, but the latest commits to the experimental branch should now fix many of the problems you found above. A new version of the nr_stitcher script is now able to use an LSF cluster, too. Could you @boazmohar try the latest version to see what kind of errors we get now?
Closing this issue due to no interest from the OP.
Hi,
I was wondering if you have plans to support other distributed systems besides Slurm. One idea is adding support for specific clusters (I only have access to an LSF cluster); another is supporting a separate framework for the distributed part (like Dask or Ray), which in turn supports many deployment options. Is that something you would consider?
Thanks!