arttumiettinen / pi2

C++ library and command-line software for processing and analysis of terabyte-scale volume images locally or on a computing cluster.
GNU General Public License v3.0

Feature request: running on other distributed systems #4

Closed boazmohar closed 1 year ago

boazmohar commented 2 years ago

Hi,

I was wondering if you have plans to support other distributed systems besides Slurm. One idea is adding support for specific clusters (I only have access to an LSF cluster). Another is supporting a framework such as Dask or Ray for the distributed part, since these in turn support many deployment options. Is that something you would consider?

Thanks!

arttumiettinen commented 2 years ago

Hi,

Interesting idea. The main reason for supporting Slurm only is that I have never had access to a distributed system using some other scheduler than Slurm. Python libraries such as Dask and Ray won't work as pi2 is not based on Python. We have a Python interface, but the core is not Python and the functionality is available through other interfaces, too.

Adding support for another scheduler should not be too hard, however. For that I need shell commands to

In summary: if you can get me temporary access to an LSF cluster, I can probably add support for LSF relatively easily.

Best, Arttu

boazmohar commented 2 years ago

Hi @arttumiettinen,

Thanks for the detailed reply! And for this very impressive library. It worked amazingly for my data on the first try when it was downsampled.

I can certainly get you the commands and the options to do that on an LSF cluster. Also, I am meeting with our IT department today, so I can ask about giving you remote access.

Otherwise I can try to debug if you would be willing to help me set up the development environment.

I have a few other questions regarding the best use parameters for the library on my data when it is not downsampled, any chance we could schedule a quick chat?

Thanks again! Boaz

boazmohar commented 2 years ago

Here is a start to the description for an LSF cluster. You can also use this to compare commands.

Submit a job with bsub: bsub [options] command [arguments]. Common options:

Example: bsub -q short -n 32 -J example_name -e ~/error-%J.txt -o ~/output-%J.txt pi2 pi2options
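
A command line like the example above could be assembled programmatically roughly as follows. This is only a minimal sketch: the function name build_bsub_command and its parameters are hypothetical, and it uses only the bsub options that appear in the example (-q, -n, -J, -e, -o). It is not how pi2 builds its command line.

```python
import shlex

def build_bsub_command(queue, slots, job_name, err_file, out_file, command):
    """Return a bsub command line as a single shell-safe string.

    Hypothetical helper: uses only the options from the example above.
    `command` is a list such as ["pi2", "pi2options"].
    """
    args = [
        "bsub",
        "-q", queue,        # queue name
        "-n", str(slots),   # number of tasks (slots) requested
        "-J", job_name,     # job name
        "-e", err_file,     # standard error file (%J expands to the job ID)
        "-o", out_file,     # standard output file
    ]
    args.extend(command)
    # Quote every argument so spaces or special characters survive the shell.
    return " ".join(shlex.quote(a) for a in args)

print(build_bsub_command("short", 32, "example_name",
                         "error-%J.txt", "output-%J.txt",
                         ["pi2", "pi2options"]))
```

Note that quoting prevents the shell from expanding a leading ~, so paths in the home directory would need to be expanded (for example with os.path.expanduser) before they are passed in.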

Job info: bjobs [jobid]

% bjobs
JOBID USER      STAT  QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME    SUBMIT_TIME
1266  user1     RUN   normal    hosta       hostb       sleep 60    Jun 5 17:39:58

From Python we would query the state using:

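import subprocess

def get_status(jobids):
  # jobids: one job ID, or several IDs separated by spaces, as a string
  jobs = []
  # 2> /dev/null hides bjobs error messages but keeps stdout for parsing
  command = "bjobs -X -noheader -o \"JOBID STAT EXEC_HOST\" {jobid} 2> /dev/null".format(jobid=jobids)
  # check_output returns bytes in Python 3, so decode before splitting
  bjobsout = subprocess.check_output(command, shell=True).decode()
  for outline in bjobsout.splitlines():
    outline = outline.split()
    # lstrip removes leading '1', '6' and '*' characters, e.g. the "16*" in "16*hostname"
    job = {'jobid':outline[0], 'status':outline[1], 'host':outline[2].lstrip("16*")}
    jobs.append(job)
  return jobs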

Stop a job: bkill jobid

Get memory (would need a host name): lshosts

(base) [login1 - moharb@e05u15]~>lshosts -o "maxmem" e10u01
maxmem
772215M

arttumiettinen commented 2 years ago

Hi @boazmohar,

Here is a start to the description for an LSF cluster

Thanks, the information was very useful. I added an initial draft of LSF cluster support. It is found in the experimental branch, so in order to compile it you need to do

git checkout experimental
make -j16 NO_OPENCL=1

This assumes that the LSF system uses the gcc compiler. You might need to load modules etc. before the compilation works.

The LSF cluster settings are found in file lsf_config.txt. There you can specify

In principle, the system should be able to submit jobs using bsub. Arguments -J, -o, -e, and -E are automatically set by the system, and the rest you can specify using the config file. I assume that bsub outputs a line like

Job <930> is submitted to default queue <something>.

and from that I parse the job id, which is 930 in the above example.

The job state is queried using bjobs -X -noheader -o "STAT" jobid. I assume that the command outputs one line, and

If necessary, the system cancels jobs using bkill jobid.
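
The query-and-wait cycle described above can be sketched as a small polling loop. This is an illustration under assumptions, not pi2's actual implementation: wait_for_jobs and query_state are hypothetical names, query_state stands in for whatever runs the bjobs command, and only DONE and EXIT are treated as final states.

```python
import time

# LSF job states after which the job no longer changes state on its own.
FINAL_STATES = {"DONE", "EXIT"}

def wait_for_jobs(job_ids, query_state, poll_interval=5.0, sleep=time.sleep):
    """Poll until every job reaches DONE or EXIT; return {job_id: final_state}.

    `query_state` is a caller-supplied function (hypothetical here) that runs
    something like `bjobs -X -noheader -o "STAT" <job_id>` and returns the
    state string, e.g. "PEND", "RUN", "DONE" or "EXIT".
    """
    pending = list(job_ids)
    finished = {}
    while pending:
        still_pending = []
        for job_id in pending:
            state = query_state(job_id).strip()
            if state in FINAL_STATES:
                finished[job_id] = state
            else:
                still_pending.append(job_id)
        pending = still_pending
        if pending:
            # Pause between rounds so the scheduler is not flooded with queries.
            sleep(poll_interval)
    return finished

# Demo with a fake scheduler: each job reports RUN once, then DONE.
calls = {}
def fake_query(job_id):
    calls[job_id] = calls.get(job_id, 0) + 1
    return "RUN\n" if calls[job_id] < 2 else "DONE\n"

print(wait_for_jobs(["930", "931"], fake_query, poll_interval=0, sleep=lambda s: None))
```

Passing the query function and the sleep function as parameters keeps the loop testable without access to a cluster.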

I have not yet figured out how to find out if a job is cancelled due to the execution time limit, or how to list the amount of RAM in each node. The latter can probably be done using lshosts somehow.
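
One possible way to get the node memory is to parse the lshosts output shown earlier in the thread (a "maxmem" header followed by a value such as 772215M). The sketch below makes several assumptions: parse_maxmem is a hypothetical helper, the unit suffix is taken to be one of K, M, G or T, and the units are treated as binary multiples, which should be checked against the cluster's LSF configuration.

```python
import re

# Binary multipliers for the unit suffixes (assumed: K, M, G, T).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_maxmem(lshosts_output):
    """Return the memory in bytes from `lshosts -o "maxmem" <host>` output.

    Expects a header line ("maxmem") followed by a value such as "772215M".
    Returns None if no value with a recognised suffix is found.
    """
    for line in lshosts_output.splitlines():
        match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT])", line.strip())
        if match:
            return int(float(match.group(1)) * _UNITS[match.group(2)])
    return None

print(parse_maxmem("maxmem\n772215M\n"))
```

The header line is skipped automatically because it does not match the number-plus-suffix pattern.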

Perhaps you could try the current version and point out what goes wrong? Most probably everything is broken in this phase, so don't expect too much yet. For testing you can use Python commands like this:

from pi2py2 import *
pi2 = Pi2()
pi2.distribute(Distributor.LSF)
pi2.maxmemory(1)
img = pi2.newimage(ImageDataType.UINT16, 100, 100, 100)
pi2.add(img, 10)
pi2.writeraw(img, "lsf_result")

In order to fix problems, I will need the output printed by commands similar to above.

I have a few other questions regarding the best use parameters for the library on my data when it is not downsampled, any chance we could schedule a quick chat?

Please contact me at my email arttu dot i dot miettinen at jyu dot fi to find out a suitable slot in our calendars.

boazmohar commented 2 years ago

Hi @arttumiettinen,

Wow, that was a quick implementation. THANKS! I ran this:

In [4]:
   ...: from pi2py2 import *

In [5]: pi2 = Pi2()
   ...: pi2.distribute(Distributor.LSF)
   ...: pi2.maxmemory(1)
   ...: img = pi2.newimage(ImageDataType.UINT16, 100, 100, 100)
   ...: pi2.add(img, 10)
   ...: pi2.writeraw(img, "lsf_result")

Here is the first error:

Enabling distributed computing mode using LSF workload manager.
Memory per node in the LSF cluster: 175.78 GiB
Memory per node in the LSF cluster: 1 MiB
Job skipping is not allowed as there are in-place processed images that are not saved to the disk yet.
Submitting 2 jobs, each estimated to require at most 0.95 MiB of RAM...
bsub arguments: -J pi2-0-9830 -o ./lsf-io-files/pi2-0-9830-out.txt -e ./lsf-io-files/pi2-0-9830-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-0-9830-in.txt
bsub output: This job will be billed to svoboda
Job <111861622> is submitted to queue <short>.

Command
bsub -J pi2-0-9830 -o ./lsf-io-files/pi2-0-9830-out.txt -e ./lsf-io-files/pi2-0-9830-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-0-9830-in.txt
returned
This job will be billed to svoboda
Job <111861622> is submitted to queue <short>.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-95fee4ab1837> in <module>
      4 img = pi2.newimage(ImageDataType.UINT16, 100, 100, 100)
      5 pi2.add(img, 10)
----> 6 pi2.writeraw(img, "lsf_result")

~/pi2/bin-linux64/release-nocl/pi2py2.py in <lambda>(*args)
    491         doc = self.pilib.help(self.piobj, f"{cmd_name}".encode('UTF-8')).decode('UTF-8')
    492
--> 493         func = lambda *args: self.run_command(cmd_name, args)
    494         func.__doc__ = doc
    495         setattr(self, cmd_name, func)

~/pi2/bin-linux64/release-nocl/pi2py2.py in run_command(self, cmd_name, args)
    640         cmd_line = f"{cmd_name}({arg_line})"
    641
--> 642         self.run_script(cmd_line)
    643
    644         # Temporary images are automatically cleared when they go out of scope.

~/pi2/bin-linux64/release-nocl/pi2py2.py in run_script(self, script)
    596
    597         if not self.pilib.run(self.piobj, script.encode('UTF-8')):
--> 598             self.raise_last_error()
    599
    600

~/pi2/bin-linux64/release-nocl/pi2py2.py in raise_last_error(self)
    586
    587         err = self.pilib.lastErrorMessage(self.piobj).decode('UTF-8')
--> 588         raise RuntimeError(err)
    589
    590

RuntimeError: Unxpected bsub output.

In [6]:

Here is the lsf-io-files.zip. And there is an image with value 10 and size 100x100x100 in tmp_images.

Looks pretty close to me :)

I will email you about setting a time soon.

Boaz

boazmohar commented 2 years ago

There was an easy fix for that: the bsub output had 2 lines, and the second line was the one containing the job id. I fixed that:

        vector<string> lines = split(result);
        try
        {
            if (lines.size() == 2)
            {
                string line = lines[1];
                if (startsWith(line, "Job <"))
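
For comparison, the same idea can be sketched in Python so that it does not depend on the number of lines bsub prints. This is an illustration only: parse_job_id is a hypothetical name, and the actual fix in pi2 is the C++ fragment above.

```python
import re

def parse_job_id(bsub_output):
    """Return the job ID from bsub output as a string, or None if not found.

    Scans every line instead of assuming a fixed line count, so extra
    lines such as "This job will be billed to ..." are ignored.
    """
    for line in bsub_output.splitlines():
        match = re.match(r"Job <(\d+)> is submitted to", line.strip())
        if match:
            return match.group(1)
    return None

output = ("This job will be billed to svoboda\n"
          "Job <111861622> is submitted to queue <short>.\n")
print(parse_job_id(output))  # prints 111861622
```

Searching all lines for the "Job <...>" pattern handles both the single-line output assumed originally and the two-line output seen on this cluster.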

No, they are checked, but I don't know why 2 jobs are submitted. Here is a sample of the output:

Job skipping is not allowed as there are in-place processed images that are not saved to the disk yet.
Submitting 2 jobs, each estimated to require at most 0.95 MiB of RAM...
bsub arguments: -J pi2-0-3191 -o ./lsf-io-files/pi2-0-3191-out.txt -e ./lsf-io-files/pi2-0-3191-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-0-3191-in.txt
bsub output: This job will be billed to svoboda
Job <111862112> is submitted to queue <short>.

bsub arguments: -J pi2-1-3191 -o ./lsf-io-files/pi2-1-3191-out.txt -e ./lsf-io-files/pi2-1-3191-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-1-3191-in.txt
bsub output: This job will be billed to svoboda
Job <111862113> is submitted to queue <short>.

Waiting for jobs to finish...
bjobs arguments: -X -noheader -o "STAT" 111862112
bjobs output: PEND

bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: PEND

bjobs arguments: -X -noheader -o "STAT" 111862112
bjobs output: PEND

bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: PEND

bjobs arguments: -X -noheader -o "STAT" 111862112
bjobs output: RUN

bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: PEND

[... the same two queries repeat with identical output: 111862112 is RUN, 111862113 is PEND ...]

This repeats many more times within a few seconds. Then this happens:

bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: RUN

bjobs arguments: -X -noheader -o "STAT" 111862114
bjobs output: RUN

bjobs arguments: -X -noheader -o "STAT" 111862113
bjobs output: DONE

Re-submitting failed job 1. (No error message available.)
bsub arguments: -J pi2-1-3191 -o ./lsf-io-files/pi2-1-3191-out.txt -e ./lsf-io-files/pi2-1-3191-err.txt -E hostname -q short -W 1 '/groups/svoboda/home/moharb/pi2/bin-linux64/release-nocl/pi2' ./lsf-io-files/pi2-1-3191-in.txt
bsub output: This job will be billed to svoboda
Job <111862115> is submitted to queue <short>.

bjobs arguments: -X -noheader -o "STAT" 111862114
bjobs output: RUN

bjobs arguments: -X -noheader -o "STAT" 111862115
bjobs output: PEND

It seems to be an issue with getting the last line of the output.

boazmohar commented 2 years ago

Got it. I need to add the -Ne flag to bsub so it won't dirty the .out file:

string bsubArgs = string("") + "-J " + jobName + " -o " + outputName + " -e " + errorName + " -Ne " + initStr + extraArgs(jobType) + " " + jobCmdLine;

boazmohar commented 2 years ago

@arttumiettinen I hit another roadblock. Could you help me figure out what I need to change in the function run_pi2(pi_script, output_prefix) in base.py? It still uses sbatch and not LSF. Thanks!

arttumiettinen commented 2 years ago

Sorry that it took some time to reply, but the latest commits to the experimental branch should now fix many of the problems you found above. A new version of the nr_stitcher script is now able to use an LSF cluster, too. @boazmohar, could you try the latest version to see what kind of errors we get now?

arttumiettinen commented 1 year ago

Closing this issue due to no interest from the OP.