StructuralNeurobiologyLab / SyConn

Toolkit for the generation and analysis of volume electron microscopy based synaptic connectomes of brain tissue.
http://structuralneurobiologylab.github.io/SyConn/
GNU General Public License v2.0

Problem with single node multiprocessing (changing BATCH_PROC_SYSTEM does not work) #19

Closed AldoCP closed 5 years ago

AldoCP commented 5 years ago

Apparently there is an issue when using SyConn on some qsub systems. I have been running the following command, using either the default BATCH_PROC_SYSTEM = 'SLURM' or BATCH_PROC_SYSTEM = None in the global_params.py file: python SyConn/scripts/example_run/start.py --example_cube=2 --working_dir=SyConn/scripts/example_run/wd6
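For reference, the relevant line I have been toggling in global_params.py looks roughly like this (other settings omitted):

BATCH_PROC_SYSTEM = 'SLURM'  # also tested with None; a separate run used 'QSUB'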

SyConn's verbose shows this warning:

BatchJobSystem 'SLURM' specified but failed with error 'Command 'squeue' returned non-zero exit status 127.' not found, switching to single node multiprocessing.

And, after a while, it fails with this error:

Traceback (most recent call last):
  File "SyConn/scripts/example_run/start.py", line 173, in <module>
    exec_multiview.run_glia_rendering()
  File "/home/SyConn/syconn/exec/exec_multiview.py", line 887, in run_glia_rendering
    additional_flags="--gres=gpu:1", remove_jobfolder=True)
  File "/home/SyConn/syconn/mp/batchjob_utils.py", line 154, in QSUB_script
    show_progress=show_progress)
  File "/home/SyConn/syconn/mp/batchjob_utils.py", line 581, in batchjob_fallback
    out_files), len(params)))
ValueError: 4/4 Batchjob fallback worker failed.

In an additional test, setting BATCH_PROC_SYSTEM = 'QSUB' leads to an error message soon after launching the program:

predict_dense[15839] ERROR All submitted jobs have failed. Re-submission will not be initiated. Please check your submitted code.
Traceback (most recent call last):
  File "SyConn/scripts/example_run/start.py", line 159, in <module>
    exec_dense_prediction.predict_myelin()  # myelin is not needed before `run_create_neuron_ssd`
  File "/home/SyConn/syconn/exec/exec_dense_prediction.py", line 42, in predict_myelin
    target_channels=[(1, )], target_names=['myelin'])
  File "/home/SyConn/syconn/handler/prediction.py", line 633, in predict_dense_to_kd
    n_cores=n_cores_per_job, remove_jobfolder=True)
  File "/home/SyConn/syconn/mp/batchjob_utils.py", line 351, in QSUB_script
    raise Exception(msg)
Exception: All submitted jobs have failed. Re-submission will not be initiated. Please check your submitted code.

I'm attaching the logs from the BATCH_PROC_SYSTEM = None run. Am I perhaps missing some parameter? Your help would be appreciated, thanks!

example_run.log glia_view_rendering.log create_rag.log create_sds.log dense_prediction_myelin.log

pschubert commented 5 years ago

Unfortunately I could not reproduce the described issue by running a current master build on any of our machines (example cube 1, BATCH_PROC_SYSTEM=None). I addressed a potential issue in one of the subsequent steps (3/8 - Neuron rendering) in the latest commit.

Please provide me with the following additional information - if available:

- the output of this command: python -c "import syconn; syconn.global_params.wd = 'SyConn/scripts/example_run/wd6'; print(syconn.mp.batchjob_utils.batchjob_enabled())"
- the contents of the render_views_glia_removal_folder in your working directory, including its log file
- the output of running one of the generated job scripts (e.g. sh/job_0.sh) directly in the terminal

Please keep me updated.

AldoCP commented 5 years ago

Hi Philipp,

The output of this command:

python -c "import syconn; syconn.global_params.wd = 'SyConn/scripts/example_run/wd6'; print(syconn.mp.batchjob_utils.batchjob_enabled())"

is as follows:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'syconn' has no attribute 'mp'

Regarding the second point above, these are the contents of render_views_glia_removal_folder:

$ ls SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/*
SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/render_views_glia_removal.log

SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/err:

SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/log:

SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/out:

SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/sh:
job_0.sh  job_1.sh  job_2.sh  job_3.sh

SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/storage:
job_0.pkl  job_1.pkl  job_2.pkl  job_3.pkl

And the file SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/render_views_glia_removal.log contains basically three lines:

2019-08-19 21:13:02,772 (0.0min) - render_views_glia_removal - DEBUG - Started BatchJobFallback script "render_views_glia_removal" with 4 tasks using 1 parallel jobs, each using 10 core(s).
2019-08-19 21:36:14,641 (23.2min) - render_views_glia_removal - ERROR - Errors occurred during "render_views_glia_removal".:

plus a long line that logs the errors. You can find the entire log file attached for reference.

Regarding the third point above: running this script directly in the terminal: SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/sh/job_0.sh

gives pretty much the same output as the logs above.

Thanks for the feedback!

P.S.: On another note (which is probably not affecting anything in the current issue), installing SyConn with the instructions from here was a bit tricky, apparently due to new versions of some packages. I had to use the second method "2.b)" and run this for the last step: pip install -e . --ignore-installed llvmlite. I'm attaching a list of installed packages for reference: packages_pysy4.txt

pschubert commented 5 years ago

Regarding the first point, the following snippet should work now:

python -c "import syconn.mp.batchjob_utils; syconn.global_params.wd = 'SyConn/scripts/example_run/wd6'; print(syconn.mp.batchjob_utils.batchjob_enabled())"

I double-checked the mechanism, and it will indeed be sensitive to the value of BATCH_PROC_SYSTEM in your case. Could you please attach the detailed log from render_views_glia_removal again? I can't find it in your previous post. The logs of the runs with BATCH_PROC_SYSTEM='SLURM' and 'QSUB' might also be helpful. Are both scheduling utilities installed on your system? Setting BATCH_PROC_SYSTEM=None should work in any case, even if neither of them is installed.
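A quick, generic way to check this from the same Python environment (exit status 127 from the shell usually means the command was not found on the PATH; this snippet is not part of SyConn):

import shutil
# prints the full path of each scheduler front-end, or None if it is not on the PATH
for cmd in ('squeue', 'sbatch', 'qsub', 'qstat'):
    print(cmd, shutil.which(cmd))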

AldoCP commented 5 years ago

Thanks for the reply. After pulling the latest commits, I've tested the command above on two systems: a single node with no queue handling at all, and another one with qsub. Here are the outputs:



Thanks, 
pschubert commented 5 years ago

The error is likely caused when no EGL devices are available. This case was not handled appropriately before and now leads to a ValueError. At the beginning of rendering.py the import of the EGL platform is tested, which does seem to work for you. I will need to look further into this. For now, could you please try setting PYOPENGL_PLATFORM = 'osmesa' in global_params.py instead?
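For what it's worth, PyOpenGL itself picks its backend from the PYOPENGL_PLATFORM environment variable, so as an additional sanity check (independent of SyConn's own config handling) you could set it manually before anything imports OpenGL:

import os
# must be set before the first `import OpenGL` anywhere in the process
os.environ['PYOPENGL_PLATFORM'] = 'osmesa'
from OpenGL import GL  # should now use the osmesa backend; requires libOSMesa to be installed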

Best

pschubert commented 5 years ago

Regarding https://github.com/StructuralNeurobiologyLab/SyConn/issues/19#issuecomment-524055159: The single-node response works as expected. In the qsub case it seems that Python 2 is used, which is untested / not supported.

AldoCP commented 5 years ago

Thanks for the follow-up, and sorry about the Python 2 confusion. After activating the proper conda environment (with Python 3, etc.), the qsub system gives the same output as the single node (namely, just False).

Then, after setting PYOPENGL_PLATFORM = 'osmesa' and running this command: python SyConn/examples/semseg_spine.py --kzip=1_spineexample.k.zip --working_dir=SyConn/scripts/example_run/wd003

the following error message comes up (this is the same on both single-node and qsub systems):

2019-08-22 20:25:12 login2 syconn[90067] INFO OSMESA rendering enabled.
Traceback (most recent call last):
  File "SyConn/examples/semseg_spine.py", line 2, in <module>
    from syconn.reps.super_segmentation import *
  File "/home/SyConn/syconn/reps/super_segmentation.py", line 8, in <module>
    from .super_segmentation_dataset import *
  File "/home/SyConn/syconn/reps/super_segmentation_dataset.py", line 29, in <module>
    from .super_segmentation_helper import create_sso_skeleton, associate_objs_with_skel_nodes
  File "/home/SyConn/syconn/reps/super_segmentation_helper.py", line 40, in <module>
    from ..proc.rendering import render_sso_coords
  File "/home/SyConn/syconn/proc/rendering.py", line 88, in <module>
    from .egl_ext import eglQueryDevicesEXT
  File "/home/SyConn/syconn/proc/egl_ext.py", line 32, in <module>
    from OpenGL import EGL
  File "/home/anaconda3/envs/pysy4/lib/python3.6/site-packages/OpenGL/EGL/__init__.py", line 2, in <module>
    from OpenGL.raw.EGL._types import *
  File "/home/anaconda3/envs/pysy4/lib/python3.6/site-packages/OpenGL/raw/EGL/_types.py", line 73, in <module>
    CALLBACK_TYPE = _p.PLATFORM.functionTypeFor( _p.PLATFORM.EGL )

Could it be a PyOpenGL issue?

Best regards,

pschubert commented 5 years ago

Yes, it looks like it. Could you double-check whether the error message you posted is complete? I will continue looking into it tomorrow and will hopefully get back to you the same day.
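In the meantime, reproducing the failing import from your traceback in isolation might print the complete error text:

python -c "from OpenGL import EGL"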

Best,

pschubert commented 5 years ago

The error in https://github.com/StructuralNeurobiologyLab/SyConn/issues/19#issuecomment-524067135 was probably a result of the EGL platform being used (without being set up properly), even though osmesa was enabled in the config file. The configuration mechanism of SyConn was fundamentally revised in #20, and bugs that led to misconfigurations should be fixed with the previous updates (#21). I am closing this issue for now; @AldoCP, please reopen it if mismatches or problems with the configured batch processing system persist.