Closed — AldoCP closed this issue 5 years ago.
Unfortunately I could not reproduce the described issue by running a current master build on any of our machines (example cube 1, `BATCH_PROC_SYSTEM = None`). I addressed a potential issue in one of the subsequent steps (3/8 - Neuron rendering) in the latest commit.
Please provide me with the following additional information - if available:
What is the output of:
python -c "import syconn; syconn.global_params.wd = 'SyConn/scripts/example_run/wd6'; print(syconn.mp.batchjob_utils.batchjob_enabled())"
Currently there is a naive fallback for "example runs" (triggered if `example` is part of `working_dir`) which sets the returned value to `False` (this should be the case here). Meaning, the value of `BATCH_PROC_SYSTEM` should have no influence and the error probably arises somewhere else.
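A minimal sketch of what such a fallback might look like (the names and logic here are hypothetical, for illustration only; the real implementation lives in `syconn.mp.batchjob_utils`):

```python
def batchjob_enabled(working_dir, batch_proc_system):
    """Hypothetical sketch of the fallback described above."""
    # Example runs are forced to local (fallback) processing,
    # regardless of the configured batch system.
    if "example" in working_dir:
        return False
    # Otherwise, batch processing is on iff a system is configured.
    return batch_proc_system is not None

# An example working directory always disables batch processing:
print(batchjob_enabled("SyConn/scripts/example_run/wd6", "SLURM"))  # False
```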
If a batch job fails, the folder which contains all instructions will not be deleted. Those folders are located in the `SLURM`/`QSUB`/`None` folder (depending on the value of `BATCH_PROC_SYSTEM`) inside the working directory. Are there more specific log messages in `render_views_glia_removal/render_views_glia_removal.log` for `BATCH_PROC_SYSTEM = None`, or in `predict_dense/predict_dense.log` for `BATCH_PROC_SYSTEM = 'QSUB'`?
The bash scripts executed by each worker of a "batchjob" are located inside the batch job folder (e.g. 'SyConn/scripts/example_run/wd6/None/render_views_glia_removal/sh/'). It might be helpful to run one of these scripts manually in the terminal.
Please keep me updated.
Hi Philipp,
The output of this command:
python -c "import syconn; syconn.global_params.wd = 'SyConn/scripts/example_run/wd6'; print(syconn.mp.batchjob_utils.batchjob_enabled())"
is as follows:
Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: module 'syconn' has no attribute 'mp'
Regarding the second point above, these are the contents of the `render_views_glia_removal_folder` directory:
$ ls SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/*
SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/render_views_glia_removal.log
SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/err:
SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/log:
SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/out:
SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/sh:
job_0.sh job_1.sh job_2.sh job_3.sh
SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/storage:
job_0.pkl job_1.pkl job_2.pkl job_3.pkl
And this file SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/render_views_glia_removal.log
has basically 3 lines:
2019-08-19 21:13:02,772 (0.0min) - render_views_glia_removal - DEBUG - Started BatchJobFallback script "render_views_glia_removal" with 4 tasks using 1 parallel jobs, each using 10 core(s).
2019-08-19 21:36:14,641 (23.2min) - render_views_glia_removal - ERROR - Errors occurred during "render_views_glia_removal".:
plus a long line that logs the errors. I've attached the entire log file for reference.
Re the third point above: running this script in the terminal:
SyConn/scripts/example_run/wd5/None/render_views_glia_removal_folder/sh/job_0.sh
gives pretty much the same output as the logs above.
Thanks for the feedback!
P.S.: On another note (which is probably not affecting anything in the current issue), installing SyConn with the instructions from here was a bit tricky, apparently due to new versions of some packages. I had to use the second method ("2.b)") and do this for the last step:
pip install -e . --ignore-installed llvmlite
I'm attaching a list of installed packages, for reference.
packages_pysy4.txt
Regarding the first point, the following snippet should work now:
python -c "import syconn.mp.batchjob_utils; syconn.global_params.wd = 'SyConn/scripts/example_run/wd6'; print(syconn.mp.batchjob_utils.batchjob_enabled())"
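The earlier `AttributeError` is ordinary Python behaviour: importing a package does not automatically import its submodules, which is why the fixed one-liner imports `syconn.mp.batchjob_utils` explicitly. A self-contained demo with a throwaway package (the package name is made up, but the mechanism is the same as with `syconn.mp`):

```python
import os
import sys
import tempfile

# Build a tiny package "pkg" with a submodule "sub" on the fly.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "pkg"))
open(os.path.join(tmp, "pkg", "__init__.py"), "w").close()
with open(os.path.join(tmp, "pkg", "sub.py"), "w") as f:
    f.write("VALUE = 42\n")
sys.path.insert(0, tmp)

import pkg
try:
    pkg.sub  # like `syconn.mp` after a bare `import syconn`
except AttributeError:
    print("bare `import pkg` does not load submodules")

import pkg.sub  # explicit submodule import, as in the fixed one-liner
print(pkg.sub.VALUE)  # -> 42
```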
I double checked the mechanism and it actually will be sensitive to the value of `BATCH_PROC_SYSTEM` in your case. Could you please again attach the detailed log from `render_views_glia_removal`? I can't find it in your previous post. The logs of the runs with `BATCH_PROC_SYSTEM = 'SLURM'` and `'QSUB'` might also be helpful. Are both scheduling utilities installed on your system? Setting `BATCH_PROC_SYSTEM = None` should work in any case though, even if none of them is installed.
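The thread does not show how scheduler availability could be probed; a hedged, standard-library-only sketch (the submit commands `sbatch` and `qsub` are the usual entry points of SLURM and SGE-style systems, but this is not SyConn's actual code):

```python
import shutil

def scheduler_available(batch_proc_system):
    # Hypothetical check: a scheduler counts as installed if its
    # submit command is on PATH ('sbatch' for SLURM, 'qsub' for QSUB).
    cmd = {"SLURM": "sbatch", "QSUB": "qsub"}.get(batch_proc_system)
    return cmd is not None and shutil.which(cmd) is not None

# BATCH_PROC_SYSTEM = None never requires a scheduler:
print(scheduler_available(None))  # False
```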
Thanks for the reply. After pulling the latest commits, I've tested the command above on two systems: a single node (no queue handling at all), and the other one with qsub. Here are the outputs:
On the single node (no slurm/qsub):
2019-08-22 19:58:12 ip-10-245-33-107 syconn[78762] INFO Initialized stdout logging (level: 10). Current working directory: 'SyConn/scripts/example_run/wd99'
False
On the system with qsub:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "syconn/__init__.py", line 1, in <module>
from . import extraction, handler, proc, reps
File "syconn/extraction/__init__.py", line 9, in <module>
from ..handler.logger import log_main
File "syconn/handler/__init__.py", line 8, in <module>
from ..handler.logger import log_main
File "syconn/handler/logger.py", line 8, in <module>
from ..global_params import config
File "syconn/global_params.py", line 11, in <module>
from .handler.config import DynConfig
File "syconn/handler/config.py", line 42
def entries(self) -> Any:
^
SyntaxError: invalid syntax
Thanks,
The error is likely caused if no EGL devices are available. This case was not handled appropriately yet and will now lead to a `ValueError`. At the beginning of `rendering.py` the import of the EGL platform is tested, which does seem to work for you. I will need to look further into this. For now, could you please try and set `PYOPENGL_PLATFORM = 'osmesa'` in `global_params.py` instead?
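For reference, PyOpenGL selects its rendering backend via the `PYOPENGL_PLATFORM` environment variable, which must be set before the first `import OpenGL` in the process; presumably the `global_params.py` switch maps onto the same mechanism:

```python
import os

# Must happen before PyOpenGL is imported anywhere in the process;
# otherwise the default platform (glx/egl) is already locked in.
os.environ["PYOPENGL_PLATFORM"] = "osmesa"
```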
Best
Regarding https://github.com/StructuralNeurobiologyLab/SyConn/issues/19#issuecomment-524055159: The single node response works as expected. In the qsub case it seems that python2 is used which is untested / not supported.
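The `SyntaxError` at `def entries(self) -> Any:` supports the Python 2 diagnosis: return annotations are Python 3 syntax and cannot even be parsed by Python 2. A quick sanity check to run in the active environment:

```python
import sys

# Return annotations parse fine on Python 3 but are a SyntaxError
# under Python 2, which is what the traceback above shows.
src = "def entries(self) -> int:\n    return 1\n"
compile(src, "<demo>", "exec")
print(sys.version_info[0])  # should print 3 inside the proper conda env
```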
Thanks for the follow-up and sorry about the python2 confusion. After activating the proper conda environment (with python3, etc.), the qsub system gives the same output as the single node (namely, just `False`).
Then, when applying `PYOPENGL_PLATFORM = 'osmesa'` and running this command:
python SyConn/examples/semseg_spine.py --kzip=1_spineexample.k.zip --working_dir=SyConn/scripts/example_run/wd003
the following error message comes up (this is the same on both single-node and qsub systems):
2019-08-22 20:25:12 login2 syconn[90067] INFO OSMESA rendering enabled.
Traceback (most recent call last):
File "SyConn/examples/semseg_spine.py", line 2, in <module>
from syconn.reps.super_segmentation import *
File "/home/SyConn/syconn/reps/super_segmentation.py", line 8, in <module>
from .super_segmentation_dataset import *
File "/home/SyConn/syconn/reps/super_segmentation_dataset.py", line 29, in <module>
from .super_segmentation_helper import create_sso_skeleton, associate_objs_with_skel_nodes
File "/home/SyConn/syconn/reps/super_segmentation_helper.py", line 40, in <module>
from ..proc.rendering import render_sso_coords
File "/home/SyConn/syconn/proc/rendering.py", line 88, in <module>
from .egl_ext import eglQueryDevicesEXT
File "/home/SyConn/syconn/proc/egl_ext.py", line 32, in <module>
from OpenGL import EGL
File "/home/anaconda3/envs/pysy4/lib/python3.6/site-packages/OpenGL/EGL/__init__.py", line 2, in <module>
from OpenGL.raw.EGL._types import *
File "/home/anaconda3/envs/pysy4/lib/python3.6/site-packages/OpenGL/raw/EGL/_types.py", line 73, in <module>
CALLBACK_TYPE = _p.PLATFORM.functionTypeFor( _p.PLATFORM.EGL )
Could this be a PyOpenGL issue?
Best regards,
Yes, it looks like it. Could you double check if the error message you posted is complete? I will continue looking into it tomorrow and will hopefully get back to you the same day.
Best,
The error in https://github.com/StructuralNeurobiologyLab/SyConn/issues/19#issuecomment-524067135 was probably a result of running the `egl` platform (which was not set up properly) although `osmesa` was enabled in the config file. The configuration mechanism of SyConn was fundamentally revised in #20, and bugs that led to mis-configurations should be fixed with the previous updates (#21). I am closing this issue for now; @AldoCP, please open it again if mismatches or problems with the configured batch processing system persist.
Apparently there is an issue when using SyConn on some qsub systems. I have been running the following command, using either the default `BATCH_PROC_SYSTEM = 'SLURM'` or `BATCH_PROC_SYSTEM = None` in the `global_params.py` file:
python SyConn/scripts/example_run/start.py --example_cube=2 --working_dir=SyConn/scripts/example_run/wd6
SyConn's verbose output shows this warning:
And, after a while, it fails with this error:
In an additional test, setting `BATCH_PROC_SYSTEM = 'QSUB'` leads to an error message soon after launching the program:
I'm attaching the logs from the `BATCH_PROC_SYSTEM = None` run. Am I perhaps missing some parameter? Your help will be appreciated, thanks!
example_run.log glia_view_rendering.log create_rag.log create_sds.log dense_prediction_myelin.log