OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

thread problem #1767

Closed: wangjiawen2013 closed this issue 5 years ago

wangjiawen2013 commented 6 years ago

Dear maintainers, could you spare some time to read this issue: https://github.com/tmoerman/arboreto/issues/7

I cannot solve the problem by setting

export OPENBLAS_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OMP_NUM_THREADS=4

or

export OPENBLAS_NUM_THREADS=1

Process ForkServerProcess-19:
Process ForkServerProcess-15:
Process ForkServerProcess-23:
Traceback (most recent call last):
  File "/home/wangjw/bin/miniconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/wangjw/bin/miniconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/process.py", line 170, in _run
    cls._immediate_exit_when_closed(parent_alive_pipe)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/process.py", line 147, in _immediate_exit_when_closed
    t.start()
  File "/home/wangjw/bin/miniconda3/lib/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
[the same traceback repeats for the other ForkServerProcess workers]
distributed.nanny - WARNING - Worker process 1145 exited with status 1
distributed.nanny - WARNING - Worker process 1157 exited with status 1
distributed.nanny - WARNING - Worker process 1134 exited with status 1
Process ForkServerProcess-46:
Process ForkServerProcess-48:
Traceback (most recent call last):
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/process.py", line 173, in _run
    target(*args, **kwargs)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/nanny.py", line 535, in _run
    t.start()
  File "/home/wangjw/bin/miniconda3/lib/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  [interleaved tornado/gen.py run() frames omitted]
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/nanny.py", line 543, in run
    yield worker._start(*worker_start_args)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/worker.py", line 466, in _start
    yield self._register_with_scheduler()
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/worker.py", line 308, in _register_with_scheduler
    connection_args=self.connection_args)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/comm/core.py", line 186, in connect
    quiet_exceptions=EnvironmentError)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/comm/tcp.py", line 330, in connect
    **kwargs)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/tornado/tcpclient.py", line 226, in connect
    addrinfo = yield self.resolver.resolve(host, port, af)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/tornado/netutil.py", line 378, in resolve
    None, _resolve_addr, host, port, family)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 166, in run_in_executor
    return self.asyncio_loop.run_in_executor(executor, func, *args)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/asyncio/base_events.py", line 639, in run_in_executor
    return futures.wrap_future(executor.submit(func, *args), loop=self)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 123, in submit
    self._adjust_thread_count()
  File "/home/wangjw/bin/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 142, in _adjust_thread_count
    t.start()
  File "/home/wangjw/bin/miniconda3/lib/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
distributed.nanny - WARNING - Worker process 1237 exited with status 1
distributed.nanny - WARNING - Worker process 1246 exited with status 1
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/deploy/local.py", line 217, in _start_worker
    raise gen.TimeoutError("Worker failed to start")
tornado.util.TimeoutError: Worker failed to start
[the same "Multiple exceptions in yield list" traceback repeats four more times]
Traceback (most recent call last):
  [interleaved tornado/gen.py run() frames omitted]
  File "pipeline.py", line 21, in <module>
    adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/arboreto/algo.py", line 41, in grnboost2
    early_stop_window_length=early_stop_window_length, limit=limit, seed=seed, verbose=verbose)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/arboreto/algo.py", line 109, in diy
    client, shutdown_callback = _prepare_client(client_or_address)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/arboreto/algo.py", line 157, in _prepare_client
    local_cluster = LocalCluster(diagnostics_port=None)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/deploy/local.py", line 141, in __init__
    self.start(ip=ip, n_workers=n_workers)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/deploy/local.py", line 171, in start
    self.sync(self._start, **kwargs)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/deploy/local.py", line 164, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/deploy/local.py", line 191, in _start
    yield [self._start_worker(**self.worker_kwargs) for i in range(n_workers)]
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/deploy/local.py", line 217, in _start_worker
    raise gen.TimeoutError("Worker failed to start")
tornado.util.TimeoutError: Worker failed to start
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-60, started daemon)>
[about 60 similar "reaping stray process" warnings for the other ForkServerProcess workers]
distributed.nanny - WARNING - Worker process 1282 was killed by unknown signal
[about 35 similar "killed by unknown signal" warnings for the other worker processes]
Exception ignored in: <bound method LocalCluster.__del__ of LocalCluster('tcp://127.0.0.1:43730', workers=0, ncores=0)>
Traceback (most recent call last):
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/deploy/local.py", line 340, in __del__
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/deploy/local.py", line 291, in close
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/utils.py", line 425, in run_sync
  File "/home/wangjw/bin/miniconda3/lib/python3.6/site-packages/distributed/utils.py", line 272, in sync
tornado.util.TimeoutError: timed out after 20 s.

martin-frbg commented 6 years ago

Which version of OpenBLAS are you using, and with what options was it built? I see you were told that it is an OpenBLAS problem, but none of the messages in your log is from OpenBLAS itself, so there is not much information to go on. (There were some known problems with thread memory initialization in recent versions, however.)

brada4 commented 6 years ago

Output similar to #1668 is captured in https://github.com/aertslab/pySCENIC/issues/19. EDIT: but it seems to have been addressed already over there.

brada4 commented 6 years ago

@wangjiawen2013 a couple of things to try:

1. Run single-threaded OpenBLAS by setting the variables below (that should stop the library from creating new threads, which is what the error in the pySCENIC issue shows):
   OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 python script.py
2. Could you run your Python script under strace, to 1) at least prove that OpenBLAS is part of the picture and 2) possibly discover whether "something weird" interferes or not, i.e. strace python script.py (see the sketch after this list). It will be tons of output; only attach it, do not copy-paste, and only do so if you cannot sort out what is going on yourself.
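
For a dask/forkserver pipeline like this one, a slightly expanded strace invocation is usually easier to work with. This is only a sketch: the -f (follow children) and -o (write to a file) flags are standard strace options, not something required by the suggestion above.

# follow forked/forkserver children (-f) and write the trace to a file (-o) instead of the terminal
$ strace -f -o strace.log python script.py
# afterwards, check which BLAS libraries were actually opened by the processes
$ grep -E 'openblas|mkl|atlas' strace.log | head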

brada4 commented 6 years ago

@martin-frbg I remember there is some fallback code that drops to a single thread on malloc faults. Maybe something similar could keep this running, albeit at slow speed?

martin-frbg commented 6 years ago

The pySCENIC issue looks interesting, and I found an old RedHat bug that provides some context, although it is for a completely unrelated program: https://bugzilla.redhat.com/show_bug.cgi?id=1068741. So it does seem possible that some other process is already using enough threads to get close to the default limit on the number of processes a user can start on the system. The quick solution would then probably be to increase this limit on the affected system. And indeed the behaviour of OpenBLAS needs to be changed to at least retry a failed pthread_create once after a short wait - this is one of the remnants of the original decision to just exit on any failure. (On the other hand, it looks like the original GotoBLAS2 would have just segfaulted in the same scenario.) I am not entirely sure yet whether it would be possible to adjust blas_num_threads and continue with whatever number of threads was available.
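
For reference, a quick way to inspect and, for the current shell session, raise the per-user limit on processes (threads count against it). This is a sketch assuming a bash shell; the chosen value must stay within the hard limit, and the 8192 figure is only an example.

# current soft and hard limits on the number of user processes/threads
$ ulimit -Su
$ ulimit -Hu
# raise the soft limit for this session before launching the pipeline
$ ulimit -u 8192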

brada4 commented 6 years ago

I asked for the trace because I suspect that, say, some static or "in the same directory" ATLAS or MKL is planted in the same processes, and each spins up nproc threads; then Python threading spins up the same amount for each, leading to something on the order of nproc * nproc threads. I also found a bug in RHEL limiting threads to 1024, with the obvious fix being to truncate the offending file so that the package manager leaves it alone: https://bugzilla.redhat.com/show_bug.cgi?id=919793

My calculation would put the upper limit at about 18 CPUs in the system.
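
For context, a sketch of where that limit usually comes from on RHEL/CentOS and how it is typically raised persistently. The username and value below are illustrative placeholders, not taken from this report.

# the default cap on user processes on RHEL/CentOS lives here
$ cat /etc/security/limits.d/20-nproc.conf
# to raise it, add an override (format: <domain> <type> <item> <value>),
# e.g. in /etc/security/limits.conf:
#   <username>  soft  nproc  65535
# then start a fresh login session for the new limit to apply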

martin-frbg commented 6 years ago

Depends on what else is running on that system (I think none of the several issues opened for this problem mentions whether this is happening on a desktop system, a compute cluster, or whatever?). If setting OPENBLAS_NUM_THREADS=1 really does not work, perhaps there are already so many threads from tornado/pyscenic or whatever running that OpenBLAS cannot even start a single worker thread.

brada4 commented 6 years ago

distributed == MPI, multiprocessing == OMP. Both together, plus pthreads, could mean nproc ** 3 threads... Some details on the processor setup could shed some light...
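
To make that arithmetic concrete (a sketch only; the 64-core count is an assumed example, not the reporter's actual hardware):

# three layers, each spawning one worker/thread per core:
# dask-distributed workers x multiprocessing/OpenMP x OpenBLAS pthreads
$ echo $(( 64 * 64 * 64 ))
262144
# even two layers already saturate the default 4096-process ulimit on a 64-core box
$ echo $(( 64 * 64 ))
4096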

ghuls commented 6 years ago

It can be that you are hitting the limit on open files. I have personally experienced problems with this with programs that spawn many threads (not related to OpenBLAS) on systems that have a lot of cores:

# Check soft limit of open files allowed:
$ ulimit -Sn
1024

# Check hard limit of open files allowed:
$ ulimit -Hn
4096

# Set soft limit value to hard limit:
$ ulimit -n 4096

# Check new soft limit of open files.
$ ulimit -Sn
4096

If you need more open files than your hard limit, check the following links to increase it:

https://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
https://superuser.com/questions/1200539/cannot-increase-open-file-limit-past-4096-ubuntu/1200818#1200818
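
A sketch of the kind of persistent override those links describe (the username and values are illustrative placeholders; systemd-based distributions may need additional settings, as the second link explains):

# /etc/security/limits.conf (or a drop-in file under /etc/security/limits.d/)
#   <domain>    <type>  <item>   <value>
<username>      soft    nofile   65535
<username>      hard    nofile   65535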

brada4 commented 6 years ago

Reading the comment in the file that limits the number of processes to 4096 (20-nproc.conf, installed by the "pam" package): it is meant to prevent accidental fork bombs, which rather applies to the case of having three threading layers each spawning nproc instances of the next layer. OpenBLAS is designed to run one thread per CPU core, as it does a lot to optimize memory accesses around the caches between the CPU and RAM. If you share a cache between two (or thousands of) threads, you get main-memory accesses where L1d accesses were assumed, and lose something like 90% of performance.

... That is, one thread on a core will finish in 1 minute, while 2 threads on that core, each doing half of the same task, will finish in 10 minutes; a bit over-dramatized, but it is always much worse than plain linear degradation. Actually, even gcc optimizes memory accesses assuming the cache configuration of the default CPU it is configured for, like 512 KB of L3 for early amd64/em64t, so you don't feel the effect as early on modern CPUs with significantly bigger caches.

brada4 commented 6 years ago

@ghuls yes, that applies too, but we are dealing with a resource hog first.

brada4 commented 6 years ago

@wangjiawen2013 how is it going? Did you try single-thread openblas in the meantime? Any ideas from syscall trace?

wangjiawen2013 commented 6 years ago

@wangjiawen2013 how is it going? Did you try single-thread openblas in the meantime? Any ideas from syscall trace?

It doesn't work either. Perhaps pyscenic is immature; I have decided to wait a few days to see whether there are alternative methods for handling my data.

martin-frbg commented 6 years ago

Unfortunately there is still very little for anybody to go on - we still do not know the OpenBLAS version, your current ulimit settings, or the hardware you want to run this on. (If you are submitting jobs to a compute cluster, perhaps there are other settings to adjust before ulimit will work.) Setting OPENBLAS_NUM_THREADS=1 should have worked, unless there is something in pyscenic that sets it to a higher value again. As a last resort, installing (which may mean building your own version of) an OpenBLAS without multithreading support should work (if you can make sure that this version is picked up by pyscenic).
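
For what it is worth, a sketch of building such a single-threaded OpenBLAS from source (USE_THREAD=0 is the documented switch for disabling threading; the install prefix is just an example):

# build OpenBLAS with threading disabled
$ make USE_THREAD=0
# install into a private prefix, then point numpy/scipy (and hence pyscenic) at this copy
$ make USE_THREAD=0 PREFIX=$HOME/openblas-single install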

martin-frbg commented 5 years ago

Apparently fixed in miniconda now (according to resolution in pyscenic issue 22 linked above)

martin-frbg commented 5 years ago

Thread starter finally updated the pyscenic ticket to confirm he needed to increase his system limit on the number of processes (nproc).