NLeSC / noodles

Computational workflow engine, making distributed computing in Python easy!
http://nlesc.github.io/noodles
Apache License 2.0

Threads contention #39

Closed felipeZ closed 6 years ago

felipeZ commented 8 years ago

I have noticed that external tasks invoked with subprocess show a significant delay between the time they are scheduled by Noodles and the time they actually start to run. Noodles reports something similar to the following:

[15:39:15] Job cp2k_job started
[15:39:15] Renaming job cp2k_job to cp2k_job.007
[15:39:30] Job cp2k_job.007 finished with status 'successful' 
[15:39:31] Job cp2k_job started
[15:39:31] Renaming job cp2k_job to cp2k_job.008
[15:39:39] Job cp2k_job.008 finished with status 'successful' 
[15:39:40] Job cp2k_job started
[15:39:40] Renaming job cp2k_job to cp2k_job.009
[15:39:54] Job cp2k_job.009 finished with status 'successful' 
[15:39:54] Job cp2k_job started
[15:39:54] Renaming job cp2k_job to cp2k_job.010
[15:40:03] Job cp2k_job.010 finished with status 'successful'

It seems that the jobs are scheduled immediately. But when I measure the time spent inside the functions and compare it with the time reported by the external package, I get the following times (in seconds):

function call    reported by package
         15.2                   11.2
          9.0                    5.0
         14.6                   11.1
          8.6                    5.0
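A minimal sketch of this kind of measurement, timing the whole Python call around a subprocess invocation (the short sleep is a hypothetical stand-in for the real cp2k job; anything above the child's own runtime is overhead on the caller's side):

```python
import subprocess
import sys
import time

def timed_call(cmd):
    """Wall-clock time of the whole function call, including any
    scheduling overhead before the child process starts."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# Hypothetical stand-in for the cp2k invocation: the child itself
# runs for ~0.5 s, so the difference between `wall` and 0.5 s is
# overhead added around the external process.
wall = timed_call([sys.executable, "-c", "import time; time.sleep(0.5)"])
print(f"function call: {wall:.1f} s")
```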

It seems that the jobs are not scheduled immediately one after the other; there is some delay. I ran cProfile, but it only shows the obvious: most of the time the threads are locked, waiting for the external packages to finish.

The result provided by cProfile is

         798506 function calls (747066 primitive calls) in 801.639 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    391/1    0.015    0.000  801.641  801.641 {built-in method builtins.exec}
        1    0.000    0.000  801.641  801.641 script_workflow.py:1(<module>)
        1    0.000    0.000  800.791  800.791 script_workflow.py:12(remote_function)
        1    0.000    0.000  800.405  800.405 workflow_coupling.py:25(generate_pyxaid_hamiltonians)
        1    0.000    0.000  800.360  800.360 packages.py:171(run)
        1    0.000    0.000  800.360  800.360 packages.py:187(call_default)
        1    0.000    0.000  800.360  800.360 run_with_prov.py:256(run_parallel_opt)
        1    0.001    0.001  800.355  800.355 scheduler.py:71(run)
       32    0.001    0.000  800.176   25.006 run_with_prov.py:137(f)
      827    0.002    0.000  799.860    0.967 queue.py:147(get)
       32    0.000    0.000  799.858   24.996 haploid.py:179(junction)
       32    0.000    0.000  799.857   24.996 thread_pool.py:27(fn)
       32    0.000    0.000  799.856   24.996 queue.py:28(source)
       33    0.000    0.000  799.856   24.238 threading.py:261(wait)
     1882  799.856    0.425  799.856    0.425 {method 'acquire' of '_thread.lock' objects}
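The dominant row, {method 'acquire' of '_thread.lock' objects}, is simply what blocked waiting looks like under cProfile: a thread parked on an empty queue accumulates its time inside the waiter lock's acquire. A small self-contained sketch (plain stdlib, not Noodles code) reproduces the pattern:

```python
import cProfile
import io
import pstats
import queue
import threading

q = queue.Queue()
# A producer that delivers a result after half a second, so the
# consumer below spends ~0.5 s blocked on the empty queue.
threading.Timer(0.5, q.put, args=(None,)).start()

prof = cProfile.Profile()
prof.enable()
q.get()  # blocks; the wait happens inside a _thread.lock acquire
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("tottime").print_stats(5)
print(out.getvalue())
# The top entry is "{method 'acquire' of '_thread.lock' objects}",
# even though the thread is merely idle, not doing lock-heavy work.
```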

Any suggestion about what is causing the delay?

jhidding commented 8 years ago

The messages that you are showing are PLAMS messages. Can you see in the profiler which '_thread.lock' instances are eating most of the time? Noodles uses locks when writing to the cache (through TinyDB); if there is a lot of data there, that could be a bottleneck. Edit: but four seconds is way too long for that.
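One caveat: cProfile aggregates every _thread.lock.acquire into a single row, so it cannot name individual lock instances. pstats.Stats.print_callers can at least show which call sites funnel into the lock. A self-contained sketch (the short queue wait exists only to generate a profile to inspect):

```python
import cProfile
import io
import pstats
import queue
import threading

# Generate a small profile that contains time blocked in lock.acquire.
q = queue.Queue()
threading.Timer(0.2, q.put, args=(None,)).start()
prof = cProfile.Profile()
prof.enable()
q.get()  # blocks ~0.2 s inside the condition's waiter lock
prof.disable()

# print_callers lists, per caller, how much time flowed into acquire,
# which narrows down *where* the locking happens even without access
# to the individual lock instances.
out = io.StringIO()
pstats.Stats(prof, stream=out).print_callers("acquire")
print(out.getvalue())
```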

felipeZ commented 8 years ago

According to cProfile, most of the time is spent in thread_pool, which calls the wait method of the Condition class; that wait is responsible for most of the time consumption.