SrinivasMushnoori / repex

An implementation of the RepEx package as an application written in the EnTK API
MIT License

Async terminates immediately after one cycle. #29

Closed SrinivasMushnoori closed 5 years ago

SrinivasMushnoori commented 5 years ago

Async (Sliding Window implementation) appears to run to completion, but I do not find any pipelines ever getting suspended.

Terminal output:

submit 4 unit(s)
        ....                                                                  ok
Update: Task task.0001 in state SUBMITTED
Update: Task task.0002 in state SUBMITTED
Update: Task task.0003 in state SUBMITTED
Update: Task task.0004 in state SUBMITTED
Update: Task task.0001 in state EXECUTED
Update: Task task.0001 in state DEQUEUEING
Update: Task task.0001 in state DEQUEUED
Update: Task task.0001 in state DONE
Update: Task task.0003 in state EXECUTED
Update: Stage stage.0001 in state DONE
Update: Pipeline pipeline.0001 in state DONE
Update: Task task.0003 in state DEQUEUEING
Update: Task task.0003 in state DEQUEUED
Update: Task task.0004 in state EXECUTED
Update: Task task.0003 in state DONE
Update: Stage stage.0003 in state DONE
Update: Pipeline pipeline.0003 in state DONE
Update: Task task.0004 in state DEQUEUEING
Update: Task task.0004 in state DEQUEUED
Update: Task task.0004 in state DONE
Update: Stage stage.0004 in state DONE
Update: Pipeline pipeline.0004 in state DONE
Update: Task task.0002 in state EXECUTED
Update: Task task.0002 in state DEQUEUEING
Update: Task task.0002 in state DEQUEUED
Update: Task task.0002 in state DONE
Update: Stage stage.0002 in state DONE
Update: Pipeline pipeline.0002 in state DONE
wait for 1 pilot(s)
                                                                              ok
closing session re.session.mcewan.engr.rutgers.edu.scm177.017942.0006          \
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 96.3s                                                       ok

None of the pipelines are ever suspended or resumed. Secondly, the exchange task (which, given the way we have set up the replicas, should be task.0005) does not even spawn. I dug into the sandbox and, sure enough, it isn't there.

So basically: MD stage 1 executes on all pipelines just fine, then the pipeline just....stops.
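One way to confirm this from the terminal output rather than by eye is to collect the states reported for each pipeline. The snippet below is a minimal sketch: it assumes the "Update: <kind> <uid> in state <STATE>" line format shown above, and it assumes that a suspended pipeline would be reported with the state name "SUSPENDED". The function names are made up for illustration and are not part of EnTK or RepEx.

```python
import re
from collections import defaultdict

# Matches lines such as "Update: Pipeline pipeline.0001 in state DONE".
UPDATE_RE = re.compile(r"^Update: (Task|Stage|Pipeline) (\S+) in state (\S+)$")


def states_seen(log_text):
    """Map each (kind, uid) pair to the ordered list of states reported for it."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        match = UPDATE_RE.match(line.strip())
        if match:
            kind, uid, state = match.groups()
            seen[(kind, uid)].append(state)
    return dict(seen)


def suspended_pipelines(log_text, suspended_state="SUSPENDED"):
    """Return the uids of pipelines that ever reported the suspended state.

    "SUSPENDED" is an assumed state name; adjust it if the log uses another.
    """
    return sorted(
        uid
        for (kind, uid), states in states_seen(log_text).items()
        if kind == "Pipeline" and suspended_state in states
    )


# A few lines taken from the terminal output above.
sample = """\
Update: Task task.0001 in state SUBMITTED
Update: Task task.0001 in state EXECUTED
Update: Task task.0001 in state DONE
Update: Stage stage.0001 in state DONE
Update: Pipeline pipeline.0001 in state DONE
"""

print(suspended_pipelines(sample))
```

An empty list for the full log would match what is described here: every pipeline goes straight to DONE without ever being suspended.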

SrinivasMushnoori commented 5 years ago

Testing has revealed that the pipeline.suspend() statement is being completely bypassed/ignored, as is evident from this testing script.

This is an EnTK-level issue and is being tracked here.

SrinivasMushnoori commented 5 years ago

This SEEMS to have been fixed at the EnTK level, but there is a change in the adaptivity API. Being investigated.

SrinivasMushnoori commented 5 years ago

Update: the following error is now raised in an endless loop:

2019-04-12 15:26:44,229: radical.entk.wfprocessor.0001: wfprocessor                     : dequeue-thread : ERROR   : Execution failed in post_exec of stage stage.0004
Traceback (most recent call last):
  File "/home/scm177/VirtualEnvs/Env_RepEx/local/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 388, in _dequeue
    resumed_pipes = stage.post_exec['function'](*stage.post_exec['args'])
  File "async_sliding_window.py", line 414, in _after_md
    self._check_ex(self)
  File "async_sliding_window.py", line 201, in _check_exchange
    self._exchange_list = self._sliding_window(self._sorted_waitlist, self._exchange_size, self._window_size)
  File "async_sliding_window.py", line 273, in _sliding_window
    rid_start = replica.rid - window_size/2 # "replica" here is for some reason being seen as a list type object.
AttributeError: 'list' object has no attribute 'rid'
2019-04-12 15:26:44,230: radical.entk.wfprocessor.0001: wfprocessor                     : dequeue-thread : ERROR   : Unable to receive message from completed queue: 'list' object has no attribute 'rid'
Traceback (most recent call last):
  File "/home/scm177/VirtualEnvs/Env_RepEx/local/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 388, in _dequeue
    resumed_pipes = stage.post_exec['function'](*stage.post_exec['args'])
  File "async_sliding_window.py", line 414, in _after_md
    self._check_ex(self)
  File "async_sliding_window.py", line 201, in _check_exchange
    self._exchange_list = self._sliding_window(self._sorted_waitlist, self._exchange_size, self._window_size)
  File "async_sliding_window.py", line 273, in _sliding_window
    rid_start = replica.rid - window_size/2 # "replica" here is for some reason being seen as a list type object.
AttributeError: 'list' object has no attribute 'rid'
2019-04-12 15:26:44,230: radical.entk.wfprocessor.0001: wfprocessor                     : dequeue-thread : ERROR   : Error in dequeue-thread: 'list' object has no attribute 'rid'
Traceback (most recent call last):
  File "/home/scm177/VirtualEnvs/Env_RepEx/local/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 388, in _dequeue
    resumed_pipes = stage.post_exec['function'](*stage.post_exec['args'])
  File "async_sliding_window.py", line 414, in _after_md
    self._check_ex(self)
  File "async_sliding_window.py", line 201, in _check_exchange
    self._exchange_list = self._sliding_window(self._sorted_waitlist, self._exchange_size, self._window_size)
  File "async_sliding_window.py", line 273, in _sliding_window
    rid_start = replica.rid - window_size/2 # "replica" here is for some reason being seen as a list type object.
AttributeError: 'list' object has no attribute 'rid'

This occurs after the first replica completes MD.
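For reference, this kind of AttributeError typically shows up when the waitlist handed to the window function holds a nested list (for example, a list appended where it should have been extended) instead of replica objects. The sketch below is not the code in async_sliding_window.py: the Replica class, the selection rule, and the up-front check are all assumptions made for illustration. It only shows how validating the input gives a clearer failure than the traceback above. It also uses `//` so the half-window offset stays an integer under both Python 2 and Python 3.

```python
class Replica(object):
    """Hypothetical stand-in for a replica; only the `rid` attribute is modelled."""

    def __init__(self, rid):
        self.rid = rid


def sliding_window(sorted_waitlist, exchange_size, window_size):
    """Return the first group of waiting replicas whose rids fit in one window.

    Illustrative selection rule (an assumption, not the RepEx algorithm):
    centre a window of `window_size` rids on each waiting replica in turn and
    return the first `exchange_size` replicas that fall inside it.
    """
    # Fail early with a clear message if the waitlist holds non-replica items,
    # e.g. a nested list, instead of failing deep inside the loop.
    bad = [item for item in sorted_waitlist if not hasattr(item, "rid")]
    if bad:
        raise TypeError(
            "waitlist must contain replicas, found %s; is the list nested?"
            % type(bad[0]).__name__
        )

    for replica in sorted_waitlist:
        rid_start = replica.rid - window_size // 2  # integer division on purpose
        rid_end = rid_start + window_size
        window = [r for r in sorted_waitlist if rid_start <= r.rid < rid_end]
        if len(window) >= exchange_size:
            return window[:exchange_size]
    return []


flat = [Replica(0), Replica(1), Replica(5)]
print([r.rid for r in sliding_window(flat, 2, 4)])

try:
    sliding_window([flat], 2, 4)  # nested by mistake: a list inside the waitlist
except TypeError as err:
    print("rejected:", err)
```

With the check in place, a nested waitlist is rejected on entry with a message naming the offending type, which points at the caller that built the list rather than at the arithmetic on `rid`.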

SrinivasMushnoori commented 5 years ago

The above issue has been fixed.