cornell-zhang / datuner

DATuner Repository
BSD 3-Clause "New" or "Revised" License

Dispy parallelization issue #31

Open hecmay opened 6 years ago

hecmay commented 6 years ago

When launching multiple independent DATuner workflows sequentially on different machines, only the first one works; all the other flows simply hang as follows, with no further response.

2018-08-19 20:35:18 dispynode - dispynode version: 4.9.1, PID: 25844
2018-08-19 20:35:18 pycos - version 4.8.0 with epoll I/O notifier
2018-08-19 20:35:18 dispynode - "en-ec-zhang08.coecis.cornell.edu" serving 8 cpus
2018-08-19 20:35:18 dispy - Transfer of computation "tune_function" to 128.84.48.155 failed
2018-08-19 20:35:18 dispy - Failed to setup 128.84.48.155 for compute "tune_function": -1
2018-08-19 20:35:18 dispynode - Computation "1534725316722" is not valid
jg925 commented 6 years ago

In order to run multiple independent DATuner instances on the same server, regardless of user, dispyscheduler needs to be implemented. Currently, DATuner only allows parallelization within a single instance, through the '-p' argument (with a maximum of 8, the total number of cpus) and through the configuration file indicating how many machines will be used. (If 2 machines are used, the maximum for '-p' increases to 16.)

hecmay commented 6 years ago

Sorry for the missing words. I was trying to launch multiple independent DATuner instances on multiple machines, and only the first instance works.

hecmay commented 6 years ago

The error is most likely caused by communication conflicts on the local area network when multiple dispy job clusters coexist. A possible solution is to add a unique identifier to each cluster with the --secret parameter, like:

# assign a uuid to each cluster
import uuid
import dispy

secret = uuid.uuid4()
cluster = dispy.JobCluster(tune_function,
                           depends = ['package.zip'],
                           secret  = str(secret),
                           cleanup = False)

and assign the same uuid to the servers (running dispynode.py) in datuner.py:

  for i in range(len(machines)):
    machine_addr = machines[i % len(machines)]

    # copy dispynode.py to the worker machine's workspace
    subprocess.call(['scp', DATUNER_HOME + '/releases/Linux_x86_64/install/bin/dispynode.py',
                     machine_addr + ':' + workspace])
    # start dispynode on the worker over ssh, using the same secret as the JobCluster
    sshProcess = subprocess.Popen(['ssh', machine_addr],
                                  stdin=subprocess.PIPE,
                                  stdout=subprocess.PIPE,
                                  universal_newlines=True,
                                  bufsize=0)
    sshProcess.stdin.write("cd " + workspace + "\n")
    sshProcess.stdin.write("python dispynode.py --serve 1 --clean --secret " + str(secret) +
                           " --dest_path_prefix dispytmp_" + str(i) + "\n")
    sshProcess.stdin.close()
zhangzhiru commented 6 years ago

@Hecmay Can you elaborate? What does communication conflict mean? Also, is "--secret" a standard option defined for dispy? If so, what is the typical usage of this option according to the dispy documentation?

zhangzhiru commented 6 years ago

I also notice that we are hardcoding "/releases/Linux_x86_64/install/bin/" in the program. This piece of code will break our tool when it is compiled on a 32-bit machine.
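
One way to avoid the hardcoded directory, as a rough sketch (release_dir and dispynode_path are illustrative names, and this assumes the release folders follow the same <system>_<machine> naming as Linux_x86_64):

import platform

# build the release path from the host platform instead of hardcoding Linux_x86_64
release_dir = platform.system() + '_' + platform.machine()   # e.g. 'Linux_x86_64', 'Linux_i686'
dispynode_path = DATUNER_HOME + '/releases/' + release_dir + '/install/bin/dispynode.py'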

hecmay commented 6 years ago
  1. Communication conflict means that the server (worker) nodes cannot correctly match with, and receive computation tasks from, the client (master) node they belong to, because more than one cluster exists on the local area network without valid information to tell their identities apart.

  2. secret is a pre-defined standard option in dispy/dispynode. From the dispy (master) documentation:

    secret is a string that is (hashed and) used for handshaking of communication with nodes; i.e., this cluster will only work with nodes that use same secret.

and the secret of the dispynode (worker) nodes is assigned with the --secret parameter, e.g. python dispynode.py --serve 1 --clean --secret 1111
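
As a minimal standalone sketch (not DATuner code; compute here is just a placeholder function), the client and the nodes only pair up when both sides use the same secret:

# client side: only nodes started with the same secret will be matched
import dispy

def compute(n):
    return n * n

cluster = dispy.JobCluster(compute, secret='1111')
jobs = [cluster.submit(i) for i in range(4)]
print([job() for job in jobs])

# worker side, started separately on each machine:
#   python dispynode.py --clean --secret 1111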

zhangzhiru commented 6 years ago

Can we add a unit test to make sure we don't run into the same issue again?

hecmay commented 6 years ago

Sure, the unit test for this case can be implemented by setting up the CircleCI configuration.

ecenurustun commented 6 years ago

What might be the cause of the following error? This happens even when I run a single DATuner instance on a single machine.

Traceback (most recent call last):
  File "dispynode.py", line 1988, in <module>
    _dispy_node = _DispyNode(**_dispy_config)
  File "dispynode.py", line 232, in __init__
    addrinfo = dispy.host_addrinfo(host=ip_addr)
AttributeError: 'module' object has no attribute 'host_addrinfo'

hecmay commented 6 years ago

@eu49 Could you create a minimal test case to reproduce the issue?

zhangzhiru commented 6 years ago

Or tell Shaojie how to reproduce this issue with your current design.

ecenurustun commented 6 years ago

I was running the machine-learning-based VTR flow (vtr-ml-tc). It doesn't matter which design you tune, but you can try ch_intrinsics. The sample configuration file (vtr_sample0.py) is under the cfg folder.

hecmay commented 6 years ago

CircleCI 2.0 no longer supports SSH between nodes: https://discuss.circleci.com/t/ssh-between-machines/13828

So the unit test for this case cannot be implemented on the CircleCI platform. As a workaround, we can create the test on our own server using xdist (the pytest distributed-testing plugin).
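
For reference, a rough sketch of what such a test could look like with pytest-xdist; the test file name, the config paths, and the assumption that datuner.py can be driven this way from the command line are all hypothetical and would need to be adapted to the real CLI:

# tests/test_parallel_instances.py  (hypothetical file name; sketch only)
import subprocess

def _launch(cfg):
    # assumes datuner.py is driven by a config file argument; adjust to the real CLI
    return subprocess.call(['python', 'datuner.py', '-f', cfg])

def test_instance_a():
    assert _launch('cfg/vtr_sample0.py') == 0

def test_instance_b():
    assert _launch('cfg/vtr_sample1.py') == 0

# run both tests concurrently with two xdist workers:
#   pytest -n 2 tests/test_parallel_instances.py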

ecenurustun commented 6 years ago

@Hecmay The parallelization fix works fine for ~30 iterations, but then the running jobs stop with the following error message. (The error message may not pop up immediately after the jobs stop functioning.)

failed to set up [node/ip address] for compute

Could you try to reproduce this bug by testing with higher budget (~1000) and timeout (1d)?

hecmay commented 6 years ago

@eu49 I cannot reproduce the error you mentioned. But when running DATuner instances in parallel, the running job of one DATuner instance does sometimes stop after some iterations, and then keeps printing out the job status until the time limit is reached:

job#: 0                                                                         
Total time elapsed: 20171.812294                       
Total time elapsed: 20181.176986

And the other DATuner instance works fine.

But this issue does not happen in every trial. I will run more trials to confirm it.

jg925 commented 6 years ago

@Hecmay When I ran datuner-ml in the past, I ran into the same problem as @eu49. Out of curiosity, which instance of DATuner stopped and which one ran to completion? Was the one that was started first the one that stopped? The error message is shown only if the debug options are set on dispynode.py (-d) and on the JobCluster (loglevel=dispy.logger.DEBUG).

hecmay commented 6 years ago

@jg925 Are you running datuner-ml with the uuid identifier for each cluster?

In my first trial run, the DATuner instance launched first stopped working.

But I failed to reproduce the same issue in the two later trials (in which everything worked well), and I am still running additional trials on zhang-08 / zhang-09 now.

ecenurustun commented 6 years ago

@Hecmay Could you reproduce the problem? Are you running datuner-ml? I did my experiments on vtr-ml-tc, with a budget of 1000 and a timeout of 1d. It doesn't take a day to encounter the bug, though; you should be able to observe it within around 30 iterations, which doesn't take hours.

hecmay commented 6 years ago

@eu49 Partially. In my case, the running job stopped without any error message (in debug mode).

zhangzhiru commented 6 years ago

@Hecmay any progress on fixing the problem? If dispy is too buggy, we may eventually have to consider dropping this package and using a different framework instead.

hecmay commented 6 years ago

@zhangzhiru The dispynode seems to be removed by the client (master) in a LAN with multiple clusters, before the node could reply to the master with valid results:

epoch: 500
runs_per_epoch: 2
job#: 0
job#: 1
2018-09-04 17:43:34 dispy - Running job 139895253823568 on 128.84.48.154

2018-09-04 20:32:19 dispy - Node 128.84.48.155 is not responding; removing it (3.0, 1536106700.07, 1536107539.63)
2018-09-04 20:32:19 dispy - Job 140259948089424 scheduled on 128.84.48.155 abandoned
2018-09-04 20:32:19 dispy - Job 140259926474320 scheduled on 128.84.48.155 abandoned
2018-09-04 20:32:19 dispy - Job 140259926473552 scheduled on 128.84.48.155 abandoned
2018-09-04 20:32:19 dispy - Ignoring invalid reply for job 140259926474320 from 128.84.48.155
2018-09-04 20:32:19 dispy - Ignoring invalid reply for job 140259926473552 from 128.84.48.155
2018-09-04 20:32:19 dispy - Ignoring invalid reply for job 140259948089424 from 128.84.48.155

I plan to try some potential solutions before looking for alternative frameworks:

  1. change the timeout and add a reentrant mechanism, which lets the master re-assign tasks to other nodes when a particular node fails to respond and is removed (see the sketch after this list)
  2. use a shared scheduler (dispyscheduler / SharedJobCluster) to replace the per-instance JobCluster in a multi-cluster network
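
For option 1, a sketch of the client-side change, continuing the JobCluster call from the earlier snippet and assuming the reentrant / pulse_interval / ping_interval keyword arguments of dispy 4.9:

import dispy

# option 1 (sketch): relax the liveness checks and make jobs reentrant, so that
# jobs on a node that stops responding are re-scheduled instead of abandoned
cluster = dispy.JobCluster(tune_function,
                           depends = ['package.zip'],
                           secret  = str(secret),
                           cleanup = False,
                           reentrant = True,       # re-run jobs from removed nodes
                           pulse_interval = 60,    # seconds between node keep-alives
                           ping_interval = 60)
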
hecmay commented 6 years ago
  1. The client failed to receive the UDP reply packets from the node before removing it, and any reply packets received afterwards are regarded as invalid because the node has already been marked as dead and removed from the cluster by the client.

  2. A possible explanation: the OS UDP buffer might be filled up with reply packets from other clusters. Failing to process these packets, the client hangs indefinitely. By quitting with Ctrl+C and letting new packets in, the receiving process of the dispy client resumes and messages start printing out again.

  3. Solution: adding a dispyscheduler between the dispy client and the dispynodes should resolve the issue. The dispyscheduler establishes a stable TCP connection with the client, collects UDP packets from the local network, and schedules the jobs to the nodes on behalf of the dispy client (rough sketch below). Another benefit is that the nodes can execute workloads from different clusters in different cycles if secret is not set.
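
A rough sketch of that setup (the scheduler host is a placeholder, and the exact SharedJobCluster arguments should be double-checked against the dispy version we ship): dispyscheduler.py runs once somewhere on the LAN, and each DATuner instance submits through it with SharedJobCluster instead of creating its own JobCluster:

# on one machine in the LAN, start the shared scheduler once:
#   python dispyscheduler.py --clean
import dispy

cluster = dispy.SharedJobCluster(tune_function,
                                 depends = ['package.zip'],
                                 scheduler_node = 'zhang-08',   # placeholder: host running dispyscheduler.py
                                 cleanup = False)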

zhangzhiru commented 6 years ago

@Hecmay have you pushed your fix?

hecmay commented 6 years ago

@zhangzhiru The fix has been pushed to the datuner-ml repo (the machine learning version) hosted on github.coecis.cornell.edu. Do I also need to push the fix to this repo?

ecenurustun commented 6 years ago

@Hecmay The fix you saw in datuner-ml was pushed by Jinny. She tried using dispyscheduler to help fix the issue, but did not finalize the implementation, as she is helping with the Quartus machine learning part. Did you test her implementation? She said it is not working properly. Could you please push your fix to this repo? Thanks.

hecmay commented 6 years ago

@eu49 OK. I suggest not pushing to the repository before making sure the program functions well (since the unit test for the parallelization issue is not available on CircleCI).

I have not tested her implementation yet, and I may remove parts of it.

jg925 commented 6 years ago

@Hecmay I had accidentally included the dispyscheduler changes in the commit with the database updates for machine learning. I was going to make another commit to revert the dispyscheduler implementation attempt, but if you will be modifying it to what you had before, please go ahead. Sorry for the confusing commit to datuner-ml.

hecmay commented 6 years ago

@jg925 No worries. You may keep the dispyscheduler implementation and make your commit first. I will re-modify the dispyscheduler part later.