hecmay opened this issue 6 years ago
In order to be able to run multiple independent DATuner instances on the same server regardless of user, dispyscheduler needs to be implemented. Currently, DATuner only allows for parallelization among one instance through the '-p' argument (letting the maximum be 8 -- the max number of cpus total) and through the configuration file indicating how many machines will be used. (If 2 machines are used, the maximum for '-p' increases to 16.)
Sorry for the missing words. I was trying to launch multiple independent DATuner instances on multiple machines, and only the first instance works.
The error is most likely caused by communication conflicts in the local area network when multiple dispy job clusters exist. A possible solution is to add a unique identifier to each cluster with the --secret parameter, like:
# assign a uuid for each cluster
import uuid
secret = uuid.uuid4()
cluster = dispy.JobCluster(tune_function,
                           depends=['package.zip'],
                           secret=str(secret),
                           cleanup=False)
and assign the same uuid to the servers (running dispynode.py) in datuner.py:
for i in range(len(machines)):
    machine_addr = machines[i % len(machines)]
    subprocess.call(['scp',
                     DATUNER_HOME + '/releases/Linux_x86_64/install/bin/dispynode.py',
                     machine_addr + ':' + workspace])
    sshProcess = subprocess.Popen(['ssh', machine_addr],
                                  stdin=subprocess.PIPE,
                                  stdout=subprocess.PIPE,
                                  universal_newlines=True,
                                  bufsize=0)
    sshProcess.stdin.write("cd " + workspace + "\n")
    sshProcess.stdin.write("python dispynode.py --serve 1 --clean --secret " + str(secret) +
                           " --dest_path_prefix dispytmp_" + str(i) + "\n")
    sshProcess.stdin.close()
@Hecmay Can you elaborate? What does communication conflict mean? Also, is "--secret" a standard option defined for dispy? If so, what is the typical usage of this option according to the dispy documentation?
I also notice that we are hardcoding "/releases/Linux_x86_64/install/bin/" in the program. This piece of code will break our tool when it's compiled on a 32-bit machine.
Communication conflict means that the server (worker) nodes cannot correctly match with, and receive computation tasks from, the client (master) node they belong to, since more than one cluster exists on the local area network without valid information to determine their identities.
secret is a pre-defined standard option in dispy/dispynode. From the dispy (master) documentation: secret is a string that is (hashed and) used for handshaking of communication with nodes; i.e., this cluster will only work with nodes that use the same secret.
and the secret of the dispynode (worker) nodes is assigned with the --secret parameter, e.g.:
python dispynode.py --serve 1 --clean --secret 1111
Can we add a unit test to make sure we don't run into the same issue again?
Sure, the unit test for this case can be implemented by setting up the CircleCI configuration.
What might be the cause of the following error? This happens even when I work on a single machine for a single datuner instance.
Traceback (most recent call last):
  File "dispynode.py", line 1988, in <module>
    _dispy_node = _DispyNode(**_dispy_config)
  File "dispynode.py", line 232, in __init__
    addrinfo = dispy.host_addrinfo(host=ip_addr)
AttributeError: 'module' object has no attribute 'host_addrinfo'
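One hedged guess: the dispynode.py that datuner.py copies over may be newer than the dispy package installed on that machine, since the installed dispy module apparently lacks host_addrinfo while the copied script expects it. A quick diagnostic sketch to run on the failing machine (just an assumption to check, not a confirmed cause):

# diagnostic sketch: confirm which dispy package dispynode.py is importing
import dispy
print(dispy.__version__)                 # installed dispy version
print(dispy.__file__)                    # which installation is actually picked up
print(hasattr(dispy, 'host_addrinfo'))   # False reproduces the AttributeError above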
@eu49 Could you create a minimal test case to reproduce the issue?
Or tell Shaojie how to reproduce this issue with your current design.
I was running the VTR flow based on machine learning (vtr-ml-tc). It doesn't matter which design you tune, but you can try ch_intrinsics. The sample configuration file (vtr_sample0.py) is under the cfg folder.
CircleCI 2.0 no longer supports SSH between nodes: https://discuss.circleci.com/t/ssh-between-machines/13828
So the unit test for this case cannot be implemented on the CircleCI platform. As a workaround, we can create the test case on our own server using xdist (the pytest distributed testing plugin), as sketched below.
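As a rough illustration of that workaround (the datuner.py arguments below are placeholders, not the real DATuner command line), a pair of tests that pytest-xdist runs in parallel could look like:

# test_parallel_instances.py -- run with `pytest -n 2` so both tests execute at once,
# each launching its own independent DATuner instance in its own workspace
import subprocess
import pytest

@pytest.mark.parametrize('instance_id', [0, 1])
def test_independent_instance(instance_id, tmpdir):
    cmd = ['python', 'datuner.py', '-p', '2']     # placeholder arguments
    ret = subprocess.call(cmd, cwd=str(tmpdir))   # separate workspace per instance
    assert ret == 0, 'instance %d did not exit cleanly' % instance_id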
@Hecmay The parallelization fix works fine for ~30 iterations, but then the running jobs stop with the following error message. (The error message may not pop up immediately after the job stops functioning)
failed to set up [node/ip address] for compute
Could you try to reproduce this bug by testing with a higher budget (~1000) and timeout (1 day)?
@eu49 I cannot reproduce the error as you mentioned. But when running DATuner instances in parallel, the running job of one DATuner instance does sometimes stop after some iterations; it then keeps printing out the job status until the time limit is reached:
job#: 0
Total time elapsed: 20171.812294
Total time elapsed: 20181.176986
And the other DATuner instance works fine.
But this issue does not happen in every trial. I will try to run more trials to confirm it.
@Hecmay When I ran datuner-ml in the past, I ran into the same problem as @eu49. Out of curiosity, which instance of DATuner stopped and which one ran to completion? Was the one that was started first the one that stops? The error message is shown only if the debug options are set on dispynode.py (-d) and JobCluster (loglevel=dispy.logger.DEBUG).
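For reference, enabling both debug options might look like this (reusing the JobCluster call from the fix above; only a sketch):

# client side: turn on DEBUG-level dispy logging for the cluster
cluster = dispy.JobCluster(tune_function,
                           depends=['package.zip'],
                           secret=str(secret),
                           loglevel=dispy.logger.DEBUG)

# node side: start dispynode with debug output as well
#     python dispynode.py -d --serve 1 --clean --secret <uuid>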
@jg925 Are you running datuner-ml with a uuid identifier for each cluster?
In my first trial run, the DATuner instance launched first will stop working.
But I failed to reproduce the same issue in the later two trials (in which everything works well), and I am still running additional trials on zhang-08 / zhang-09 now.
@Hecmay Could you reproduce the problem? Are you running datuner-ml? I did my experiments on vtr-ml-tc, with budget 1000 and timeout 1d. It doesn't take a day to encounter the bug though. You should be able to observe it in around 30 iterations, which doesn't take hours.
@eu49 Partially. In my case, the running case stopped without any error message (in debug mode).
@Hecmay any progress fixing the problem? If dispy is too buggy, we may have to consider eventually dropping this package and using a different framework instead.
@zhangzhiru The dispynode seems to be removed from the cluster by the client (master) in the multi-cluster LAN before the nodes could reply to the master with valid results:
epoch: 500
runs_per_epoch: 2
job#: 0
job#: 1
2018-09-04 17:43:34 dispy - Running job 139895253823568 on 128.84.48.154
2018-09-04 20:32:19 dispy - Node 128.84.48.155 is not responding; removing it (3.0, 1536106700.07, 1536107539.63)
2018-09-04 20:32:19 dispy - Job 140259948089424 scheduled on 128.84.48.155 abandoned
2018-09-04 20:32:19 dispy - Job 140259926474320 scheduled on 128.84.48.155 abandoned
2018-09-04 20:32:19 dispy - Job 140259926473552 scheduled on 128.84.48.155 abandoned
2018-09-04 20:32:19 dispy - Ignoring invalid reply for job 140259926474320 from 128.84.48.155
2018-09-04 20:32:19 dispy - Ignoring invalid reply for job 140259926473552 from 128.84.48.155
2018-09-04 20:32:19 dispy - Ignoring invalid reply for job 140259948089424 from 128.84.48.155
I plan to try some potential solutions before looking for alternative frameworks:
1. Increase the timeout and add a reentrant mechanism, which enables the master to re-assign a task to other nodes if the node it was scheduled on fails to respond and is removed (a sketch follows below).
2. Use SharedScheduler to replace JobScheduler in the multi-cluster network.
The client failed to receive the UDP reply packets from the node before removing it, and any reply packets received afterwards are regarded as invalid reply because the node has already been treated as dead and removed from the cluster by the client.
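For the first option, a minimal client-side sketch (assuming dispy's standard pulse_interval and reentrant options; the interval value is just a placeholder):

# sketch: have nodes send periodic pulse messages so dead nodes are detected,
# and let dispy resubmit jobs abandoned by a removed node to the remaining nodes
cluster = dispy.JobCluster(tune_function,
                           depends=['package.zip'],
                           secret=str(secret),
                           pulse_interval=60,   # placeholder: pulse every ~60 seconds
                           reentrant=True,      # abandoned jobs are rescheduled instead of dropped
                           cleanup=False)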
A possible explanation: the OS UDP buffer might be filled up with reply packets from other clusters. Failing to process these packets, the client hangs indefinitely. By quitting with Ctrl+C and letting new packets in, the receiving process of the dispy client resumes and messages start printing out.
Solution: Adding a dispyscheduler between the dispy client and the dispynodes will resolve the issue. The dispyscheduler establishes a stable TCP connection with the client, collects UDP packets from the local network, and schedules the jobs to the nodes on behalf of the dispy client. (Another benefit is that the nodes can execute workloads from different clusters in different cycles, if secret is not set.)
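A rough sketch of that setup, assuming dispy's stock dispyscheduler.py and SharedJobCluster (the scheduler host name below is a placeholder):

# on one machine in the LAN, start the shared scheduler once:
#     python dispyscheduler.py
# each DATuner instance then submits jobs through the scheduler instead of
# talking to the dispynodes directly:
cluster = dispy.SharedJobCluster(tune_function,
                                 depends=['package.zip'],
                                 secret=str(secret),
                                 scheduler_node='scheduler-host')  # placeholder host name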
@Hecmay have you pushed your fix?
@zhangzhiru The fix has been pushed to the datuner-ml repo (the machine learning version hosted on github.coecis.cornell.edu). Do I also need to push the fix into this repo?
@Hecmay The fix you saw in datuner-ml is pushed by Jinny. She tried using dispyscheduler to help you fix the issue, but did not finalize the implementation as she is helping with Quartus' machine learning part. Did you test her implementation? She said it is not working properly. Could you please push your fix to this repo? Thanks.
@eu49 OK. I suggest not pushing to the repository before making sure the program functions well (since the unit test for the parallelization issue is not available on CircleCI).
I have not tested her implementation yet, and I may remove some of her implementation.
@Hecmay I had accidentally included the dispyscheduler changes with the database for machine learning updates commit. I was going to make another commit to revert the dispyscheduler implementation attempt, but if you will be modifying it to what you had before, please do. Sorry for the confusing commit to datuner-ml.
@jg925 No worries. You may keep the dispyscheduler implementation and make your commit first. I will re-modify the dispyscheduler part later.
When launching multiple independent DATuner workflows sequentially on different machines, only the first one works and all the other flows just hang as follows, without any further response.