Spirals-Team / hadoop-benchmark

Docker containers to build a Hadoop infrastructure and experiment with feedback control loops on top of it.
Apache License 2.0

Error creating machine: Error waiting for machine to be running: Maximum number of retries (60) exceeded #4

Closed (ghost closed this issue 8 years ago)

ghost commented 8 years ago

I am on Grid5000 and have another question. I get the error message above when running: CONFIG=g5k_cluster ./cluster.sh create-cluster

I have already set the IP addresses and added my home directory to PATH: /usr/local/bin:/usr/bin:/bin:/grid5000/code/bin:/home/kakos

How should I fix this?

ghost commented 8 years ago

In status:

Unable to query docker version: Unable to read TLS config: open /home/kakos/.docker/machine/machines/g5k-hadoop-consul/server.pem: no such file or directory

zhangbbo commented 8 years ago

In the g5k_cluster file, did you modify DRIVER_OPTS?

DRIVER_OPTS=$(echo "--generic-ssh-key=/home/bzhang/.ssh/id_rsa --generic-ssh-user=root --generic-ssh-port=22")

The "bzhang" in that path should also be changed to your own user name.
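
One way to avoid the hard-coded user name is to build the key path from $HOME. This is only a minimal sketch: it assumes the g5k_cluster config file is sourced by a shell and that your private key is at ~/.ssh/id_rsa, neither of which is confirmed in this thread. SSH_KEY is a name introduced here for illustration.

```shell
# Sketch: derive the SSH key path from the current user's home directory
# instead of hard-coding /home/bzhang.
# Assumption: the private key is ~/.ssh/id_rsa.
SSH_KEY="${HOME}/.ssh/id_rsa"
DRIVER_OPTS="--generic-ssh-key=${SSH_KEY} --generic-ssh-user=root --generic-ssh-port=22"

# Print the result so it can be checked before creating the cluster.
echo "DRIVER_OPTS=${DRIVER_OPTS}"
```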

ghost commented 8 years ago

Yes, sure, it is kakos now. If I try to create the cluster again, I get:

generic driver does not support start

zhangbbo commented 8 years ago

I have one question. Which OS did you install on the machines you got from Grid5000? jessie-x64-base (Debian Jessie)?

ghost commented 8 years ago

Yes, as written in the README.

zhangbbo commented 8 years ago

I just gave it a try. When I repeat create-cluster, I sometimes hit a similar problem. It seems to happen randomly.

Please update docker-machine to version 0.6.0 using the command below:

$ curl -L https://github.com/docker/machine/releases/download/v0.6.0/docker-machine-$(uname -s)-$(uname -m) > /home/bzhang/docker-machine && chmod +x /home/bzhang/docker-machine

Then please try again. I don't fully understand why this happens either. You must also reinstall the OS on your machines to clear the information left over from the last installation.

zhangbbo commented 8 years ago

And do not forget to clean up your docker-machine information. :)

zhangbbo commented 8 years ago

Just in case: if you hit the problem where it cannot create the containers and asks you to create them manually, this is caused by Bash on Grid5000.

Please change "nonexistent" to "*" on line 279 of cluster.sh.
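
The edit can also be scripted with sed. This is a sketch only: patch_cluster_script is a hypothetical helper name, and the line number may have shifted if cluster.sh has changed since this comment, so the helper prints line 279 first for you to check.

```shell
# Sketch: replace "nonexistent" with "*" on line 279 only.
# patch_cluster_script is a hypothetical helper; it keeps a .bak backup.
patch_cluster_script() {
    script="$1"
    # Show the line first so you can confirm it is the one you expect.
    sed -n '279p' "$script"
    # '279s' restricts the substitution to line 279;
    # '*' is a literal character in the replacement part.
    sed -i.bak '279s/nonexistent/*/' "$script"
}

# Usage (from the repository root):
# patch_cluster_script cluster.sh
```

If the printed line is not the one that mentions "nonexistent", restore cluster.sh.bak and edit the file by hand instead.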

zhangbbo commented 8 years ago

Sorry for the typo, the command to update docker-machine should be:

$ curl -L https://github.com/docker/machine/releases/download/v0.6.0/docker-machine-$(uname -s)-$(uname -m) > /home/bzhang/docker-machine && chmod +x /home/bzhang/docker-machine
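
To make the command independent of the user name, the download URL and target directory can be put in variables. A minimal sketch: VERSION, INSTALL_DIR and URL are names introduced here, and using $HOME as the install directory is an assumption (any directory on your PATH should do). The download itself is left commented out so the URL can be checked first.

```shell
# Sketch: build the release URL from the local OS and architecture.
# Assumption: the binary is installed into $HOME, which must be on PATH.
VERSION="v0.6.0"
INSTALL_DIR="${HOME}"
URL="https://github.com/docker/machine/releases/download/${VERSION}/docker-machine-$(uname -s)-$(uname -m)"

echo "Would download: ${URL}"

# Uncomment to download the binary, make it executable and check the version:
# curl -L "${URL}" -o "${INSTALL_DIR}/docker-machine" && chmod +x "${INSTALL_DIR}/docker-machine"
# "${INSTALL_DIR}/docker-machine" --version
```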

zhangbbo commented 8 years ago

There were some problems with the comments. I have updated the README. Please have a look. :)

ghost commented 8 years ago

Thanks again! Could my problem be with my oarsub command? I have read some more tutorials and I am uncertain.

zhangbbo commented 8 years ago

No, in my opinion the oarsub command is only used by Grid5000 to reserve machines. These problems concern Docker and docker-machine. I checked Stack Overflow and GitHub, and these problems seem to happen randomly. I think they are probably caused by Docker and docker-machine with the 'generic' driver.

ghost commented 8 years ago

Is there any way to clear the docker-machine machines? It throws an error again, and destroy-cluster isn't working.

zhangbbo commented 8 years ago

Indeed, destroy-cluster doesn't work on Grid5000. You can list the machines with 'docker-machine ls' and then delete them with 'docker-machine rm (machine name)'.
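
The two steps can be combined in a small loop. This is a sketch, not part of cluster.sh: filter_machines and remove_machines are hypothetical helper names, and the "g5k-hadoop" prefix is inferred from the g5k-hadoop-consul machine name that appears earlier in this thread.

```shell
# Sketch: remove every docker-machine entry whose name starts with a prefix.
filter_machines() {
    # Reads machine names on stdin, prints those that start with prefix $1.
    # '|| true' keeps the exit status at 0 when nothing matches.
    grep "^$1" || true
}

remove_machines() {
    prefix="$1"
    # 'docker-machine ls -q' prints one machine name per line.
    docker-machine ls -q | filter_machines "$prefix" | while read -r name; do
        echo "Removing ${name}"
        # -f removes the local entry even if the machine is unreachable.
        docker-machine rm -f "$name"
    done
}

# Usage:
# remove_machines g5k-hadoop
```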

ghost commented 8 years ago

Great! But I get another error: Waiting for SSH to be available... Error creating machine: Error detecting OS: Too many retries waiting for SSH to be available. Last error: Maximum number of retries (60) exceeded
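
One way to narrow this down is to check by hand whether a node accepts SSH as root on port 22, which is what DRIVER_OPTS asks docker-machine to use. The sketch below is a generic retry loop; retry is a hypothetical helper, not something provided by docker-machine or cluster.sh, and the key path in the usage line is an assumption.

```shell
# Sketch: run a command up to N times, sleeping after each failed attempt.
# Arguments: number of attempts, delay in seconds, then the command to run.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 1
}

# Usage, assuming the key is ~/.ssh/id_rsa and using a node name from this thread:
# retry 5 3 ssh -o BatchMode=yes -o ConnectTimeout=5 -i ~/.ssh/id_rsa root@sagittaire-9.lyon.grid5000.fr true
```

If the manual SSH check fails as well, the problem is likely on the node or key side rather than in docker-machine.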

zhangbbo commented 8 years ago

I just tried on the Lyon site, cluster "sagittaire". It works very well. Did you reinstall the OS with the "kadeploy3" command?

ghost commented 8 years ago

Is the reinstall command different from the install command? It worked the first time, but now I cannot run it.

zhangbbo commented 8 years ago

It's the same. You should re-run it in the terminal where you ran the 'oarsub' command.

ghost commented 8 years ago

Yes, I know, but now it says:

You do not have sufficient rights to perform the operation on all the nodes [Kadeploy Error #6]

and I did not save the original command. I found it earlier but forgot the exact form. Now I am trying: kadeploy3 -e jessie-x64-base -m sagittaire-[9,43,74].lyon.grid5000.fr

zhangbbo commented 8 years ago

Maybe you should use this command: "kadeploy3 -e jessie-x64-base -f $OAR_FILE_NODES -k ~/.ssh/id_rsa.pub"

This is much better than specifying the machine names directly.

ghost commented 8 years ago

Two, maybe last things:

  1. echo "$OAR_FILE_NODES" gives empty output, but I can see the previously mentioned nodes.
  2. The file given to -k cannot be read, but I can cat the file.

zhangbbo commented 8 years ago

  1. You can try "cat $OAR_FILE_NODES | uniq"; it will show the nodes you got.
  2. Sorry, I don't understand your question. Do you mean ~/.ssh/id_rsa.pub cannot be read by the "kadeploy3" command? Hmm... I have never met this problem before. Maybe you should post this issue to the users-g5k mailing list.
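
Both points can be checked before calling kadeploy3. A minimal sketch: check_deploy_inputs is a hypothetical helper that only reads files, and the hint about the oarsub shell reflects the earlier advice in this thread rather than anything kadeploy3 reports itself.

```shell
# Sketch: sanity-check the inputs passed to kadeploy3 in this thread.
# Arguments: path to the node file, path to the public key.
check_deploy_inputs() {
    nodes_file="$1"
    pub_key="$2"
    if [ -z "$nodes_file" ] || [ ! -r "$nodes_file" ]; then
        echo "node file is unset or unreadable; are you in the oarsub shell?" >&2
        return 1
    fi
    if [ ! -r "$pub_key" ]; then
        echo "public key ${pub_key} is not readable" >&2
        return 1
    fi
    # The node file may list a node more than once, so de-duplicate it.
    sort -u "$nodes_file"
}

# Usage:
# check_deploy_inputs "$OAR_FILE_NODES" ~/.ssh/id_rsa.pub && kadeploy3 -e jessie-x64-base -f "$OAR_FILE_NODES" -k ~/.ssh/id_rsa.pub
```

An empty $OAR_FILE_NODES usually means the shell is not the one opened by oarsub, which would also fit the "insufficient rights" error above, though that is a guess.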