apavlo / h-store

H-Store Distributed Main Memory OLTP Database System
https://hstore.cs.brown.edu
GNU General Public License v3.0

Problem deploying H-Store on AWS #152

Open i-chaochen opened 10 years ago

i-chaochen commented 10 years ago

Hi Andy,

I followed the documentation for running on EC2 using the steps below, but ant build failed.

sudo vim /etc/apt/sources.list, adding:

    deb http://archive.canonical.com/ubuntu lucid partner
    deb-src http://archive.canonical.com/ubuntu lucid partner

sudo apt-get update

The package sun-java6-jdk is not available, so I changed it to openjdk-6-jdk:

    sudo apt-get --yes install subversion gcc g++ make openjdk-6-jdk valgrind ant

svn co https://database.cs.brown.edu/svn/hstore/trunk/ $HSTORE_HOME

cp hstore.pem ~/.ssh/ && chmod 400 ~/.ssh/hstore.pem

vim trunk/properties/default.properties

global.sshoptions = -i /home/ubuntu/.ssh/hstore.pem

ant build


ee:

 [exec] g++  -Wall -Wextra -Werror -Woverloaded-virtual -Wconversion -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Winit-self -Wno-sign-compare -Wno-unused-parameter -Wno-unused-but-set-variable -pthread -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DNOCLOCK -fno-omit-frame-pointer -fvisibility=hidden -DBOOST_SP_DISABLE_THREADS -Wno-ignored-qualifiers -fno-strict-aliasing -Wno-attributes -DLINUX -fPIC -isystem ../../third_party/cpp -I../../src/ee  -c  -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects/indexes/tableindex.co ../../src/ee/indexes/tableindex.cpp
 [exec] g++  -Wall -Wextra -Werror -Woverloaded-virtual -Wconversion -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Winit-self -Wno-sign-compare -Wno-unused-parameter -Wno-unused-but-set-variable -pthread -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DNOCLOCK -fno-omit-frame-pointer -fvisibility=hidden -DBOOST_SP_DISABLE_THREADS -Wno-ignored-qualifiers -fno-strict-aliasing -Wno-attributes -DLINUX -fPIC -isystem ../../third_party/cpp -I../../src/ee  -c  -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects/indexes/tableindexfactory.co ../../src/ee/indexes/tableindexfactory.cpp

BUILD FAILED /home/ubuntu/trunk/build.xml:715: exec returned: 137

Because the ant build from the SVN checkout failed, I removed it and tried the source from git instead:

    sudo rm -r trunk/
    sudo apt-get install git
    git clone git://github.com/apavlo/h-store.git
    ant build

ee-build:
 [exec] make: Entering directory `/home/ubuntu/h-store/obj/release'
 [exec] g++ -Wall -Wextra -Werror -Woverloaded-virtual -Wconversion -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Winit-self -Wno-sign-compare -Wno-unused-parameter -pthread -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DNOCLOCK -fno-omit-frame-pointer -fvisibility=hidden -DBOOST_SP_DISABLE_THREADS -Wno-ignored-qualifiers -fno-strict-aliasing -Wno-attributes -DLINUX -fPIC -Wno-unused-but-set-variable -DANTICACHE -DANTICACHE_REVERSIBLE_LRU -isystem ../../third_party/cpp -isystem ../../obj/release/berkeleydb -I../../src/ee -c -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects//voltdbjni.co ../../src/ee//voltdbjni.cpp

BUILD FAILED /home/ubuntu/h-store/build.xml:860: exec returned: 137

Total time: 9 minutes 36 seconds

Any help will be greatly appreciated!

apavlo commented 10 years ago

That document looks out of date. You don't want to use the really old SVN repo. You want to use the GitHub one.

i-chaochen commented 10 years ago

Yes, I tried the source from GitHub, but it still failed to build:

    git clone git://github.com/apavlo/h-store.git
    ant build

ee-build:
 [exec] make: Entering directory `/home/ubuntu/h-store/obj/release'
 [exec] g++ -Wall -Wextra -Werror -Woverloaded-virtual -Wconversion -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Winit-self -Wno-sign-compare -Wno-unused-parameter -pthread -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DNOCLOCK -fno-omit-frame-pointer -fvisibility=hidden -DBOOST_SP_DISABLE_THREADS -Wno-ignored-qualifiers -fno-strict-aliasing -Wno-attributes -DLINUX -fPIC -Wno-unused-but-set-variable -DANTICACHE -DANTICACHE_REVERSIBLE_LRU -isystem ../../third_party/cpp -isystem ../../obj/release/berkeleydb -I../../src/ee -c -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects//voltdbjni.co ../../src/ee//voltdbjni.cpp

BUILD FAILED /home/ubuntu/h-store/build.xml:860: exec returned: 137

Total time: 9 minutes 36 seconds

thanks

apavlo commented 10 years ago

Is there an error from gcc? It's weird that it just fails like that.

i-chaochen commented 10 years ago

I think I finally figured out the problem: it runs out of memory while compiling the EE, at this step:

    -DANTICACHE_REVERSIBLE_LRU -isystem ../../third_party/cpp -isystem ../../obj/release/berkeleydb -I../../src/ee -c -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects//voltdbjni.co ../../src/ee//voltdbjni.cpp

I was using a micro EC2 instance, which only has about 0.6 GB of memory...

I tried a medium instance instead and it built successfully.

To anyone who wants to try H-Store on AWS: please use at least a medium-sized EC2 instance...
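For anyone hitting the same thing: exit code 137 is 128 + 9 (SIGKILL), which here means the kernel's OOM killer terminated g++ when memory ran out. A rough sketch of how to confirm that, plus a swap-file workaround I have not benchmarked (building will still be very slow on a small instance):

    # check whether the OOM killer fired during the build
    dmesg | grep -i -E 'killed process|out of memory'

    # workaround: add a temporary 2 GB swap file before re-running ant build
    sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile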

thanks

i-chaochen commented 10 years ago

Now I can build it, but I am still unable to execute the benchmark on my AWS NFS cluster.

My 2 NFS cluster nodes are in the same security group, with these inbound rules:

    TCP 22 (SSH)        source 0.0.0.0/0
    TCP 111             source 0.0.0.0/0
    TCP 2049            source 0.0.0.0/0
    TCP 44182           source 0.0.0.0/0
    TCP 54508           source 0.0.0.0/0
    UDP 111             source 0.0.0.0/0
    UDP 2049            source 0.0.0.0/0
    UDP 32768           source 0.0.0.0/0
    UDP 32770 - 32800   source 0.0.0.0/0

I configured the SSH environment:

    sudo apt-get --yes install openssh-server
    ssh-keygen -t dsa        # Do not enter a password
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    $ ssh -o StrictHostKeyChecking=no localhost "date"
    Wed Jan 29 00:58:12 UTC 2014

    $ ssh localhost date
    Wed Jan 29 01:00:14 UTC 2014

I scp'd my hstore.pem onto the NFS server node:

    cp hstore.pem ~/.ssh/ && chmod 400 ~/.ssh/hstore.pem

Then I changed the global.sshoptions parameter in $HSTORE_HOME/properties/default.properties to:

    global.sshoptions = -i /home/ubuntu/.ssh/hstore.pem

I created a cluster.txt as follows:

    host0.ip-172-31-xx-xxx.eu-west-1.compute.internal:0:0-1
    host1.ip-172-31-xx-xx.eu-west-1.compute.internal:1:2-3
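Before running the benchmark, a quick way to sanity-check that passwordless SSH works from this node to every host in cluster.txt (just a sketch using the hostnames above; BatchMode makes ssh fail instead of prompting for a password):

    for h in host0.ip-172-31-xx-xxx.eu-west-1.compute.internal \
             host1.ip-172-31-xx-xx.eu-west-1.compute.internal; do
        ssh -o BatchMode=yes -i ~/.ssh/hstore.pem "$h" hostname
    done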

There was no problem here:

    ant hstore-prepare -Dproject=tpcc -Dhosts=/home/ubuntu/cluster.txt

$ ant hstore-benchmark -Dproject=tpcc
Buildfile: /home/ubuntu/h-store/build.xml

hstore-benchmark:

benchmark:
 [java] 00:58:59,774 INFO - ------------------------- BENCHMARK INITIALIZE :: TPCC -------------------------
 [java] 00:58:59,854 INFO - Starting HStoreSite H00 on host0.ip-172-31-33-172.eu-west-1.compute.internal
 [java] 00:58:59,907 INFO - Starting HStoreSite H01 on host1.ip-172-31-24-5.eu-west-1.compute.internal
 [java] 00:58:59,980 INFO - Waiting for 2 HStoreSites with 4 partitions to finish initialization
 [java] 00:59:04,910 ERROR - Failed to poll 'site-00-host0.ip-172-31-33-172.eu-west-1.compute.internal' [exitValue=255]
 [java] 00:59:04,910 FATAL - Process 'site-00-host0.ip-172-31-33-172.eu-west-1.compute.internal' failed. Halting benchmark!
 [java] 00:59:06,413 FATAL - Failed to complete benchmark
 [java] java.lang.RuntimeException: Failed to start all HStoreSites. Halting benchmark
 [java]     at edu.brown.api.BenchmarkController.startSites(BenchmarkController.java:633)
 [java]     at edu.brown.api.BenchmarkController.setupBenchmark(BenchmarkController.java:504)
 [java]     at edu.brown.api.BenchmarkController.main(BenchmarkController.java:2216)

BUILD FAILED /home/ubuntu/h-store/build.xml:2517: The following error occurred while executing this line: /home/ubuntu/h-store/build.xml:1693: Java returned: 1

Total time: 15 seconds

I didn't see any useful log output from these two nodes:

~/h-store/obj/logs/sites$ cat site-00-host0.ip-172-31-xx-xxx.eu-west-1.compute.internal.log

2014-01-29T00:58:59.895.0

~/h-store/obj/logs/sites$ cat site-01-host1.ip-172-31-xx-xxx.eu-west-1.compute.internal.log

2014-01-29T00:58:59.971.0

Any advice?

thanks!

apavlo commented 10 years ago

Use the internal IP addresses instead of the public ones.

i-chaochen commented 10 years ago

Yes, I am using the AWS internal DNS names, as you can see in my cluster.txt:

    host0.ip-172-31-xx-xxx.eu-west-1.compute.internal:0:0-1
    host1.ip-172-31-xx-xx.eu-west-1.compute.internal:1:2-3

and internal IPs for the NFS cluster,

but it still fails to execute.

Do you mean I should use internal IP addresses instead of the internal DNS names in cluster.txt?

Something like this?

    host0.172.31.xx.xxx:0:0-1
    host1.172-31.xx.xx:1:2-3

thanks

apavlo commented 10 years ago

Enable DEBUG for 'org/voltdb/processtools/ProcessSetManager.java' in log4j.properties

See what the SSH command is that it's trying to use to start the sites and see whether you can fire them off by hand.


i-chaochen commented 10 years ago

Sorry, I'm not sure I'm completely following you. I changed the VoltDB section of log4j.properties to DEBUG:

    # VoltDB Stuff
    log4j.logger.org.voltdb.VoltProcedure=DEBUG
    log4j.logger.org.voltdb.VoltSystemProcedure=DEBUG
    log4j.logger.org.voltdb.client=DEBUG
    log4j.logger.org.voltdb.compiler=DEBUG
    log4j.logger.org.voltdb.planner=DEBUG

After running ant hstore-prepare -Dproject=tpcc -Dhosts=/home/ubuntu/cluster.txt, I haven't seen anything related to an SSH command.

Still, it fails the same way:

$ ant hstore-benchmark -Dproject=tpcc
Buildfile: /home/ubuntu/h-store/build.xml

hstore-benchmark:

benchmark:
 [java] 03:16:24,604 INFO - ------------------------- BENCHMARK INITIALIZE :: TPCC -------------------------
 [java] 03:16:24,673 INFO - Starting HStoreSite H00 on host0.ip-172-31-xx-xx.eu-west-1.compute.internal
 [java] 03:16:24,726 INFO - Starting HStoreSite H01 on host1.ip-172-31-xx-xx.eu-west-1.compute.internal
 [java] 03:16:24,782 INFO - Starting HStoreSite H02 on host2.ip-172-31-xx-xx.eu-west-1.compute.internal
 [java] 03:16:24,863 INFO - Waiting for 3 HStoreSites with 6 partitions to finish initialization
 [java] 03:16:29,729 ERROR - Failed to poll 'site-01-host1.ip-172-31-xx-xx.eu-west-1.compute.internal' [exitValue=255]
 [java] 03:16:29,729 FATAL - Process 'site-01-host1.ip-172-31-xx-xx.eu-west-1.compute.internal' failed. Halting benchmark!
 [java] 03:16:31,232 FATAL - Failed to complete benchmark
 [java] java.lang.RuntimeException: Failed to start all HStoreSites. Halting benchmark
 [java]     at edu.brown.api.BenchmarkController.startSites(BenchmarkController.java:633)
 [java]     at edu.brown.api.BenchmarkController.setupBenchmark(BenchmarkController.java:504)
 [java]     at edu.brown.api.BenchmarkController.main(BenchmarkController.java:2216)

BUILD FAILED /home/ubuntu/h-store/build.xml:2517: The following error occurred while executing this line: /home/ubuntu/h-store/build.xml:1693: Java returned: 1

Total time: 11 seconds

I checked the log; it still doesn't have any useful info:

$ cat site-01-host1.ip-172-31-xx-xx.eu-west-1.compute.internal.log

2014-01-29T03:16:24.778.0

thanks

i-chaochen commented 10 years ago

Hi Andy,

I checked ProcessSetManager.java. Does it use the "ping" command to create the processes?

public static void main(String[] args) {
    ProcessSetManager psm = new ProcessSetManager();
    psm.startProcess("ping4c", new String[] { "ping", "volt4c" });
    psm.startProcess("ping3c", new String[] { "ping", "volt3c" });
    while(true) {
        OutputLine line = psm.nextBlocking();
        System.out.printf("(%s:%s): %s\n", line.processName, line.stream.name(), line.value);
    }
}

I opened the ICMP port in the security group but was still unable to execute the benchmark.

Then I opened ALL traffic on all ports to all IPs in this security group, so no matter what kind of commands H-Store uses, there should be no problem within the security group.

But it still fails to execute the benchmark:

 [java] 22:23:22,433 INFO - Starting HStoreSite H00 on host0.ip-172-31-xx-x.eu-west-1.compute.internal
 [java] 22:23:22,572 INFO - Starting HStoreSite H01 on host1.ip-172-31-xx-x.eu-west-1.compute.internal
 [java] 22:23:22,709 INFO - Starting HStoreSite H02 on host2.ip-172-31-xx-x.eu-west-1.compute.internal
 [java] 22:23:22,837 INFO - Waiting for 3 HStoreSites with 6 partitions to finish initialization
 [java] 22:23:27,595 ERROR - Failed to poll 'site-01-host1.ip-172-31-xx-x.eu-west-1.compute.internal' [exitValue=255]
 [java] 22:23:27,596 FATAL - Process 'site-01-host1.ip-172-31-xx-x.eu-west-1.compute.internal' failed. Halting benchmark!
 [java] 22:23:29,100 FATAL - Failed to complete benchmark
 [java] java.lang.RuntimeException: Failed to start all HStoreSites. Halting benchmark
 [java]     at edu.brown.api.BenchmarkController.startSites(BenchmarkController.java:633)
 [java]     at edu.brown.api.BenchmarkController.setupBenchmark(BenchmarkController.java:504)
 [java]     at edu.brown.api.BenchmarkController.main(BenchmarkController.java:2216)

BUILD FAILED /home/ubuntu/h-store/build.xml:2517: The following error occurred while executing this line: /home/ubuntu/h-store/build.xml:1693: Java returned: 1

Total time: 50 seconds

And there is no info in these two logs except the date:

~/h-store/obj/logs/sites$ cat site-01-host1.172.31.xx.x.eu-west-1.compute.internal.log

2014-01-29T03:32:26.251.0

~/h-store/obj/logs/sites$ cat site-01-host1.ip-172-31-xx-x.eu-west-1.compute.internal.log

2014-01-29T22:23:22.698.0

I am quite suspicious about cluster.txt. Is it in the right format?

    $ cat cluster.txt
    host0.ip-172-31-xx-x.eu-west-1.compute.internal:0:0-1
    host1.ip-172-31-xx-x.eu-west-1.compute.internal:1:2-3
    host2.ip-172-31-xx-x.eu-west-1.compute.internal:2:4-5
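For reference, my reading of each field, based on the examples in the documentation (this is my own assumption, not something confirmed in this thread):

    # <hostname>:<site-id>:<partition-id or first-last range>
    # e.g. site 0 runs on host0 and owns partitions 0 and 1:
    # host0.ip-172-31-xx-x.eu-west-1.compute.internal:0:0-1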

Any further advice will be appreciated.

thanks

apavlo commented 10 years ago

Add this to the bottom of log4j.properties:

log4j.logger.org.voltdb.processtools.ProcessSetManager=DEBUG

Run the benchmark with this turned on, then check the site log to look for the SSH command that it's trying to send over the wire. Then copy and paste that command in a terminal to check whether it works.
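It should be something roughly along these lines (just a sketch using the key path and hostnames from your earlier comments; the log will show the exact command and arguments):

    ssh -i /home/ubuntu/.ssh/hstore.pem -o StrictHostKeyChecking=no \
        host1.ip-172-31-xx-x.eu-west-1.compute.internal hostname

If that prompts for a password, hangs, or is refused, the BenchmarkController won't be able to launch the remote HStoreSite either.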

i-chaochen commented 10 years ago

Yes, I added it and copied the SSH command to run it by hand; it says it failed to connect to the remote site.

I checked the source code related to connecting to remote nodes, and two things confuse me:

  1. Should I change my EC2 hostnames to host0, host1, and host2 to match cluster.txt?

Does the SSH login username affect the connection? I changed host0, host1, and host2 all to ubuntu in cluster.txt, since that is the default username for EC2, but it still failed at execution.

  2. When I built the NFS cluster on AWS, I followed the steps from http://hstore.cs.brown.edu/documentation/deployment/running-on-amazon-ec2/

The autofs part is set up so that it automatically syncs all folders and files under /home/.

But when I set up the SSH environment on each NFS server and client with:

    ssh-keygen -t dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

autofs automatically syncs each key to all of the others,

which means I can only run ssh localhost date on one EC2 instance.

So, should I rewrite my auto.home file so that it does not sync all files under /home/&?

I ask because the documentation specifically mentions that the directory needs to end with a '/' followed by a '&'.

But that seems to conflict with the SSH environment configuration, so would you give me some clues, please?
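To make the question concrete, this is the kind of autofs change I am considering: mount only the shared h-store checkout instead of all of /home/, so each node keeps its own local ~/.ssh (just a sketch; nfs-server and the paths are placeholders for my setup):

    # /etc/auto.master -- replace the /home wildcard entry with a dedicated mount point
    /hstore    /etc/auto.hstore

    # /etc/auto.hstore -- share only the H-Store checkout from the NFS server
    h-store    -fstype=nfs,rw    nfs-server:/home/ubuntu/h-store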

thanks


i-chaochen commented 10 years ago

Hi Andy,

This time I changed every EC2 instance's hostname to match cluster.txt, mounted only the h-store folder (instead of /home/&) within the NFS cluster, and added this line to log4j.properties:

log4j.logger.org.voltdb.processtools.ProcessSetManager=DEBUG

When I run the SSH command by hand, it prints "Unable to set CPU affinity.." and "Insufficient number of cores", so it disables the transaction pre/post-processing threads, and the connection and execution fail.

But I can execute the H-Store benchmark on a single large EC2 instance without any problem.

I built this NFS cluster on AWS from 3 large EC2 instances of the same size, and it reports an insufficient number of cores.

Is H-Store a sharded NoSQL-style system where each node is isolated from the others? Should it need fewer resources per machine if I use a cluster to run this benchmark instead of a single machine? Why can I execute it on a single large EC2 instance but not on 3 instances of the same size, where it complains about an insufficient number of cores? Should I build the cluster from larger, more expensive EC2 instances to run this benchmark, or did I do something else wrong, such as mounting only the h-store folder within the NFS cluster?
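For reference, this is what I am comparing on each node (a rough sketch; my assumption, which I have not confirmed, is that each site wants at least one core per partition plus a few extra threads):

    # cores available on this host
    nproc

    # partitions assigned to this host in cluster.txt
    # (e.g. host1...:1:2-3 means site 1 owns partitions 2 and 3)
    grep "$(hostname)" ~/cluster.txt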

Would you give me some clues on this, please?

thanks!