at15 / hadoop-spark-perf

(Deprecated) Hadoop and Spark CPU performance Benchmark
MIT License

Switch to libvirt #14

Closed at15 closed 7 years ago

at15 commented 7 years ago

Due to #13, we need to use libvirt. Currently only the Fedora box is working; the base script may need some modification, or I could simply execute the install script on every machine. Only HiBench takes a long time to compile, and I could use the pre-compiled one.

Ref

at15 commented 7 years ago

Can't use 9p for the shared folder:

There was an error talking to Libvirt. The error message is shown
below:

Call to virDomainCreateWithFlags failed: internal error: process exited while connecting to monitor: 2017-03-13T16:22:05.452554Z qemu-system-x86_64: -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=b0211f19c2b24becc176a46c2524d9f,bus=pci.0,addr=0x6: 9pfs Failed to initialize fs-driver with id:fsdev-fs0 and export path:/home/at15/workspace/src/github.com/at15/hadoop-spark-perf/provision/base
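
Probably the simplest workaround is to stop relying on the 9p driver and sync the folder with rsync instead (or drop the share entirely); a minimal Vagrantfile sketch, nothing assumed beyond the default /vagrant share:

# Vagrantfile: avoid the 9p fs-driver by syncing the folder with rsync
Vagrant.configure("2") do |config|
  config.vm.synced_folder ".", "/vagrant", type: "rsync"
  # or disable the shared folder altogether:
  # config.vm.synced_folder ".", "/vagrant", disabled: true
end
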
at15 commented 7 years ago

Got an error when packaging the box:

base: Require set read access to /var/lib/libvirt/images/base_base.img. sudo chmod a+r /var/lib/libvirt/images/base_base.img

and another

/home/at15/.vagrant.d/gems/gems/vagrant-libvirt-0.0.37/lib/vagrant-libvirt/action/package_domain.rb:41:in ``': No such file or directory - virt-sysprep (Errno::ENOENT)
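
The second error just means virt-sysprep isn't installed on the host; it ships with the libguestfs tools (the package names below are my assumption, depending on the distro). A sketch of fixing both:

# let vagrant read the base image, as the first message suggests
sudo chmod a+r /var/lib/libvirt/images/base_base.img
# install virt-sysprep for `vagrant package` (Fedora: libguestfs-tools-c, Debian/Ubuntu: libguestfs-tools)
sudo dnf install -y libguestfs-tools-c
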
at15 commented 7 years ago

The re-packaged box is stuck waiting for SSH to become available:

==> single: Waiting for domain to get an IP address...
==> single: Waiting for SSH to become available...

Found a similar report at https://github.com/vagrant-libvirt/vagrant-libvirt/issues/452. I guess this is related to packaging the box, so the solution is simple: don't package the box at all, and run the install script on every node instead.
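
Roughly what that looks like in the Vagrantfile, using the stock Fedora box instead of a packaged one; the box name and the install script path here are guesses, the node names come from the cluster below:

# Vagrantfile: provision every node from the plain box instead of packaging a pre-built one
Vagrant.configure("2") do |config|
  config.vm.box = "fedora/25-cloud-base"                       # base box name assumed
  ["master", "slave1", "slave2"].each do |name|
    config.vm.define name do |node|
      node.vm.provision "shell", path: "provision/install.sh"  # script path assumed
    end
  end
end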

at15 commented 7 years ago

Followed http://ask.xmodulo.com/change-default-location-libvirt-vm-images.html to change the storage pool location, but got a permission error:

Call to virDomainCreateWithFlags failed: Cannot access storage file '/home/at15/tmp/libvirt/cluster_slave2.img' (as uid:107, gid:107): Permission denied
drwxr-xr-x.  2 root root 4096 Mar 13 11:02 images
drwxrwxr-x   2 at15 at15  4096 Mar 13 11:07 libvirt

https://github.com/adrahon/vagrant-kvm/issues/163 mentions changing the user in /etc/libvirt/qemu.conf to root; that may need a reboot or logout to take effect?

In particular note that if using the "system" instance and attempting to store disk images in a user home directory, the default permissions on $HOME are typically too restrictive to allow access.

Solution

Change user = "at15" in /etc/libvirt/qemu.conf. The group stays root; I don't know if it would still work if I commented out the group.
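
A sketch of the change (then restart libvirtd, e.g. sudo systemctl restart libvirtd, so it takes effect):

# /etc/libvirt/qemu.conf -- run the qemu processes as my user so they can read images under $HOME
user = "at15"
group = "root"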

at15 commented 7 years ago

Hadoop

Stop HDFS and Yarn
Stopping namenodes on [master.perf.at15]
master.perf.at15: Warning: Permanently added 'master.perf.at15,192.168.233.18' (ECDSA) to the list of known hosts.
master.perf.at15: no namenode to stop
slave2.perf.at15: Warning: Permanently added 'slave2.perf.at15,192.168.233.20' (ECDSA) to the list of known hosts.
slave1.perf.at15: Warning: Permanently added 'slave1.perf.at15,192.168.233.19' (ECDSA) to the list of known hosts.
master.perf.at15: Warning: Permanently added 'master.perf.at15,192.168.233.18' (ECDSA) to the list of known hosts.
slave2.perf.at15: no datanode to stop
slave1.perf.at15: stopping datanode
master.perf.at15: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
no resourcemanager to stop
master.perf.at15: Warning: Permanently added 'master.perf.at15,192.168.233.18' (ECDSA) to the list of known hosts.
slave1.perf.at15: Warning: Permanently added 'slave1.perf.at15,192.168.233.19' (ECDSA) to the list of known hosts.
slave2.perf.at15: Warning: Permanently added 'slave2.perf.at15,192.168.233.20' (ECDSA) to the list of known hosts.
slave1.perf.at15: stopping nodemanager
master.perf.at15: stopping nodemanager
slave2.perf.at15: stopping nodemanager
slave2.perf.at15: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop
Finish stop HDFS and Yarn

Spark

It seems none of the nodes started, not even the master node itself.

And the master always takes a long time to start, which is quite strange.

The error message for Spark is:

17/03/13 19:13:03 INFO master.Master: I have been elected leader! New state: ALIVE
17/03/13 19:14:24 INFO master.Master: 192.168.233.1:33110 got disassociated, removing it.
17/03/13 19:14:31 INFO master.Master: 192.168.233.18:59868 got disassociated, removing it.
17/03/13 19:14:53 INFO master.Master: 192.168.233.18:59870 got disassociated, removing it.
17/03/13 19:15:13 INFO master.Master: 192.168.233.18:59874 got disassociated, removing it.

Might have to do with SELinux.
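
A quick way to test that theory, assuming SELinux really is the problem (diagnostic only, not a fix):

getenforce                        # Enforcing / Permissive / Disabled
sudo setenforce 0                 # temporarily switch to permissive, then retry starting the workers
sudo ausearch -m avc -ts recent   # check for AVC denials if it was enforcing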

at15 commented 7 years ago

The Hadoop datanode fails to start (http://stackoverflow.com/questions/22316187/datanode-not-starts-correctly): Fedora does not clean /tmp, so the old datanode data survives and I can't just format the namenode every time.
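
The usual fix from that question is to wipe the stale datanode directory before reformatting (or move the HDFS dirs out of /tmp); a sketch, assuming the default hadoop.tmp.dir layout:

# on each datanode: drop the stale data dir so it accepts the freshly formatted namenode
rm -rf /tmp/hadoop-${USER}/dfs/data
# on the master: reformat and bring HDFS back up
hdfs namenode -format
start-dfs.sh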

Got an exception for Spark:

Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
        at com.intel.hibench.sparkbench.micro.ScalaSort$.main(ScalaSort.scala:47)
        at com.intel.hibench.sparkbench.micro.ScalaSort.main(ScalaSort.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I guess that's because of how I built HiBench? Yeah, rebuilding it in the master box works.
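
Rebuilding on the box means HiBench gets compiled against the same Scala that the installed Spark uses; something like this, assuming HiBench's maven build options (the version numbers are guesses for this setup):

# rebuild HiBench against the Spark/Scala actually installed on the cluster
mvn -Psparkbench -Dspark=2.1 -Dscala=2.11 clean package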

at15 commented 7 years ago

OK, got the perf counters for the Hadoop cluster, from YarnChild on a slave node:
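
(Collected by attaching perf stat to the running JVM, roughly like below; the pgrep pattern and event list are my reconstruction, not the exact command used.)

# attach to the YarnChild process and count user-space events until it exits (or Ctrl-C)
perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses \
    -p $(pgrep -f YarnChild)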

       2551.303147      task-clock:u (msec)       #    0.030 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             2,734      page-faults:u             #    0.001 M/sec                  
     7,952,773,016      cycles:u                  #    3.117 GHz                    
    12,035,713,476      instructions:u            #    1.51  insn per cycle         
     1,913,355,376      branches:u                #  749.952 M/sec                  
        44,277,795      branch-misses:u           #    2.31% of all branches      
at15 commented 7 years ago

Spark CoarseGrainedExecutorBackend, cluster:

      15610.605287      task-clock:u (msec)       #    0.605 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
           428,833      page-faults:u             #    0.027 M/sec                  
    48,751,727,760      cycles:u                  #    3.123 GHz                    
    63,078,193,487      instructions:u            #    1.29  insn per cycle         
     9,912,664,768      branches:u                #  634.996 M/sec                  
       154,409,990      branch-misses:u           #    1.56% of all branches  
at15 commented 7 years ago

Spark CoarseGrainedExecutorBackend, single node, sort (small):

     10812.559971      task-clock:u (msec)       #    0.619 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            81,832      page-faults:u             #    0.008 M/sec                  
    36,533,260,893      cycles:u                  #    3.379 GHz                    
    60,894,812,484      instructions:u            #    1.67  insn per cycle         
     9,876,637,449      branches:u                #  913.441 M/sec                  
       217,182,645      branch-misses:u           #    2.20% of all branches