cloudera / clusterdock


cloudera-scm-agent: bash: /var/log/cloudera-scm-agent/cloudera-scm-agent.out: No such file or directory #26

Closed · lc2a closed this issue 7 years ago

lc2a commented 7 years ago

```
[root@vm02 ~]# clusterdock_run ./bin/start_cluster -n testing cdh --primary-node=node-1 --secondary-nodes='node-{2..4}' --include-service-types=HDFS,HIVE,HUE,ZOOKEEPER,HBASE,YARN,SPARK_ON_YARN,SQOOP2
++ clusterdock_run ./bin/start_cluster -n testing cdh --primary-node=node-1 '--secondary-nodes=node-{2..4}' --include-service-types=HDFS,HIVE,HUE,ZOOKEEPER,HBASE,YARN,SPARK_ON_YARN,SQOOP2
++ '[' -z docker.io/cloudera/clusterdock:latest ']'
++ '[' '' '!=' false ']'
++ sudo docker pull docker.io/cloudera/clusterdock:latest
++ '[' -n '' ']'
++ '[' -n '' ']'
++ '[' -n '' ']'
++ '[' -n '' ']'
++ '[' -n '' ']'
++ B
-bash: B: command not found
++ sudo docker run --net=host -t --privileged -v /tmp/clusterdock -v /etc/hosts:/etc/hosts -v /etc/localtime:/etc/localtime -v /var/run/docker.sock:/var/run/docker.sock docker.io/cloudera/clusterdock:latest ./bin/start_cluster -n testing cdh --primary-node=node-1 '--secondary-nodes=node-{2..4}' --include-service-types=HDFS,HIVE,HUE,ZOOKEEPER,HBASE,YARN,SPARK_ON_YARN,SQOOP2
INFO:clusterdock.topologies.cdh.actions:Pulling image docker.io/cloudera/clusterdock:cdh580_cm581_primary-node. This might take a little while...
Trying to pull repository docker.io/cloudera/clusterdock ...
cdh580_cm581_primary-node: Pulling from docker.io/cloudera/clusterdock
Digest: sha256:9feffbfc5573262a6efbbb0a969efde890e63ced8a4ab3c9982f4f0dc607e429
INFO:clusterdock.topologies.cdh.actions:Pulling image docker.io/cloudera/clusterdock:cdh580_cm581_secondary-node. This might take a little while...
Trying to pull repository docker.io/cloudera/clusterdock ...
cdh580_cm581_secondary-node: Pulling from docker.io/cloudera/clusterdock
Digest: sha256:251778378b362adff4e93b99d423848216e4823965dabd1bd4c41dbb4c79afcf
INFO:clusterdock.cluster:Successfully started node-2.testing (IP address: 192.168.124.7).
INFO:clusterdock.cluster:Successfully started node-3.testing (IP address: 192.168.124.8).
INFO:clusterdock.cluster:Successfully started node-4.testing (IP address: 192.168.124.9).
INFO:clusterdock.cluster:Successfully started node-1.testing (IP address: 192.168.124.6).
INFO:clusterdock.cluster:Started cluster in 12.68 seconds.
INFO:clusterdock.topologies.cdh.actions:Changing server_host to node-1.testing in /etc/cloudera-scm-agent/config.ini...
INFO:clusterdock.topologies.cdh.actions:Removing files (/var/lib/cloudera-scm-agent/uuid, /dfs/dn/current/) from hosts (node-3.testing, node-4.testing)...
INFO:clusterdock.topologies.cdh.actions:Restarting CM agents...
cloudera-scm-agent is already stopped
cloudera-scm-agent is already stopped
Starting cloudera-scm-agent: bash: /var/log/cloudera-scm-agent/cloudera-scm-agent.out: No such file or directory
[FAILED]
Starting cloudera-scm-agent: bash: /var/log/cloudera-scm-agent/cloudera-scm-agent.out: No such file or directory
[FAILED]

Fatal error: run() received nonzero return code 1 while executing!

Requested: service cloudera-scm-agent restart
Executed: /bin/bash -l -c "service cloudera-scm-agent restart"

Aborting.

Fatal error: run() received nonzero return code 1 while executing!

Requested: service cloudera-scm-agent restart
Executed: /bin/bash -l -c "service cloudera-scm-agent restart"

Aborting.
cloudera-scm-agent is already stopped
Starting cloudera-scm-agent: bash: /var/log/cloudera-scm-agent/cloudera-scm-agent.out: No such file or directory
[FAILED]

Fatal error: run() received nonzero return code 1 while executing!

Requested: service cloudera-scm-agent restart
Executed: /bin/bash -l -c "service cloudera-scm-agent restart"

Aborting.
cloudera-scm-agent is already stopped
Starting cloudera-scm-agent: bash: /var/log/cloudera-scm-agent/cloudera-scm-agent.out: No such file or directory
[FAILED]

Fatal error: run() received nonzero return code 1 while executing!

Requested: service cloudera-scm-agent restart
Executed: /bin/bash -l -c "service cloudera-scm-agent restart"

Aborting.

Fatal error: One or more hosts failed while executing task '_task'

Aborting.
INFO:clusterdock.topologies.cdh.actions:Waiting for Cloudera Manager server to come online...
Traceback (most recent call last):
  File "./bin/start_cluster", line 70, in <module>
    main()
  File "./bin/start_cluster", line 63, in main
    actions.start(args)
  File "/root/clusterdock/clusterdock/topologies/cdh/actions.py", line 108, in start
    CM_SERVER_PORT, timeout_sec=180)
  File "/root/clusterdock/clusterdock/utils.py", line 52, in wait_for_port_open
    timeout_sec, address, port
Exception: Timed out after 180 seconds waiting for 192.168.124.6:7180 to be open.
++ '[' -n '' ']'
++ printf '\033]0;%s@%s:%s\007' root vm02 '~'
```
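(For what it's worth, that final exception just means the Cloudera Manager web port never opened within 180 seconds. The same check can be done by hand while the cluster is starting; a rough shell equivalent, assuming `nc` is installed, and not clusterdock's actual code:)

```bash
# Poll node-1's Cloudera Manager port (7180) for up to ~180 seconds.
for i in $(seq 1 180); do
    if nc -z 192.168.124.6 7180; then
        echo "CM port is open"
        break
    fi
    sleep 1
done
```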

dimaspivak commented 7 years ago

How many CPUs/how much RAM does the machine you're using have? What OS is it running?

lc2a commented 7 years ago

```
[root@vm-02 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             47           1          43           0           1          45
Swap:             7           0           7
```

```
[root@vm-02 ~]# rpm --query centos-release
centos-release-7-3.1611.el7.centos.x86_64
```

```
[root@vm-02 ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD Opteron(tm) Processor 6386 SE
stepping        : 0
microcode       : 0x6000832
cpu MHz         : 2792.034
cache size      : 2048 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm constant_tsc art rep_good nopl tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm extapic abm sse4a misalignsse 3dnowprefetch osvw xop fma4 arat
bogomips        : 5586.00
TLB size        : 1536 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
```

(Processors 1 through 7 report identical values apart from the processor, core id, apicid, and initial apicid fields.)

```
[root@vm-02 ~]# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-11.el7.centos.x86_64
 Go version:      go1.7.4
 Git commit:      96d83a5/1.12.6
 Built:           Tue Mar 7 09:23:34 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-11.el7.centos.x86_64
 Go version:      go1.7.4
 Git commit:      96d83a5/1.12.6
 Built:           Tue Mar 7 09:23:34 2017
 OS/Arch:         linux/amd64
```

```
[root@hdp-test-02 ~]# docker info
Containers: 15
 Running: 0
 Paused: 0
 Stopped: 15
Images: 4
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-253:0-137522201-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 10.26 GB
 Data Space Total: 107.4 GB
 Data Space Available: 14.95 GB
 Metadata Space Used: 9.638 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.138 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host null overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp
Kernel Version: 3.10.0-514.6.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 8
Total Memory: 47.01 GiB
Name: hdp-test-02.hpls.local
ID: RJLD:4XQQ:JUMN:KPWO:5D4X:MM3K:CDVF:CN7G:EDHD:UWMO:RUYB:PFFV
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8
Registries: docker.io (secure)
```

Thanks :)

dimaspivak commented 7 years ago

I'm pretty confident this is caused by using devicemapper as the storage driver backend. Switch over to aufs or overlayfs and you shouldn't see this issue again.
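(For anyone landing here: on Docker 1.12 and later, the storage driver can be switched through /etc/docker/daemon.json. A sketch only; the config path and which of overlay/overlay2/aufs is usable depend on your distro and kernel, and images/containers created under the old driver become invisible after the switch:)

```bash
# Check what the daemon is using now.
sudo docker info | grep 'Storage Driver'

# Select the overlay driver (use overlay2 or aufs where your kernel supports it).
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "overlay"
}
EOF

# Restart the daemon so it picks up the new driver.
sudo systemctl restart docker
```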

lc2a commented 7 years ago

Reinstalled Docker as docker-ce:

```
[root@vm-02 ~]# docker info
Containers: 7
 Running: 4
 Paused: 0
 Stopped: 3
Images: 4
Server Version: 17.03.1-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.6.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 47.01 GiB
Name: hdp-test-02.hpls.local
ID: RJLD:4XQQ:JUMN:KPWO:5D4X:MM3K:CDVF:CN7G:EDHD:UWMO:RUYB:PFFV
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```

```
[root@vm-02 ~]# clusterdock_run ./bin/start_cluster -n testing cdh --primary-node=node-1 --secondary-nodes='node-{2..4}' --include-service-types=HDFS,HIVE,HUE,ZOOKEEPER,HBASE,YARN,SPARK_ON_YARN,SQOOP2
-bash: B: command not found
INFO:clusterdock.cluster:Network (testing) not present, creating it...
INFO:clusterdock.cluster:Successfully setup network (name: testing).
INFO:clusterdock.cluster:Successfully started node-2.testing (IP address: 192.168.123.3).
INFO:clusterdock.cluster:Successfully started node-3.testing (IP address: 192.168.123.4).
INFO:clusterdock.cluster:Successfully started node-4.testing (IP address: 192.168.123.5).
INFO:clusterdock.cluster:Successfully started node-1.testing (IP address: 192.168.123.2).
INFO:clusterdock.cluster:Started cluster in 27.74 seconds.
INFO:clusterdock.topologies.cdh.actions:Changing server_host to node-1.testing in /etc/cloudera-scm-agent/config.ini...
INFO:clusterdock.topologies.cdh.actions:Removing files (/var/lib/cloudera-scm-agent/uuid, /dfs/dn/current/) from hosts (node-3.testing, node-4.testing)...
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir0': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir0': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir1': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir1': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir2': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir2': Directory not empty

Fatal error: run() received nonzero return code 1 while executing!

Requested: rm -rf /var/lib/cloudera-scm-agent/uuid /dfs/dn/current/
Executed: /bin/bash -l -c "rm -rf /var/lib/cloudera-scm-agent/uuid /dfs/dn/current/"

Aborting.

Fatal error: run() received nonzero return code 1 while executing!

Requested: rm -rf /var/lib/cloudera-scm-agent/uuid /dfs/dn/current/
Executed: /bin/bash -l -c "rm -rf /var/lib/cloudera-scm-agent/uuid /dfs/dn/current/"

Aborting.

Fatal error: One or more hosts failed while executing task '_task'

Aborting.
```

dimaspivak commented 7 years ago

Did you completely remove the /var/lib/docker folder after reinstalling? This still looks like something caused by holdover from devicemapper.
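(For reference, a complete wipe would look something like the following. Destructive: it removes every local image, container, and volume; assumes systemd:)

```bash
sudo systemctl stop docker
sudo rm -rf /var/lib/docker                  # deletes all images/containers, including old devicemapper state
sudo systemctl start docker
sudo docker info | grep 'Storage Driver'     # confirm no devicemapper leftovers are in play
```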

-Dima

lc2a commented 7 years ago

Retried.

1) It stopped as below:

```
[root@vm-02 ~]# clusterdock_run ./bin/start_cluster -n cluster cdh --primary-node=node-1 --secondary-nodes=node-{2..3} --include-service-types=HDFS,YARN --dont-start-cluster
-bash: B: command not found
INFO:clusterdock.topologies.cdh.actions:Pulling image docker.io/cloudera/clusterdock:cdh580_cm581_secondary-node. This might take a little while...
cdh580_cm581_secondary-node: Pulling from cloudera/clusterdock
3eaa9b70c44a: Already exists
99ba8e23f310: Already exists
c9c08e9a0d03: Already exists
7434a9a99daa: Already exists
d52d9baa0ee6: Already exists
f70deff0592f: Pull complete
Digest: sha256:251778378b362adff4e93b99d423848216e4823965dabd1bd4c41dbb4c79afcf
Status: Image is up to date for cloudera/clusterdock:cdh580_cm581_secondary-node
INFO:clusterdock.cluster:Network (cluster) not present, creating it...
INFO:clusterdock.cluster:Successfully setup network (name: cluster).
INFO:clusterdock.cluster:Successfully started node-3.cluster (IP address: 192.168.124.2).
INFO:clusterdock.cluster:Successfully started node-1.cluster (IP address: 192.168.123.2).
INFO:clusterdock.cluster:Started cluster in 26.85 seconds.
INFO:clusterdock.topologies.cdh.actions:Changing server_host to node-1.cluster in /etc/cloudera-scm-agent/config.ini...
INFO:clusterdock.topologies.cdh.actions:Restarting CM agents...
cloudera-scm-agent is already stopped
Starting cloudera-scm-agent: [  OK  ]
Stopping cloudera-scm-agent: [  OK  ]
Starting cloudera-scm-agent: [  OK  ]
INFO:clusterdock.topologies.cdh.actions:Waiting for Cloudera Manager server to come online...
INFO:clusterdock.topologies.cdh.actions:Detected Cloudera Manager server after 108.39 seconds.
INFO:clusterdock.topologies.cdh.actions:CM server is now accessible at http://test.local:32769
INFO:clusterdock.topologies.cdh.cm:Detected CM API v13.
INFO:clusterdock.topologies.cdh.cm_utils:Updating database configurations...
INFO:clusterdock.topologies.cdh.cm:Updating NameNode references in Hive metastore...
```

```
[root@vm-02 ~]# docker ps -a
CONTAINER ID   IMAGE                                                        COMMAND                  CREATED          STATUS                      PORTS                                              NAMES
026bab52e24e   docker.io/cloudera/clusterdock:cdh580_cm581_secondary-node   "/sbin/init"             3 minutes ago    Up 3 minutes                                                                   cocky_clarke
46bf9a62c5b3   docker.io/cloudera/clusterdock:cdh580_cm581_primary-node     "/sbin/init"             3 minutes ago    Up 3 minutes                0.0.0.0:32775->7180/tcp, 0.0.0.0:32774->8888/tcp   romantic_hamilton
25367e5e6e93   docker.io/cloudera/clusterdock:latest                        "python ./bin/star..."   3 minutes ago    Up 3 minutes                                                                   vigilant_minsky
81b596f7fe27   docker.io/cloudera/clusterdock:latest                        "python ./bin/hous..."   10 minutes ago   Exited (0) 10 minutes ago                                                      gracious_hamilton
```

Logged in to CM.

2) Why were ALL services installed? I specified --include-service-types=HDFS,YARN,ZOOKEEPER.

3) On the Hosts page, why are there only 2 nodes? I specified --primary-node=node-1 --secondary-nodes=node-{2..4}.

node-1.cluster (192.168.124.2), 22 roles: HBase Master, HDFS Balancer, HDFS NameNode, HDFS SecondaryNameNode, Hive Gateway, Hive Metastore Server, HiveServer2, Hue Server, Impala Catalog Server, Impala StateStore, Key-Value Store Indexer Lily HBase Indexer, Cloudera Management Service Alert Publisher, Cloudera Management Service Event Server, Cloudera Management Service Host Monitor, Cloudera Management Service Service Monitor, Oozie Server, Solr Server, Spark Gateway, Spark History Server, YARN (MR2 Included) JobHistory Server, YARN (MR2 Included) ResourceManager, ZooKeeper Server

node-2.cluster (192.168.124.3), 6 roles: HBase RegionServer, HDFS DataNode, Hive Gateway, Impala Daemon, Spark Gateway, YARN (MR2 Included) NodeManager

dimaspivak commented 7 years ago

If it didn't time out, it sounds like it's still running. Services are only removed once CM setup completes, which is why you still saw all of them there.

dimaspivak commented 7 years ago

Ah, part of your problem is that you need to quote the --secondary-nodes argument when it uses Bash brace expansion. That is, --secondary-nodes='node-{2..4}', not --secondary-nodes=node-{2..4}. That would explain why you're only seeing two nodes even though you specified three (not four); see the sketch below.
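(You can watch the expansion happen with a plain echo; the shell rewrites the unquoted argument before start_cluster ever sees it:)

```bash
$ echo --secondary-nodes=node-{2..3}
--secondary-nodes=node-2 --secondary-nodes=node-3

$ echo --secondary-nodes='node-{2..3}'
--secondary-nodes=node-{2..3}
```

If the option parser keeps only the last of the repeated --secondary-nodes values, that would also line up with node-3 being the only secondary node started in the log above.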

lc2a commented 7 years ago

Oops... corrected to --secondary-nodes='node-{2..4}', but I still get the same errors:

```
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir0': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir0': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir1': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir1': Directory not empty
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir2': Directory not empty

Fatal error: run() received nonzero return code 1 while executing!

Requested: rm -rf /var/lib/cloudera-scm-agent/uuid /dfs/dn/current/
Executed: /bin/bash -l -c "rm -rf /var/lib/cloudera-scm-agent/uuid /dfs/dn/current/"

Aborting.
rm: cannot remove `/dfs/dn/current/BP-637181590-192.168.124.2-1469835153284/current/finalized/subdir0/subdir2': Directory not empty

Fatal error: run() received nonzero return code 1 while executing!

Requested: rm -rf /var/lib/cloudera-scm-agent/uuid /dfs/dn/current/
Executed: /bin/bash -l -c "rm -rf /var/lib/cloudera-scm-agent/uuid /dfs/dn/current/"

Aborting.

Fatal error: One or more hosts failed while executing task '_task'

Aborting.
```

Retried with 2 nodes, but it got stuck at the Hive metastore log line:

```
[root@vm-02 ~]# clusterdock_run ./bin/housekeeping nuke
-bash: B: command not found
INFO:housekeeping:Removing all containers on this host...
INFO:housekeeping:Successfully removed all containers on this host.
INFO:housekeeping:Removing all user-defined networks on this host...
INFO:housekeeping:Successfully removed all user-defined networks on this host.
INFO:housekeeping:Clearing container entries from /etc/hosts...
INFO:housekeeping:Successfully cleared container entries from /etc/hosts.
INFO:housekeeping:Restarting Docker daemon...
INFO:housekeeping:Successfully nuked this host.

[root@vm-02 ~]# docker ps -a
CONTAINER ID   IMAGE                                   COMMAND                  CREATED          STATUS                      PORTS   NAMES
0d4936fe85f4   docker.io/cloudera/clusterdock:latest   "python ./bin/hous..."   14 seconds ago   Exited (0) 10 seconds ago           festive_edison

[root@vm-02 ~]# clusterdock_run ./bin/start_cluster -n testing cdh --include-service-types=HDFS,YARN,ZOOKEEPER --dont-start-cluster
-bash: B: command not found
INFO:clusterdock.cluster:Network (testing) not present, creating it...
INFO:clusterdock.cluster:Successfully setup network (name: testing).
INFO:clusterdock.cluster:Successfully started node-2.testing (IP address: 192.168.123.3).
INFO:clusterdock.cluster:Successfully started node-1.testing (IP address: 192.168.123.2).
INFO:clusterdock.cluster:Started cluster in 26.68 seconds.
INFO:clusterdock.topologies.cdh.actions:Changing server_host to node-1.testing in /etc/cloudera-scm-agent/config.ini...
INFO:clusterdock.topologies.cdh.actions:Restarting CM agents...
cloudera-scm-agent is already stopped
Starting cloudera-scm-agent: [  OK  ]
Stopping cloudera-scm-agent: [  OK  ]
Starting cloudera-scm-agent: [  OK  ]
INFO:clusterdock.topologies.cdh.actions:Waiting for Cloudera Manager server to come online...
INFO:clusterdock.topologies.cdh.actions:Detected Cloudera Manager server after 88.23 seconds.
INFO:clusterdock.topologies.cdh.actions:CM server is now accessible at http://test.local:32779
INFO:clusterdock.topologies.cdh.cm:Detected CM API v13.
INFO:clusterdock.topologies.cdh.cm_utils:Updating database configurations...
INFO:clusterdock.topologies.cdh.cm:Updating NameNode references in Hive metastore...
```

dimaspivak commented 7 years ago

Unless it returns an error saying it timed out, it looks like it's still running... How long do you let it sit at that NameNode references step before giving up on it? Also, do you want to look in Cloudera Manager to see if the command is still running?
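(If a shell check is easier than clicking around the UI, the CM REST API can list a service's active commands. A sketch, assuming the default admin/admin credentials, the port from the log above, and the cluster/service names clusterdock uses:)

```bash
# Commands currently active on the Hive service; an empty "items" list
# means CM is not running anything for it right now.
curl -s -u admin:admin \
  'http://test.local:32779/api/v13/clusters/Cluster%201%20(clusterdock)/services/hive/commands'
```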

lc2a commented 7 years ago

http://10.120.1.14:32781/cmf/process/67/logs?filename=stdout.log

```
HTTP ERROR 502
Problem accessing /cmf/process/67/logs. Reason: BAD_GATEWAY
```

lc2a commented 7 years ago

Logged in to CM and aborted the Hive commands... then:

```
INFO:clusterdock.topologies.cdh.cm:Updating NameNode references in Hive metastore...
WARNING:clusterdock.topologies.cdh.cm:Failed to update NameNode references in Hive metastore (command returned: 'HiveUpdateLocationServiceCommand' (id: 231; active: False; success: False)).
INFO:clusterdock.topologies.cdh.actions:Removing service ks_indexer from Cluster 1 (clusterdock)...
INFO:clusterdock.topologies.cdh.actions:Removing service impala from Cluster 1 (clusterdock)...
INFO:clusterdock.topologies.cdh.actions:Removing service hbase from Cluster 1 (clusterdock)...
INFO:clusterdock.topologies.cdh.actions:Removing service solr from Cluster 1 (clusterdock)...
INFO:clusterdock.topologies.cdh.actions:Removing service spark_on_yarn from Cluster 1 (clusterdock)...
INFO:clusterdock.topologies.cdh.actions:Removing service oozie from Cluster 1 (clusterdock)...
INFO:clusterdock.topologies.cdh.actions:Removing service hue from Cluster 1 (clusterdock)...
INFO:clusterdock.topologies.cdh.actions:Removing service hive from Cluster 1 (clusterdock)...
INFO:clusterdock.topologies.cdh.actions:Deploying client configuration...
INFO:clusterdock.topologies.cdh.actions:We'd love to know what you think of our CDH topology for clusterdock! Please direct any feedback to our community forum at http://tiny.cloudera.com/hadoop-01-forum.
INFO:start_cluster:CDH cluster started in 15 min, 22 sec.
```

But when I start HDFS, I also get the same errors:

http://10.120.1.14:32781/cmf/process/68/logs?filename=stdout.log

```
HTTP ERROR 502
Problem accessing /cmf/process/68/logs. Reason: BAD_GATEWAY
Powered by Jetty://
```

lc2a commented 7 years ago

Reimaged to Ubuntu 14. No problems at all.

Thanks