
Apache Fluo Muchos
https://fluo.apache.org
Apache License 2.0

Setup for existing cluster #388

Closed. Viv1986 closed this issue 3 years ago.

Viv1986 commented 3 years ago

Hello,

Could you explain how to set up the config files for an existing cluster? Where should I specify the IPs and their roles?

e.g. cluster_type = existing

Viv1986 commented 3 years ago

@karthick-rn please help

arvindshmicrosoft commented 3 years ago

Not totally sure about your context, but assuming you have a set of existing nodes where you want to set up Accumulo and related components, the details about those nodes go into two places: the hosts file under conf/hosts/ (which maps each hostname to its IP) and the [nodes] section of muchos.props (which assigns roles to each host).
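
For illustration, a minimal sketch of those two places for cluster_type = existing; the hostnames, IPs, and roles below are placeholders, not values from this issue:

# conf/hosts/<cluster_name>: one host per line, hostname followed by private IP
leader1 10.0.0.10
worker1 10.0.0.20

# [nodes] section in muchos.props: role assignments per host
[nodes]
leader1 = namenode,resourcemanager,accumulomaster,zookeeper
worker1 = worker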

Also, I'm open to your suggestions on how we can improve the README to make this clearer.

Viv1986 commented 3 years ago

OK, it installed fine and no further intervention seems needed. Could you please clarify one thing: I somehow ended up with 12 tservers per server, and I can't find where that value is specified on the end system. How do I set it back to 1? I also set up systemd scripts for starting the service, and it seems to be started via /etc/systemd/system/accumulo-tserver@.service.

arvindshmicrosoft commented 3 years ago

Did you use the Muchos-provided option to launch systemd services, or did you implement your own? If you used Muchos' implementation, then the num_tservers setting controls how many tservers run per node.

karthick-rn commented 3 years ago

@Viv1986 By default, Muchos will set up 1 tablet server per worker host; it's not clear how you ended up with 12 tablet servers per host! Assuming this is a dev/test cluster and the data can be regenerated, I'd suggest making sure the following values are set in the muchos.props file: num_tservers = 1 and use_systemd = True. Then either create a new cluster, or run bin/muchos wipe -c <existingclustername> and re-run the setup. Let us know how it goes.
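
For reference, the relevant muchos.props lines would look like this (settings as named above):

num_tservers = 1
use_systemd = True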

Viv1986 commented 3 years ago

@karthick-rn @arvindshmicrosoft OK, wipe doesn't seem to wipe everything; it only kills part of the cluster. In any case, my current config is 3 masters and 12 workers, it's a test cluster, and what I need is HDFS and Accumulo in HA mode. When I use

hdfs_ha = True

I always get the error "HA is not enabled for this namenode." during zkfc init. How do I fix it? Here is my current config:

leader1 = namenode,resourcemanager,accumulomaster,zookeeper,zkfc,journalnode
leader2 = metrics,zookeeper,resourcemanager,zkfc,journalnode
leader3 = zookeeper,resourcemanager,zkfc,journalnode
worker1 = worker,swarmmanager
worker2 = worker
worker3 = worker
worker4 = worker
worker5 = worker
worker6 = worker
worker7 = worker
worker8 = worker
worker9 = worker
worker10 = worker
worker11 = worker
worker12 = worker

karthick-rn commented 3 years ago

@Viv1986 If you're referring to the Accumulo systemd units that were not wiped, I can see why, and I'll fix this; the other services should be removed successfully, so let us know if not. Regarding HA, you have configured only one namenode. Try setting it as shown below; ideally, you should have namenode and zkfc on the same hosts.

leader1 = namenode,resourcemanager,accumulomaster,zookeeper,journalnode,zkfc
leader2 = zookeeper,journalnode,namenode,zkfc,accumulomaster,resourcemanager
leader3 = journalnode,zookeeper
worker1 = worker
...
...
worker12 = worker

Viv1986 commented 3 years ago

OK, it's working well now. The only remaining bug is that the tservers don't start after stopping/starting Accumulo via accumulo-cluster start or accumulo-cluster tserver-start; I start them directly through /etc/systemd/system/accumulo-tserver@.service. Also, ZooKeeper on the masters has to be started manually on each master; it should have central control like Accumulo's. And the last thing left: does Muchos have an option to use an existing storage account for Hadoop/Accumulo?

karthick-rn commented 3 years ago

Actually, the systemd commands were added to the accumulo-cluster script as a convenience to handle start/stop of services in the cluster; they were not part of the original source. As the problem is only with tserver restarts, maybe something is missing in the script, so I'll look into that. For now, you can use something like sudo systemctl <start/stop> accumulo-tserver@1.service to start/stop a tserver. On the ZK point, I understand; we are trying to minimise changes to the original scripts as they add maintenance overhead. Instead of manually ssh'ing into each node, you can do the below.

for host in leader1 leader2 leader3; do
  echo "$host"
  ssh "$host" 'sh -c "zkServer.sh start"'
done

Does Muchos have an option to use an existing storage account for Hadoop/Accumulo?

If you're referring to an ADLS Gen2 storage account, then you'll have to update the required ADLS Gen2 fields in muchos.props with the values from the existing storage account, rather than leaving them at their defaults.
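
For illustration, these are the ADLS Gen2 fields in muchos.props that would need values from the existing account (the field names appear later in this thread; the values here are placeholders, and the Muchos README documents the exact formats):

instance_volumes_input = abfss://mycontainer@mystorageaccount.dfs.core.windows.net
user_assigned_identity = <resource ID of the managed identity>
azure_tenant_id = <tenant-id>
azure_client_id = <client-id>
principal_id = <principal-id>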

Viv1986 commented 3 years ago

@karthick-rn it started 12 per instance again for no apparent reason; the problem seems to be in the scripts:

NUM_TSERVERS=$(grep -E -c -v '(^#|^\s*$)' "$TSERVERS")
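
That grep counts every line of the file $TSERVERS points at that is neither a comment nor blank, so a file listing all 12 workers yields NUM_TSERVERS=12 on every node. A hypothetical illustration (the file contents below are assumed, not taken from this cluster):

# if the file $TSERVERS points at contains:
#   worker1
#   worker2
#   ...
#   worker12
# then grep -E -c -v '(^#|^\s*$)' counts 12 non-comment, non-blank lines,
# and each host would start accumulo-tserver@1 through accumulo-tserver@12.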

accumulo-cluster start
Starting tablet servers ............... done
accumulo-tserver@1.service   loaded active running          TServer Service for Accumulo
accumulo-tserver@10.service  loaded active running          TServer Service for Accumulo
accumulo-tserver@11.service  loaded active running          TServer Service for Accumulo
accumulo-tserver@12.service  loaded active running          TServer Service for Accumulo
accumulo-tserver@2.service   loaded activating auto-restart TServer Service for Accumulo
accumulo-tserver@3.service   loaded active running          TServer Service for Accumulo
accumulo-tserver@4.service   loaded active running          TServer Service for Accumulo
accumulo-tserver@5.service   loaded active running          TServer Service for Accumulo
accumulo-tserver@6.service   loaded active running          TServer Service for Accumulo
accumulo-tserver@7.service   loaded active running          TServer Service for Accumulo
accumulo-tserver@8.service   loaded active running          TServer Service for Accumulo
accumulo-tserver@9.service   loaded active running          TServer Service for Accumulo
(the same 12 units are then listed again for each remaining host; the paste is truncated mid-listing)

and accumulo-cluster doesn't stop them either; only manual sudo systemctl <start/stop> accumulo-tserver@1.service works

[evoamsadm@worker12 ~]$ ps aux | grep tser
evoamsa+ 23347  4.0 0.2 5934088 328632 ? Ssl 10:50 0:05 /usr/lib/jvm/java/bin/java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -XX:-OmitStackTraceInFastThrow -Djava.net.preferIPv4Stack=true -Daccumulo.native.lib.path=/home/evoamsadm/install/accumulo-2.0.1/lib/native -Xmx4G -Xms4G -Daccumulo.log.dir=/data1/logs/accumulo -Daccumulo.application=tserver5_worker12 -Dlog4j.configuration=log4j-service.properties org.apache.accumulo.start.Main tserver
evoamsa+ 23439  4.4 0.2 ... -Daccumulo.application=tserver11_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 23513  4.5 0.2 ... -Daccumulo.application=tserver3_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 23603  5.1 0.2 ... -Daccumulo.application=tserver12_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 23684  5.1 0.2 ... -Daccumulo.application=tserver4_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 23746  5.4 0.2 ... -Daccumulo.application=tserver8_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 23840  5.4 0.2 ... -Daccumulo.application=tserver9_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 23939  7.2 0.2 ... -Daccumulo.application=tserver10_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 24049 10.4 0.2 ... -Daccumulo.application=tserver2_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 24128 12.5 0.2 ... -Daccumulo.application=tserver7_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 24202 12.7 0.2 ... -Daccumulo.application=tserver6_worker12 ... org.apache.accumulo.start.Main tserver
evoamsa+ 24525 42.8 0.2 ... -Daccumulo.application=tserver1_worker12 ... org.apache.accumulo.start.Main tserver
(identical JVM flags elided from all but the first line: twelve tserver processes, tserver1 through tserver12, all running on worker12)

karthick-rn commented 3 years ago

NUM_TSERVERS=$(grep -E -c -v '(^#|^\s*$)' "$TSERVERS")

Which script did you get this line from? I don't see this line in the accumulo-cluster script at all. Also, I have a similar setup with systemd running 1 tserver per worker (4 in total) and I don't see this problem.

[user1@host1 ~]$ accumulo-cluster start
Starting tablet servers ....... done
accumulo-tserver@1.service loaded active running TServer Service for Accumulo
accumulo-tserver@1.service loaded active running TServer Service for Accumulo
accumulo-tserver@1.service loaded active running TServer Service for Accumulo
accumulo-tserver@1.service loaded active running TServer Service for Accumulo

[user1@host3 ~]$ ps aux | grep tserver
user1+ 29585 6.8 1.0 3738804 349168 ? Ssl 11:36 0:04 /usr/lib/jvm/java/bin/java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -XX:-OmitStackTraceInFastThrow -Djava.net.preferIPv4Stack=true -Daccumulo.native.lib.path=/home/user1/install/accumulo-2.0.1/lib/native -Xmx2G -Xms2G -Daccumulo.log.dir=/var/data1/logs/accumulo -Daccumulo.application=tserver1_host3 -Dlog4j.configuration=log4j-service.properties org.apache.accumulo.start.Main tserver

Viv1986 commented 3 years ago

install/accumulo-2.0.1/bin/accumulo-util

karthick-rn commented 3 years ago

The accumulo-util script is different; I don't think it is involved in the start/stop of services.

Viv1986 commented 3 years ago

The NUM_TSERVERS initialization comes from that file, so I think it is involved, because if I correct it to

NUM_TSERVERS=1

the problem is fixed.

Viv1986 commented 3 years ago

@karthick-rn @arvindshmicrosoft OK, where should I put the account key for the storage account?

instance_volumes_input = abfss://xxxx-test@xxxxstorage.blob.core.windows.net
instance_volumes_adls =
adls_storage_type = Standard_LRS
user_assigned_identity =
azure_tenant_id =
azure_client_id =
principal_id =

karthick-rn commented 3 years ago

instance_volumes_input = abfss://xxxx-test@xxxxstorage.blob.core.windows.net

The correct endpoint for an ADLS Gen2 URI is dfs.core.windows.net, not blob.core.windows.net. If this is a Blob storage account, I doubt it will work, as Muchos currently only supports ADLS Gen2.
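
With the corrected endpoint, the value above would become (container and account names kept as the placeholders from your config):

instance_volumes_input = abfss://xxxx-test@xxxxstorage.dfs.core.windows.net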

Viv1986 commented 3 years ago

OK, but it's protected by an access key. Where should I put that key?

Viv1986 commented 3 years ago

@karthick-rn @arvindshmicrosoft Can you help me with the access key? Where should I put it?

karthick-rn commented 3 years ago

@karthick-rn @arvindshmicrosoft Can you help me with the access key? Where should I put it?

In Muchos, authentication for ADLS Gen2 is done via a User Assigned Managed Identity. If you have an access key, we don't currently support that; however, you can create a user assigned managed identity, add it to the storage account, assign it the Storage Blob Data Owner role [1], and update the user_assigned_identity value in the muchos.props file. The launch step will then take care of adding the identity to all the hosts in the VM scale sets. The link below will help you with the creation of the managed identity.

[1] https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-manage-ua-identity-portal
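
For reference, a rough sketch of the equivalent Azure CLI steps (all resource names below are placeholders; the portal walkthrough in the link covers the same ground):

# create a user assigned managed identity in your resource group
az identity create --resource-group myResourceGroup --name muchos-identity

# grant it Storage Blob Data Owner on the storage account
az role assignment create \
  --assignee <principal-id-of-the-identity> \
  --role "Storage Blob Data Owner" \
  --scope /subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Storage/storageAccounts/mystorageaccount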

Viv1986 commented 3 years ago

@karthick-rn actually it works if I follow https://accumulo.apache.org/blog/2019/10/15/accumulo-adlsgen2-notes.html and https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html#Default:_Shared_Key
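
Concretely, the Shared Key approach from the second link boils down to core-site.xml properties like the following (the account name is a placeholder; this is configured outside of Muchos):

<property>
  <name>fs.azure.account.auth.type.mystorageaccount.dfs.core.windows.net</name>
  <value>SharedKey</value>
</property>
<property>
  <name>fs.azure.account.key.mystorageaccount.dfs.core.windows.net</name>
  <value>base64-encoded-account-key</value>
</property>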

karthick-rn commented 3 years ago

@Viv1986 Good that you got it working. The point I was emphasising is more from a Muchos standpoint and what it currently supports.

arvindshmicrosoft commented 3 years ago

@Viv1986 - would you mind if we close this? It looks like you were able to address the original issue in this thread. Correct?

Viv1986 commented 3 years ago

yeap