bytedance / CloudShuffleService

Cloud Shuffle Service(CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.
Apache License 2.0
247 stars 57 forks

cannot run css cluster #5

Open Lobo2008 opened 1 year ago

Lobo2008 commented 1 year ago
  1. If I use all default settings and start via sbin/start-all.sh, a Worker and a Master come up, but when I submit a Spark app it throws: Caused by: java.lang.RuntimeException: replica num must less than worker num

  2. If I run in zk mode by changing conf/css-default.cnf to:

    css.zookeeper.address=MyZkIP:2181
    css.worker.registry.type=zookeeper

    and start via sbin/start-workers.sh, sbin/start-worker.sh, or sbin/start-all.sh, it throws:

    com.bytedance.css.service.deploy.worker.Worker --host xxx07v.xxxx.net
    failed to launch: nice -n 0 /yy/java8/bin/java -Xmx1024m -XX:MaxDirectMemorySize=4096m -Dcss.log.dir=/home/aa/css/logs -Dcss.log.filename=css-aa-worker-1.out -classpath /yyy/java8/lib:/home/aa/css/lib/* com.bytedance.css.service.deploy.worker.Worker --host  xxx07v.xxxx.net
    tail: cannot open '/home/aa/css/logs/css-aa-worker-1.out' for reading: No such file or directory
    full log in /home/aa/css/logs/css-aa-worker-1.out
  3. If I deploy per the README.md, set 3 workers in conf/workers plus zk mode, and run start-workers.sh, then after I enter the password for the 3 workers it returns permission denied. I am sure my password is correct.

Any suggestions? I think the README is ambiguous.

bdyx123 commented 1 year ago
  1. css push data uses two replicas, so at least two workers need to be started
  2. does the dir /home/aa/css/logs exist? or is it a directory permission issue?
  3. the machine that runs start-workers.sh must have passwordless SSH set up to all of the workers
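
The passwordless-SSH requirement in point 3 can be set up roughly like this (a sketch; the user name aa and the host names are placeholders from this thread, not verified values):

```shell
# On the node that will run sbin/start-workers.sh:
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa          # skip if a key already exists
for host in IP_A IP_B IP_C; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub aa@"$host"   # prompts for the password once per host
done
ssh aa@IP_A true                                  # should succeed with no password prompt
```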
Lobo2008 commented 1 year ago
  1. css push data uses two replicas, so at least two workers need to be started
  2. does the dir /home/aa/css/logs exist? or is it a directory permission issue?
  3. the machine that runs start-workers.sh must have passwordless SSH set up to all of the workers

There are no permission issues: with nothing changed, starting via start-all.sh produces the expected master and worker logs. Only when I switch to zk mode does it fail.

I have 3 nodes with IPs IP_A, IP_B, IP_C and want to use zk mode.

The install dir is then copied to the 3 nodes. How should I change the other settings and run the scripts to make them work?

I suppose running start-workers.sh on one of the 3 nodes should work: css will read the worker list and start all 3 workers. Or should I run start-worker.sh on every node so that each starts its own worker process (in that case, should the other 2 IPs be deleted from conf/workers)?

bdyx123 commented 1 year ago

yes, start-workers.sh should work. Have you tried starting it?
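
To make the zk-mode layout concrete, here is a minimal sketch under the assumptions from this thread (the same install dir copied to all three nodes, config values exactly as reported above):

```shell
# conf/workers on the node that runs sbin/start-workers.sh:
#   IP_A
#   IP_B
#   IP_C
#
# conf/css-default.cnf on every node (as reported above):
#   css.zookeeper.address=MyZkIP:2181
#   css.worker.registry.type=zookeeper

sbin/start-workers.sh   # connects to each host in conf/workers over ssh and launches a worker there
```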

a140262 commented 1 year ago

My CSS cluster is up and running with the zookeeper registry type in k8s now. Everything looks fine until I run a Spark app. The application log shows the same error message:

java.lang.RuntimeException: replica num must less than worker num

The state in my zookeeper is:

[zk: localhost:2181(CONNECTED) 4] ls /css/my2css/workers
[css-0:39477:32875:35149, css-0:41865:46557:46199, css-0:43149:36579:33897, css-0:46573:36469:44793, css-0:46679:46533:41791, css-1:35421:36815:43879, css-1:39127:39883:44297, css-1:42185:42751:44815, css-1:43769:41983:33951]

The environment variable is set: export CSS_WORKER_INSTANCES=2. Could you please let us know which configuration sets the replica number and which one determines the worker number? Is there anything else I have missed?
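
One thing that stands out in the zookeeper listing above: all nine registrations share only two distinct host tokens, css-0 and css-1. If the replica check counts distinct worker hosts, two replicas against two hosts would trip the strict "replica num must less than worker num" comparison. A quick sketch to count the distinct hosts (assuming the host is the token before the first ':' in each entry):

```shell
# Entries copied from the zookeeper listing above.
entries="css-0:39477:32875:35149 css-0:41865:46557:46199 css-0:43149:36579:33897 \
css-0:46573:36469:44793 css-0:46679:46533:41791 css-1:35421:36815:43879 \
css-1:39127:39883:44297 css-1:42185:42751:44815 css-1:43769:41983:33951"

# Strip everything after the first ':' and count unique hosts.
for e in $entries; do echo "${e%%:*}"; done | sort -u | wc -l
```

With this listing the count is 2, which may explain why a 2-replica push fails even though nine entries are registered (the extra entries look like stale registrations from restarted pods; this is an inference, not a confirmed CSS behavior).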