d2iq-archive / dcos-flink-service


Task Manager Mesos task fails to run when using a custom docker image with either the mesos or the docker containerizers #54

Open asicoe opened 6 years ago

asicoe commented 6 years ago

Using DCOS 1.10 and dcos-flink-service 1.3.1-1.2.1

The TaskManager task fails to start when setting either:

-Dmesos.resourcemanager.tasks.container.type=docker -Dmesos.resourcemanager.tasks.container.image.name=custom_image:tag

or

-Dmesos.resourcemanager.tasks.container.type=mesos -Dmesos.resourcemanager.tasks.container.image.name=custom_image:tag

Job Manager std out:

INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-01513 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED (Container exited with status 127)
INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-01513 in state TASK_FAILED : reason=REASON_COMMAND_EXECUTOR_FAILED message=Container exited with status 127

Task manager std out:

I0424 17:43:40.646013 17393 fetcher.cpp:533] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/root","items":[{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c459-stop-zooke_-quorum.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/stop-zookeeper-quorum.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/stop-zookeeper-quorum.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c464-log4j-cli.properties","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j-cli.properties","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/log4j-cli.properties"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c470-logback.xml","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/logback.xml","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/logback.xml"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c480-flink-metr_-1.3.2.jar","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/flink-metrics-statsd-1.3.2.jar","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/lib\/flink-metrics-statsd-1.3.2.jar"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c462-zookeeper.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/zookeeper.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/zookeeper.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c460-taskmanager.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/taskmanager.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/taskmanager.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c449-mesos-taskmanager.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/mesos-taskmanager.sh","value":"http:\
/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/mesos-taskmanager.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c458-stop-local.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/stop-local.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/stop-local.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c475-flink-dist_-1.3.2.jar","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/flink-dist_2.11-1.3.2.jar","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/lib\/flink-dist_2.11-1.3.2.jar"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c457-stop-cluster.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/stop-cluster.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/stop-cluster.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c476-flink-pyth_-1.3.2.jar","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/flink-python_2.11-1.3.2.jar","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/lib\/flink-python_2.11-1.3.2.jar"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c456-start-zook_-quorum.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-zookeeper-quorum.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/start-zookeeper-quorum.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c455-start-scala-shell.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-scala-shell.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/start-scala-shell.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c453-start-local.bat","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-local.bat","value":"http
:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/start-local.bat"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c446-historyserver.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/historyserver.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/historyserver.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c452-start-cluster.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-cluster.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/start-cluster.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c471-masters","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/masters","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/masters"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c442-flink","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/flink","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/flink"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c465-log4j-cons_properties","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j-console.properties","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/log4j-console.properties"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c467-log4j.properties","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j.properties","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/log4j.properties"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c468-logback-console.xml","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/logback-console.xml","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/logback-console.xml"}},{"ac
tion":"RETRIEVE_FROM_CACHE","cache_filename":"c469-logback-yarn.xml","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/logback-yarn.xml","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/logback-yarn.xml"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c466-log4j-yarn_properties","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j-yarn-session.properties","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/log4j-yarn-session.properties"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c474-log4j.propertiese","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j.propertiese","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/log4j.propertiese"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c472-slaves","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/slaves","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/slaves"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c478-log4j-1.2.17.jar","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/log4j-1.2.17.jar","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/lib\/log4j-1.2.17.jar"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c473-zoo.cfg","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/zoo.cfg","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/zoo.cfg"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c479-slf4j-log4_-1.7.7.jar","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/slf4j-log4j12-1.7.7.jar","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/lib\/slf4j-log4j12-1.7.7.jar"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c463-flink-conf.
yaml","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/flink-conf.yaml","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/conf\/flink-conf.yaml"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c445-flink.bat","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/flink.bat","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/flink.bat"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c450-pyflink.bat","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/pyflink.bat","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/pyflink.bat"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c441-config.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/config.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/config.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c447-jobmanager.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/jobmanager.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/jobmanager.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c477-flink-shad_-1.3.2.jar","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/flink-shaded-hadoop2-uber-1.3.2.jar","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/lib\/flink-shaded-hadoop2-uber-1.3.2.jar"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c448-mesos-appmaster.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/mesos-appmaster.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/mesos-appmaster.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c443-flink-console.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/fli
nk-console.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/flink-console.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c461-yarn-session.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/yarn-session.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/yarn-session.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c444-flink-daemon.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/flink-daemon.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/flink-daemon.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c451-pyflink.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/pyflink.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/pyflink.sh"}},{"action":"RETRIEVE_FROM_CACHE","cache_filename":"c454-start-local.sh","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-local.sh","value":"http:\/\/my-host:32887\/2a481d2e-3130-4386-8b6f-176247f8ba92\/flink\/bin\/start-local.sh"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/01354a28-9c65-4f29-8218-b4e5401d5801-S2\/frameworks\/6ab6b470-559c-4aab-8a23-1083cc7ca62c-0000\/executors\/taskmanager-00086\/runs\/e4819666-63fe-4a32-9356-34c7909d093c","user":"root"}
I0424 17:43:40.652581 17393 fetcher.cpp:444] Fetching URI 'http://my-host:32887/2a481d2e-3130-4386-8b6f-176247f8ba92/flink/bin/stop-zookeeper-quorum.sh'
I0424 17:43:40.652626 17393 fetcher.cpp:341] Fetching from cache
I0424 17:43:40.655618 17393 fetcher.cpp:207] Copied resource '/tmp/mesos/fetch/root/c459-stop-zooke_-quorum.sh' to '/var/lib/mesos/slave/slaves/01354a28-9c65-4f29-8218-b4e5401d5801-S2/frameworks/6ab6b470-559c-4aab-8a23-1083cc7ca62c-0000/executors/taskmanager-00086/runs/e4819666-63fe-4a32-9356-34c7909d093c/flink/bin/stop-zookeeper-quorum.sh'
I0424 17:43:40.655681 17393 fetcher.cpp:582] Fetched 'http://my-host:32887/2a481d2e-3130-4386-8b6f-176247f8ba92/flink/bin/stop-zookeeper-quorum.sh' to '/var/lib/mesos/slave/slaves/01354a28-9c65-4f29-8218-b4e5401d5801-S2/frameworks/6ab6b470-559c-4aab-8a23-1083cc7ca62c-0000/executors/taskmanager-00086/runs/e4819666-63fe-4a32-9356-34c7909d093c/flink/bin/stop-zookeeper-quorum.sh'
...
I0424 17:43:40.852995 17393 fetcher.cpp:341] Fetching from cache
I0424 17:43:40.855098 17393 fetcher.cpp:207] Copied resource '/tmp/mesos/fetch/root/c451-pyflink.sh' to '/var/lib/mesos/slave/slaves/01354a28-9c65-4f29-8218-b4e5401d5801-S2/frameworks/6ab6b470-559c-4aab-8a23-1083cc7ca62c-0000/executors/taskmanager-00086/runs/e4819666-63fe-4a32-9356-34c7909d093c/flink/bin/pyflink.sh'
I0424 17:43:40.855152 17393 fetcher.cpp:582] Fetched 'http://my-host:32887/2a481d2e-3130-4386-8b6f-176247f8ba92/flink/bin/pyflink.sh' to '/var/lib/mesos/slave/slaves/01354a28-9c65-4f29-8218-b4e5401d5801-S2/frameworks/6ab6b470-559c-4aab-8a23-1083cc7ca62c-0000/executors/taskmanager-00086/runs/e4819666-63fe-4a32-9356-34c7909d093c/flink/bin/pyflink.sh'
I0424 17:43:40.855163 17393 fetcher.cpp:444] Fetching URI 'http://my-host:32887/2a481d2e-3130-4386-8b6f-176247f8ba92/flink/bin/start-local.sh'
I0424 17:43:40.855175 17393 fetcher.cpp:341] Fetching from cache
I0424 17:43:40.857486 17393 fetcher.cpp:207] Copied resource '/tmp/mesos/fetch/root/c454-start-local.sh' to '/var/lib/mesos/slave/slaves/01354a28-9c65-4f29-8218-b4e5401d5801-S2/frameworks/6ab6b470-559c-4aab-8a23-1083cc7ca62c-0000/executors/taskmanager-00086/runs/e4819666-63fe-4a32-9356-34c7909d093c/flink/bin/start-local.sh'
I0424 17:43:40.857542 17393 fetcher.cpp:582] Fetched 'http://my-host:32887/2a481d2e-3130-4386-8b6f-176247f8ba92/flink/bin/start-local.sh' to '/var/lib/mesos/slave/slaves/01354a28-9c65-4f29-8218-b4e5401d5801-S2/frameworks/6ab6b470-559c-4aab-8a23-1083cc7ca62c-0000/executors/taskmanager-00086/runs/e4819666-63fe-4a32-9356-34c7909d093c/flink/bin/start-local.sh'
I0424 17:43:41.259465 17572 exec.cpp:162] Version: 1.4.2
I0424 17:43:41.264185 17595 exec.cpp:236] Executor registered on agent 01354a28-9c65-4f29-8218-b4e5401d5801-S2
I0424 17:43:41.265388 17604 executor.cpp:120] Registered docker executor on 10.1.10.19
I0424 17:43:41.265950 17597 executor.cpp:160] Starting task taskmanager-00086
/bin/sh: 1: flink/bin/mesos-taskmanager.sh: not found
I0424 17:43:43.164495 17605 process.cpp:1068] Failed to accept socket: future discarded

Same observation as above: the fetcher copies the files from /tmp into the sandbox, but those files do not seem to be mounted in the Docker container, and the root path does not seem to be used correctly.

This is the same problem as the closed issue "/bin/sh: 1: flink/bin/mesos-taskmanager.sh: not found".

I am reopening it here to see if there is any workaround.

The command actually run is $FLINK_HOME/bin/mesos-taskmanager.sh, and it seems $FLINK_HOME is set to "flink". So the path appears to be intended as relative to some working directory, and that working directory is not set correctly.

asicoe commented 6 years ago

So a workaround is to set the Java property: mesos.resourcemanager.tasks.bootstrap-cmd=FLINK_HOME=/mnt/mesos/sandbox/flink
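
To illustrate why this helps, here is a minimal Python sketch. It assumes the bootstrap command and the launch command run in the same shell, joined with `&&`; that is a guess about how the launch command is assembled, not something confirmed in this thread. The helper `expanded_launch_path` is hypothetical and only echoes the path the shell would try to execute, it does not start anything.

```python
import os
import subprocess

# Launch command as reported in this issue; FLINK_HOME defaults to "flink".
LAUNCH_CMD = "$FLINK_HOME/bin/mesos-taskmanager.sh"


def expanded_launch_path(bootstrap_cmd=None):
    """Return the path the shell would try to execute, without running it.

    ASSUMPTION: the bootstrap command and the launch command share one shell
    and are joined with '&&'. This is illustrative, not Flink's actual code.
    """
    command = f'echo "{LAUNCH_CMD}"'
    if bootstrap_cmd:
        command = f"{bootstrap_cmd} && {command}"
    # Start from the default value observed in the issue.
    env = dict(os.environ, FLINK_HOME="flink")
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(expanded_launch_path())
    print(expanded_launch_path("FLINK_HOME=/mnt/mesos/sandbox/flink"))
```

Under that assumption, the override turns the relative launch path into an absolute one, so it no longer depends on the container's working directory.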

EronWright commented 6 years ago

I suspect that the root cause is that the docker image has set a working directory, but Flink assumes that it is being launched from the sandbox directory. Usually that assumption is correct.

FLINK_HOME is set to flink because the entire Flink distribution is automatically copied into the sandbox directory; note the "fetching" lines in the log above.
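
The effect of the working directory can be sketched in a few lines of Python. This is only an illustration of relative path resolution, not Flink or Mesos code; the `sandbox` and `app` directories are hypothetical stand-ins for the Mesos sandbox and for a working directory set by the Docker image.

```python
import tempfile
from pathlib import Path

# Relative launch path from the task log ($FLINK_HOME expands to "flink").
LAUNCH_PATH = Path("flink/bin/mesos-taskmanager.sh")


def resolve_launch_script(working_dir, launch_path=LAUNCH_PATH):
    """Resolve a path as a shell would: relative paths are joined to the cwd."""
    return launch_path if launch_path.is_absolute() else working_dir / launch_path


def script_found(working_dir, launch_path=LAUNCH_PATH):
    """True if the launch script exists when looked up from working_dir."""
    return resolve_launch_script(working_dir, launch_path).is_file()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sandbox = root / "sandbox"    # hypothetical stand-in for the Mesos sandbox
        image_workdir = root / "app"  # hypothetical working directory set by the image
        (sandbox / LAUNCH_PATH.parent).mkdir(parents=True)
        image_workdir.mkdir()
        # Mimic the fetcher copying the script into the sandbox.
        (sandbox / LAUNCH_PATH).write_text("#!/bin/sh\n")
        return {
            "from_sandbox": script_found(sandbox),
            "from_image_workdir": script_found(image_workdir),
            "absolute_path": script_found(image_workdir, sandbox / LAUNCH_PATH),
        }


if __name__ == "__main__":
    print(demo())
```

The relative path is only found when the lookup starts in the sandbox; from any other working directory it is missing, which matches the "not found" error in the task log. An absolute path is found regardless of the working directory.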

I would characterize this issue as a feature request to support an arbitrary working directory.

asicoe commented 6 years ago

I'm guessing the feature request should be filed against the Flink project rather than this one, correct?