related to #23
downloading the job jar from the jupyter terminal through jupyter.idigbio.org shows download speeds of ~100-200 KB/s. This seems unexpectedly slow. See screenshot.
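For reference, one way to measure the throughput from the terminal (a minimal sketch; the S3 jar URL is the one that appears in the logs later in this thread):

# download to /dev/null and report size and average speed
curl -o /dev/null -w 'downloaded %{size_download} bytes at %{speed_download} bytes/s\n' \
  https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar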
@godfoder @mjcollin I've tried putting the job jars into HDFS and having Mesos grab them from there... but alas:
I1114 17:29:20.400023 5195 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S8\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"hdfs:\/\/\/guoda\/lib\/iDigBio-LD-assembly-1.5.8.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S8\/frameworks\/8d894434-c261-4d4f-9595-0ae41bac67eb-0345\/executors\/driver-20171114172920-58390\/runs\/9a97a40b-84b2-4e3c-afb7-fa42d13073b0","user":"root"}
I1114 17:29:20.402480 5195 fetcher.cpp:409] Fetching URI 'hdfs:///guoda/lib/iDigBio-LD-assembly-1.5.8.jar'
I1114 17:29:20.402496 5195 fetcher.cpp:250] Fetching directly into the sandbox directory
I1114 17:29:20.402516 5195 fetcher.cpp:187] Fetching URI 'hdfs:///guoda/lib/iDigBio-LD-assembly-1.5.8.jar'
E1114 17:29:20.704540 5195 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output:
/usr/bin/hadoop: line 27: /usr/bin/../libexec/hadoop-config.sh: No such file or directory
/usr/bin/hadoop: line 166: exec: : not found
Failed to fetch 'hdfs:///guoda/lib/iDigBio-LD-assembly-1.5.8.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was either not found or exited with a non-zero exit status: 127
Failed to synchronize with agent (it's probably exited)
any ideas?
Yes, the hadoop script expects an environment variable containing part of the program path to be set, and it handles an empty value poorly.
Source /etc/profile.d/hadoop.sh if possible; if not, here are the variables to set:
mjcollin@idb-jupyter1:~$ cat /etc/profile.d/hadoop.sh
export HADOOP_PREFIX=/usr/lib/hadoop
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
# The hadoop shell script is foolish about detecting this dir, set it manually
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
#Local GUODA settings
export HADOOP_USER_NAME=hdfs
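As a sanity check (a minimal sketch, assuming a shell on one of the agents), the exact command the fetcher failed on should succeed once those variables are exported:

# source the profile snippet shown above, then re-run the command from the
# fetcher error; it should print a Hadoop version instead of exiting 127
source /etc/profile.d/hadoop.sh
hadoop version 2>&1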
Thanks for the info. These are the settings I use for the job to talk to HDFS after it has been launched (https://github.com/bio-guoda/effechecka/blob/56a09acf5352476fa4fa73503665fb0f7df455a0/src/main/scala/effechecka/SparkSubmitter.scala#L80). However, this particular issue needs the variables to be available prior to launching the Spark job, in the "bare" Mesos container on a Mesos agent. I am assuming that @mjcollin @godfoder control these environment variables.
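For illustration, a hedged sketch of such submit-time settings using Spark's spark.mesos.driverEnv.* properties; the master URL and main class below are placeholders, not values taken from SparkSubmitter.scala:

spark-submit \
  --master mesos://zk://mesos01:2181,mesos02:2181,mesos03:2181/mesos \
  --class some.example.MainClass \
  --conf spark.mesos.driverEnv.HADOOP_CONF_DIR=/etc/hadoop/conf \
  --conf spark.mesos.driverEnv.HADOOP_USER_NAME=hdfs \
  hdfs:///guoda/lib/iDigBio-LD-assembly-1.5.8.jar
# note: these variables reach the driver JVM only after the Mesos fetcher
# has already run, so they cannot help the fetcher download the jar itself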
For some reason, the downloads from Amazon S3 are now functioning again.
Was there some ACIS/UFL network issue?
Not that I know of. It seems to me that this has happened before, though: I remember looking at logs from a previous time when jobs were not finishing, and the (I think) wget line was the last line logged when starting up the job. I had no trouble getting the jar on my desktop, but I could see a number of attempted job starts on the cluster where the download seemed to fail.
Hmm. Perhaps there's some throttling happening on the UFL or S3 side.
Any chance you can set the HDFS environment variables on the Mesos agent machines? I don't think I can set these properties otherwise.
For what the data is worth: I was able to upload a fresh copy of iNat this morning on the second try, and unzip it. I believe two attempts at makeParquet have failed.
I can confirm that jobs are again stalling due to apparent download issues.
@mjcollin @godfoder please configure the Mesos agents for HDFS access.
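For example (a sketch only; the systemd unit name and paths are assumptions about the agents, not confirmed in this thread), a drop-in that exposes the Hadoop variables to the mesos-slave process, and hence to the fetcher it forks:

sudo mkdir -p /etc/systemd/system/mesos-slave.service.d
sudo tee /etc/systemd/system/mesos-slave.service.d/hadoop-env.conf <<'EOF'
[Service]
Environment=HADOOP_PREFIX=/usr/lib/hadoop
Environment=HADOOP_HOME=/usr/lib/hadoop
Environment=HADOOP_CONF_DIR=/etc/hadoop/conf
Environment=HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
Environment=HADOOP_USER_NAME=hdfs
EOF
sudo systemctl daemon-reload && sudo systemctl restart mesos-slave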
For some miraculous reason, the most recent checklist job is running ok. Any idea what changed in the last 10 minutes?
last known fail:
I1117 14:46:29.152925 27876 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S0\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"https:\/\/s3-us-west-2.amazonaws.com\/guoda\/idigbio-spark\/iDigBio-LD-assembly-1.5.8.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S0\/frameworks\/8d894434-c261-4d4f-9595-0ae41bac67eb-0345\/executors\/driver-20171117144629-91700\/runs\/e9ee7ae9-2a7e-4767-81f7-2cbb700a3104","user":"root"}
I1117 14:46:29.155643 27876 fetcher.cpp:409] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:46:29.155660 27876 fetcher.cpp:250] Fetching directly into the sandbox directory
I1117 14:46:29.155689 27876 fetcher.cpp:187] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:46:29.155711 27876 fetcher.cpp:134] Downloading resource from 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar' to '/var/lib/mesos/slaves/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S0/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171117144629-91700/runs/e9ee7ae9-2a7e-4767-81f7-2cbb700a3104/iDigBio-LD-assembly-1.5.8.jar'
current success:
[...]
I1117 14:51:31.551918 26058 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/ea73f349-1248-4417-85dd-844276310f46-S0\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"https:\/\/s3-us-west-2.amazonaws.com\/guoda\/idigbio-spark\/iDigBio-LD-assembly-1.5.8.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/ea73f349-1248-4417-85dd-844276310f46-S0\/frameworks\/8d894434-c261-4d4f-9595-0ae41bac67eb-0345\/executors\/driver-20171117145131-91701\/runs\/572f634b-cf59-4f8f-b094-4d9aa0ae6513","user":"root"}
I1117 14:51:31.554407 26058 fetcher.cpp:409] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:51:31.554424 26058 fetcher.cpp:250] Fetching directly into the sandbox directory
I1117 14:51:31.554445 26058 fetcher.cpp:187] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:51:31.554460 26058 fetcher.cpp:134] Downloading resource from 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar' to '/var/lib/mesos/slaves/ea73f349-1248-4417-85dd-844276310f46-S0/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171117145131-91701/runs/572f634b-cf59-4f8f-b094-4d9aa0ae6513/iDigBio-LD-assembly-1.5.8.jar'
W1117 14:52:19.419237 26058 fetcher.cpp:289] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar
I1117 14:52:19.419298 26058 fetcher.cpp:547] Fetched 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar' to '/var/lib/mesos/slaves/ea73f349-1248-4417-85dd-844276310f46-S0/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171117145131-91701/runs/572f634b-cf59-4f8f-b094-4d9aa0ae6513/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:52:19.549391 26068 exec.cpp:161] Version: 1.0.0
I1117 14:52:19.550938 26071 exec.cpp:236] Executor registered on agent ea73f349-1248-4417-85dd-844276310f46-S0
Warning: Ignoring non-spark config property: _spark.eventLog.enabled=true
17/11/17 14:52:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@730: Client environment:host.name=mesos01
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@738: Client environment:os.arch=4.4.0-87-generic
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@739: Client environment:os.version=#110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@767: Client environment:user.dir=/var/lib/mesos/slaves/ea73f349-1248-4417-85dd-844276310f46-S0/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171117145131-91701/runs/572f634b-cf59-4f8f-b094-4d9aa0ae6513
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=mesos02:2181,mesos01:2181,mesos03:2181 sessionTimeout=10000 watcher=0x7ff1a2697400 sessionId=0 sessionPasswd=<null> context=0x7ff054000998 flags=0
2017-11-17 14:52:24,083:26086(0x7ff0923e7700):ZOO_INFO@check_events@1728: initiated connection to server [10.13.44.15:2181]
I1117 14:52:24.102135 26220 sched.cpp:226] Version: 1.0.0
2017-11-17 14:52:24,118:26086(0x7ff0923e7700):ZOO_INFO@check_events@1775: session establishment complete on server [10.13.44.15:2181], sessionId=0x15f9d9055370068, negotiated timeout=10000
I1117 14:52:24.119119 26217 group.cpp:349] Group process (group(1)@10.13.44.15:45886) connected to ZooKeeper
I1117 14:52:24.119171 26217 group.cpp:837] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1117 14:52:24.119189 26217 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I1117 14:52:24.120016 26215 detector.cpp:152] Detected a new leader: (id='1754')
I1117 14:52:24.120133 26206 group.cpp:706] Trying to get '/mesos/json.info_0000001754' in ZooKeeper
I1117 14:52:24.120646 26210 zookeeper.cpp:259] A new leading master (UPID=master@10.13.44.18:5050) is detected
I1117 14:52:24.120724 26208 sched.cpp:330] New master detected at master@10.13.44.18:5050
I1117 14:52:24.120877 26208 sched.cpp:341] No credentials provided. Attempting to register without authentication
I1117 14:52:24.122133 26209 sched.cpp:743] Framework registered with 8d894434-c261-4d4f-9595-0ae41bac67eb-0345-driver-20171117145131-91701
17/11/17 14:52:24 WARN MesosCoarseGrainedSchedulerBackend: Unable to parse into a key:value label for the task.
17/11/17 14:52:24 WARN MesosCoarseGrainedSchedulerBackend: Unable to parse into a key:value label for the task.
17/11/17 14:52:47 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
[Stage 2:> (0 + 0) / 2232]
[...]
update: I got three successful resource file updates today for fresh data. Feeling bold, I attempted an UpdateMonitors, which failed. I'll try that again tonight, hoping cluster traffic is a factor and perhaps it'll be lower then.
update: yesterday's updateMonitors completed in 13 h 7 min. The cluster seemed pretty quiet otherwise.
great! Closing issue.
As of yesterday, effechecka/freshdata jobs are failing again. It seems that the download of the job jar from Amazon S3 fails somehow: the expected size of the job jar is ~45 MB, but the staging directories contain copies of the same jar at well under 45 MB, suggesting truncated downloads. See attached success/failure workspace screenshots.
success, with jar ~45 MB:
failed, with jar ~10 MB:
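A quick way to confirm the truncation (a sketch; the jar URL and sandbox layout follow the logs above, and the wildcarded run directory is a placeholder for an actual run):

# compare S3's Content-Length against what landed in the Mesos sandboxes
curl -sI https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar \
  | grep -i content-length
ls -l /var/lib/mesos/slaves/*/frameworks/*/executors/*/runs/*/iDigBio-LD-assembly-1.5.8.jar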
example from stderr: