bio-guoda / guoda-services

Services provided by GUODA, currently a container for tickets and wikis.
MIT License

Job fails, apparently due to incomplete job jar download #37

Closed by jhpoelen 6 years ago

jhpoelen commented 6 years ago

As of yesterday, effechecka/freshdata jobs are failing. It seems that the download of the job jar from Amazon S3 is failing somehow.

The expected size of the job jar is ~45 MB, but in the staging directories the same job jars are found at well under 45 MB. See the attached success/failure workspace screenshots.

success with jar ~ 45 MB screenshot from 2017-11-14 14-15-01

failed with jar ~ 10 MB screenshot from 2017-11-14 14-15-18

example from stderr


I1114 17:06:33.839841  5905 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S1\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"https:\/\/s3-us-west-2.amazonaws.com\/guoda\/idigbio-spark\/iDigBio-LD-assembly-1.5.8.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S1\/frameworks\/8d894434-c261-4d4f-9595-0ae41bac67eb-0345\/executors\/driver-20171114170632-58381\/runs\/dd5545a8-f29f-47f9-adae-db6d0f815b7c","user":"root"}
I1114 17:06:33.842356  5905 fetcher.cpp:409] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1114 17:06:33.842372  5905 fetcher.cpp:250] Fetching directly into the sandbox directory
I1114 17:06:33.842392  5905 fetcher.cpp:187] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1114 17:06:33.842407  5905 fetcher.cpp:134] Downloading resource from 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar' to '/var/lib/mesos/slaves/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S1/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171114170632-58381/runs/dd5545a8-f29f-47f9-adae-db6d0f815b7c/iDigBio-LD-assembly-1.5.8.jar'
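Since a failed run leaves a truncated jar behind in the sandbox, one cheap guard is to compare the fetched file's size against a minimum expected size before launching. A minimal sketch; the helper name and the 40 MB threshold are illustrative, not part of the actual deployment:

```shell
#!/bin/sh
# check_jar_size: fail if a fetched jar looks truncated.
# Usage: check_jar_size <path> <min_bytes>
check_jar_size() {
  path="$1"
  min_bytes="$2"
  actual=$(wc -c < "$path" | tr -d ' \t')
  if [ "$actual" -lt "$min_bytes" ]; then
    echo "ERROR: $path is $actual bytes, expected >= $min_bytes (truncated download?)" >&2
    return 1
  fi
  echo "OK: $path is $actual bytes"
}

# Example: a ~45 MB jar should be at least ~40 MB after fetching.
# check_jar_size iDigBio-LD-assembly-1.5.8.jar 40000000
```

A stronger variant would verify a published checksum of the jar rather than its size.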
jhpoelen commented 6 years ago

related to #23

jhpoelen commented 6 years ago

Downloading the job jar from the Jupyter terminal on jupyter.idigbio.org shows a download speed of ~100-200 KB/s, which seems unexpectedly slow. See screenshot.

screenshot from 2017-11-14 14-18-17
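The observed speed can also be measured without a browser; curl's `--write-out` variables report the average transfer rate directly. A small sketch using the job jar URL from the logs above:

```shell
#!/bin/sh
# Download the job jar, discard the body, and print the average
# transfer speed in bytes/second (curl's built-in %{speed_download}).
curl -s -o /dev/null \
  -w 'average speed: %{speed_download} bytes/s\n' \
  'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
```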

jhpoelen commented 6 years ago

@godfoder @mjcollin I've tried putting the job jars into HDFS and having Mesos grab them from there... but alas:


I1114 17:29:20.400023  5195 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S8\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"hdfs:\/\/\/guoda\/lib\/iDigBio-LD-assembly-1.5.8.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S8\/frameworks\/8d894434-c261-4d4f-9595-0ae41bac67eb-0345\/executors\/driver-20171114172920-58390\/runs\/9a97a40b-84b2-4e3c-afb7-fa42d13073b0","user":"root"}
I1114 17:29:20.402480  5195 fetcher.cpp:409] Fetching URI 'hdfs:///guoda/lib/iDigBio-LD-assembly-1.5.8.jar'
I1114 17:29:20.402496  5195 fetcher.cpp:250] Fetching directly into the sandbox directory
I1114 17:29:20.402516  5195 fetcher.cpp:187] Fetching URI 'hdfs:///guoda/lib/iDigBio-LD-assembly-1.5.8.jar'
E1114 17:29:20.704540  5195 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output:
/usr/bin/hadoop: line 27: /usr/bin/../libexec/hadoop-config.sh: No such file or directory
/usr/bin/hadoop: line 166: exec: : not found
Failed to fetch 'hdfs:///guoda/lib/iDigBio-LD-assembly-1.5.8.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was either not found or exited with a non-zero exit status: 127
Failed to synchronize with agent (it's probably exited)

any ideas?

mjcollin commented 6 years ago

Yes, the hadoop script expects an environment variable containing part of the program path to be set and handles it poorly when it's empty.

Source /etc/profile.d/hadoop.sh if possible; if not, here are the variables to set:

mjcollin@idb-jupyter1:~$ cat /etc/profile.d/hadoop.sh
export HADOOP_PREFIX=/usr/lib/hadoop
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
# The hadoop shell script is foolish about detecting this dir, set it manually
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
#Local GUODA settings
export HADOOP_USER_NAME=hdfs

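For a script that may run where the profile snippet is absent, the same settings can be applied with a fallback. This just restates the variables above as a copy-pasteable block; the `hadoop version` sanity check is left commented since it only works on a machine with Hadoop installed:

```shell
#!/bin/sh
# Ensure the Hadoop environment is set before invoking `hadoop` (or anything,
# like the Mesos fetcher, that shells out to it). Paths match the
# /etc/profile.d/hadoop.sh quoted above.
if [ -r /etc/profile.d/hadoop.sh ]; then
  . /etc/profile.d/hadoop.sh
else
  export HADOOP_PREFIX=/usr/lib/hadoop
  export HADOOP_HOME=/usr/lib/hadoop
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  # The hadoop shell script is foolish about detecting this dir; set it manually.
  export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
  export HADOOP_USER_NAME=hdfs
fi

# Sanity check (uncomment on a machine with Hadoop installed):
# hadoop version
```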
jhpoelen commented 6 years ago

Thanks for the info. These are the settings I use for the job to talk to HDFS after it has been launched (https://github.com/bio-guoda/effechecka/blob/56a09acf5352476fa4fa73503665fb0f7df455a0/src/main/scala/effechecka/SparkSubmitter.scala#L80). However, this particular issue needs the variables to be available before the Spark job is launched, in the "bare" Mesos container on a Mesos agent. I am assuming that @mjcollin @godfoder control these environment variables.

jhpoelen commented 6 years ago

For some reason, downloads from Amazon S3 are now functioning again.

Was there some ACIS / UFL network issue?

screenshot from 2017-11-15 08-57-45

mjcollin commented 6 years ago

Not that I know of. It seems to me this has happened before, though: I remember looking at logs from a previous episode when jobs were not finishing, and the (I think) wget line was the last line logged when the job started up. I had no trouble fetching the jar on my desktop, but I could see a number of attempted job starts on the cluster where the fetch seemed to fail.

jhpoelen commented 6 years ago

Hmm. Perhaps there's some throttling happening on the UFL or S3 side.

Any chance you can set the HDFS environment variables on the Mesos agent machines? I don't think I can set them any other way.
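For the agent-side fix requested here: the Mesos agent exposes a `--hadoop_home` flag telling the fetcher where to find the `hadoop` binary for `hdfs://` URIs, and agent flags can also be supplied via `MESOS_`-prefixed environment variables. A sketch of what that might look like under Ubuntu-style packaging; the file locations are assumptions about this cluster, not confirmed:

```shell
# /etc/default/mesos-slave  (assumed path; Mesosphere-style packages read an
# environment file like this, and flag files under /etc/mesos-slave/ also work)

# Tell the Mesos fetcher where Hadoop lives so hdfs:// URIs can be fetched.
export MESOS_HADOOP_HOME=/usr/lib/hadoop

# Equivalent flag-file form:
#   echo /usr/lib/hadoop | sudo tee /etc/mesos-slave/hadoop_home
# then restart the agent, e.g.:
#   sudo service mesos-slave restart
```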

jhammock commented 6 years ago

For what the data is worth: I was able to upload a fresh copy of iNat this morning on the second try, and unzip it. I believe two attempts at makeParquet have failed.

jhpoelen commented 6 years ago

I can confirm that jobs are again stalling due to apparent download issues.

@mjcollin @godfoder please configure the Mesos agents for HDFS access.

screenshot from 2017-11-17 11-51-24

jhpoelen commented 6 years ago

For some miraculous reason, the most recent checklist job is running ok. Any idea what changed in the last 10 minutes?

last known fail

I1117 14:46:29.152925 27876 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S0\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"https:\/\/s3-us-west-2.amazonaws.com\/guoda\/idigbio-spark\/iDigBio-LD-assembly-1.5.8.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S0\/frameworks\/8d894434-c261-4d4f-9595-0ae41bac67eb-0345\/executors\/driver-20171117144629-91700\/runs\/e9ee7ae9-2a7e-4767-81f7-2cbb700a3104","user":"root"}
I1117 14:46:29.155643 27876 fetcher.cpp:409] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:46:29.155660 27876 fetcher.cpp:250] Fetching directly into the sandbox directory
I1117 14:46:29.155689 27876 fetcher.cpp:187] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:46:29.155711 27876 fetcher.cpp:134] Downloading resource from 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar' to '/var/lib/mesos/slaves/dabbc59d-84c3-4fb3-8baa-045c55e020fb-S0/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171117144629-91700/runs/e9ee7ae9-2a7e-4767-81f7-2cbb700a3104/iDigBio-LD-assembly-1.5.8.jar'

current success

[...]
I1117 14:51:31.551918 26058 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/ea73f349-1248-4417-85dd-844276310f46-S0\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"https:\/\/s3-us-west-2.amazonaws.com\/guoda\/idigbio-spark\/iDigBio-LD-assembly-1.5.8.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/ea73f349-1248-4417-85dd-844276310f46-S0\/frameworks\/8d894434-c261-4d4f-9595-0ae41bac67eb-0345\/executors\/driver-20171117145131-91701\/runs\/572f634b-cf59-4f8f-b094-4d9aa0ae6513","user":"root"}
I1117 14:51:31.554407 26058 fetcher.cpp:409] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:51:31.554424 26058 fetcher.cpp:250] Fetching directly into the sandbox directory
I1117 14:51:31.554445 26058 fetcher.cpp:187] Fetching URI 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:51:31.554460 26058 fetcher.cpp:134] Downloading resource from 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar' to '/var/lib/mesos/slaves/ea73f349-1248-4417-85dd-844276310f46-S0/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171117145131-91701/runs/572f634b-cf59-4f8f-b094-4d9aa0ae6513/iDigBio-LD-assembly-1.5.8.jar'
W1117 14:52:19.419237 26058 fetcher.cpp:289] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar
I1117 14:52:19.419298 26058 fetcher.cpp:547] Fetched 'https://s3-us-west-2.amazonaws.com/guoda/idigbio-spark/iDigBio-LD-assembly-1.5.8.jar' to '/var/lib/mesos/slaves/ea73f349-1248-4417-85dd-844276310f46-S0/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171117145131-91701/runs/572f634b-cf59-4f8f-b094-4d9aa0ae6513/iDigBio-LD-assembly-1.5.8.jar'
I1117 14:52:19.549391 26068 exec.cpp:161] Version: 1.0.0
I1117 14:52:19.550938 26071 exec.cpp:236] Executor registered on agent ea73f349-1248-4417-85dd-844276310f46-S0
Warning: Ignoring non-spark config property: _spark.eventLog.enabled=true
17/11/17 14:52:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@730: Client environment:host.name=mesos01
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@738: Client environment:os.arch=4.4.0-87-generic
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@739: Client environment:os.version=#110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@log_env@767: Client environment:user.dir=/var/lib/mesos/slaves/ea73f349-1248-4417-85dd-844276310f46-S0/frameworks/8d894434-c261-4d4f-9595-0ae41bac67eb-0345/executors/driver-20171117145131-91701/runs/572f634b-cf59-4f8f-b094-4d9aa0ae6513
2017-11-17 14:52:24,081:26086(0x7ff0935f0700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=mesos02:2181,mesos01:2181,mesos03:2181 sessionTimeout=10000 watcher=0x7ff1a2697400 sessionId=0 sessionPasswd=<null> context=0x7ff054000998 flags=0
2017-11-17 14:52:24,083:26086(0x7ff0923e7700):ZOO_INFO@check_events@1728: initiated connection to server [10.13.44.15:2181]
I1117 14:52:24.102135 26220 sched.cpp:226] Version: 1.0.0
2017-11-17 14:52:24,118:26086(0x7ff0923e7700):ZOO_INFO@check_events@1775: session establishment complete on server [10.13.44.15:2181], sessionId=0x15f9d9055370068, negotiated timeout=10000
I1117 14:52:24.119119 26217 group.cpp:349] Group process (group(1)@10.13.44.15:45886) connected to ZooKeeper
I1117 14:52:24.119171 26217 group.cpp:837] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1117 14:52:24.119189 26217 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I1117 14:52:24.120016 26215 detector.cpp:152] Detected a new leader: (id='1754')
I1117 14:52:24.120133 26206 group.cpp:706] Trying to get '/mesos/json.info_0000001754' in ZooKeeper
I1117 14:52:24.120646 26210 zookeeper.cpp:259] A new leading master (UPID=master@10.13.44.18:5050) is detected
I1117 14:52:24.120724 26208 sched.cpp:330] New master detected at master@10.13.44.18:5050
I1117 14:52:24.120877 26208 sched.cpp:341] No credentials provided. Attempting to register without authentication
I1117 14:52:24.122133 26209 sched.cpp:743] Framework registered with 8d894434-c261-4d4f-9595-0ae41bac67eb-0345-driver-20171117145131-91701
17/11/17 14:52:24 WARN MesosCoarseGrainedSchedulerBackend: Unable to parse  into a key:value label for the task.
17/11/17 14:52:24 WARN MesosCoarseGrainedSchedulerBackend: Unable to parse  into a key:value label for the task.
17/11/17 14:52:47 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.

[Stage 2:>                                                       (0 + 0) / 2232]
[...]
jhammock commented 6 years ago

update: I got three successful resource file updates today for fresh data. Feeling bold, I attempted an UpdateMonitors, which failed. I'll try that again tonight, hoping cluster traffic is a factor and perhaps it'll be lower then.

jhammock commented 6 years ago

update: yesterday's updateMonitors completed in 13 h 7 min. The cluster seemed pretty quiet otherwise.

jhpoelen commented 6 years ago

great! Closing issue.