MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Killing Livy/Spark Job #248

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

Moving into a Spark cluster, we might need additional calls to kill a Job.

Until now, this was sufficient to stop a Job:

LivyClient().stop_job(instance.url)
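
For context, stop_job presumably cancels the running Livy statement; below is a minimal sketch of the equivalent raw call against Livy's REST statement-cancel endpoint, assuming the requests library, with livy_url, session_id, and statement_id as illustrative placeholders rather than Combine's actual internals:

import requests

# Illustrative values; Combine's LivyClient tracks the real session/statement ids.
livy_url = "http://localhost:8998"
session_id = 0
statement_id = 7

# Livy exposes a statement cancel endpoint:
#   POST /sessions/{sessionId}/statements/{statementId}/cancel
resp = requests.post(
    f"{livy_url}/sessions/{session_id}/statements/{statement_id}/cancel",
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)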

However, that same stop_job call now shows the statement as cancelled in the Livy UI, but the job in the Spark application continues. There is a bit of a disconnect between the Livy statement and the Spark application Job.

However, we know the JobGroup and can kill the Job from the Spark application with this URL:

http://example:4040/jobs/job/kill/?id=7

This returns:

18/07/18 13:13:44 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:44 INFO scheduler.DAGScheduler: Asked to cancel job 7
18/07/18 13:13:44 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:44 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
18/07/18 13:13:44 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:44 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
18/07/18 13:13:44 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:44 INFO scheduler.DAGScheduler: ShuffleMapStage 14 (jdbc at NativeMethodAccessorImpl.java:0) failed in 338.183 s due to Job 7 cancelled 
18/07/18 13:13:44 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:44 INFO scheduler.DAGScheduler: Job 7 failed: jdbc at NativeMethodAccessorImpl.java:0, took 338.241072 s
18/07/18 13:13:44 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:44 INFO storage.BlockManagerInfo: Removed broadcast_11_piece0 on example:35927 in memory (size: 2.1 KB, free: 2.2 GB)
18/07/18 13:13:44 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:44 INFO storage.BlockManagerInfo: Removed broadcast_11_piece0 on example:36209 in memory (size: 2.1 KB, free: 2004.6 MB)
18/07/18 13:13:44 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:44 INFO spark.ContextCleaner: Cleaned accumulator 13856
18/07/18 13:13:46 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:46 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 14.0 (TID 570, example, executor 0): TaskKilled (killed intentionally)
18/07/18 13:13:46 INFO utils.LineBufferedStream: stdout: 18/07/18 13:13:46 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool 
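
For reference, a minimal sketch of issuing that same kill request from Python instead of a browser, assuming the requests library; the host, port, and job id are just the example values from the kill URL above:

import requests

# Example values from the kill URL above.
spark_ui = "http://example:4040"
job_id = 7

# The Spark UI's kill endpoint asks the DAGScheduler to cancel the job.
resp = requests.get(f"{spark_ui}/jobs/job/kill/", params={"id": job_id})
print(resp.status_code)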

But even then, the Job continues. The only surefire way is to kill the Livy session / Spark application. Is this reason enough to revert to local[*] mode vs. a standalone cluster?

ghukill commented 6 years ago

FWIW, running Livy in local[*] or local mode does not help. Sticking with the cluster.

ghukill commented 6 years ago

Building on this, it would be beneficial to have the ability to stop a Job, in addition to deleting it (which should also stop the Job).

ghukill commented 6 years ago

Re: stopping a Spark Job, it might be more effective to stop it in the Spark application as opposed to canceling the statement from Livy.

Using Job.get_spark_jobs, we get a response like this:

[{'duration': 71,
  'duration_s': '0:01:11',
  'jobGroup': '842',
  'jobId': 20,
  'name': 'foreachPartition at MongoSpark.scala:117',
  'numActiveStages': 1,
  'numActiveTasks': 1,
  'numCompletedStages': 1,
  'numCompletedTasks': 11,
  'numFailedStages': 0,
  'numFailedTasks': 0,
  'numSkippedStages': 0,
  'numSkippedTasks': 0,
  'numTasks': 412,
  'stageIds': [33, 34, 35, 32],
  'status': 'RUNNING',
  'submissionTime': '2018-10-03T14:16:06.308GMT'}]

Using the jobId attribute from the output above, we can "kill" the Job in the Spark application with the following URL pattern:

http://HOST:4040/jobs/job/kill/?id=20
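
Tying those two pieces together, here is a minimal sketch that filters a get_spark_jobs-style response for RUNNING jobs and hits the kill endpoint for each, assuming the requests library; spark_ui_base and the hard-coded job list are illustrative placeholders, not Combine's actual wiring:

import requests

# Illustrative values: the Spark application UI base URL and a job list shaped
# like the get_spark_jobs output above.
spark_ui_base = "http://HOST:4040"
spark_jobs = [
    {"jobId": 20, "jobGroup": "842", "status": "RUNNING"},
]

# Ask the Spark UI to cancel every job that is still running.
for job in spark_jobs:
    if job.get("status") == "RUNNING":
        resp = requests.get(
            f"{spark_ui_base}/jobs/job/kill/",
            params={"id": job["jobId"]},
        )
        print(f"requested kill of Spark job {job['jobId']}: HTTP {resp.status_code}")
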
ghukill commented 6 years ago

Done.