actionml / harness

Harness is a Machine Learning/AI Server with plugins for many algorithms including the Universal Recommender
Apache License 2.0

Harness crashes 10 minutes after receiving two parallel training requests #313

Closed: asdfMaciej closed this issue 3 years ago

asdfMaciej commented 3 years ago

Hello,

First of all, I would like to thank you for creating and maintaining this great open-source project :)
I really appreciate that you spend time reading and responding to community feedback.

I've found a bug that can crash Harness by merely sending two requests to it at the same time.

The bug

Sending the two requests "train engine A" and "train engine B" to Harness in parallel causes both jobs to fail and, most of the time, crashes the server.
I've attached the full logs for the training job failure at [1] and the logs for the Harness crash at [2].

Steps I've used to reproduce the crash:

  1. Have two engines A and B (configured and populated with data)
  2. Send the "train engine A" and "train engine B" requests in parallel (see the sketch after this list)
  3. Training should fail shortly afterwards
  4. Harness may log the warning "Issue communicating with driver in heartbeater" 60 times over the next 10 minutes
  5. After ~10 minutes Harness may shut down: the Executor logs "Exit as unable to send heartbeats to driver more than 60 times", exits, and takes the Harness server down with it
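
For reference, here is a minimal sketch of step 2, assuming a default local Harness install without TLS or auth. The port, the POST /engines/<engine-id>/jobs path, and the engine IDs are placeholders/assumptions rather than values from my actual deployment:

```scala
// Hypothetical repro sketch: fire the two training requests at the same time.
// The REST port, endpoint path, and engine IDs are assumptions for a default
// local Harness install; adjust them to your deployment.
import java.net.{HttpURLConnection, URL}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelTrainRepro extends App {
  val harnessUrl = "http://localhost:9090"      // assumed default REST port
  val engineIds  = Seq("engine_a", "engine_b")  // placeholder engine IDs

  // POST /engines/<engine-id>/jobs is assumed to start a training job.
  def train(engineId: String): Int = {
    val conn = new URL(s"$harnessUrl/engines/$engineId/jobs")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.getOutputStream.close()                // send an empty body
    val code = conn.getResponseCode
    conn.disconnect()
    code
  }

  // Launch both requests concurrently and wait for the HTTP status codes.
  val statuses = Future.sequence(engineIds.map(id => Future(train(id))))
  println(Await.result(statuses, 5.minutes))
}
```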

Warnings, in chronological order:

copied from the logs for clarity and for search engine indexing

04:10:02.047 WARN  SparkContext      - Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
04:10:02.076 WARN  SparkContext      - Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
04:10:02.112 WARN  Utils             - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
04:10:02.190 WARN  MetricsSystem     - Stopping a MetricsSystem that is not running
04:10:02.197 WARN  Utils             - Service 'sparkDriver' could not bind on port 40761. Attempting port 40762.
04:10:02.298 WARN  SparkSession$Builder - Using an existing SparkSession; some configuration may not take effect.
04:10:39.015 WARN  BlockManager      - Block broadcast_0 already exists on this machine; not re-adding it
04:10:39.058 WARN  BlockManager      - Block broadcast_0_piece0 already exists on this machine; not re-adding it
04:10:40.184 WARN  BlockManager      - Block broadcast_1 already exists on this machine; not re-adding it
04:10:40.190 WARN  BlockManager      - Block broadcast_1_piece0 already exists on this machine; not re-adding it
04:10:40.236 WARN  TaskSetManager    - Lost task 0.0 in stage 65.0 (TID 37, localhost, executor driver): java.lang.ClassCastException: com.actionml.core.spark.GenericMongoConnector cannot be cast to org.apache.mahout.math.indexeddataset.BiMap
04:10:42.653 WARN  NettyRpcEnv       - Ignored failure: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@191a0b14 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@17dde647[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
04:10:42.654 WARN  Executor          - Issue communicating with driver in heartbeater
04:10:52.651 WARN  NettyRpcEnv       - Ignored failure: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@671d80a0 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@17dde647[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
04:10:52.651 WARN  Executor          - Issue communicating with driver in heartbeater
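
For context, here is a minimal standalone sketch (not taken from the Harness code base) of the constraint the first two warnings point at: a second local SparkContext in the same JVM fails to initialize.

```scala
// Illustration of SPARK-2243: only one SparkContext may run per JVM.
import org.apache.spark.{SparkConf, SparkContext}

object TwoLocalContexts extends App {
  val first = new SparkContext(
    new SparkConf().setMaster("local").setAppName("engine A"))

  // Constructing a second context logs "Another SparkContext is being
  // constructed ..." and throws, since only one SparkContext may be running
  // in this JVM (see SPARK-2243).
  val second = new SparkContext(
    new SparkConf().setMaster("local").setAppName("engine B"))
}
```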

Errors, in chronological order:

04:10:02.172 ERROR SparkContext      - Error initializing SparkContext.
04:10:02.184 ERROR Utils             - Uncaught exception in thread ForkJoinPool-2-worker-3
04:10:02.193 ERROR SparkContextSupport$ - Spark context can not be created for job JobDescription(13da4396-a9ca-4f95-919a-783e851a04dc,queued,Spark job,Some(Thu Sep 23 04:10:02 CEST 2021),None)
04:10:40.221 ERROR Executor          - Exception in task 0.0 in stage 65.0 (TID 37)
04:10:40.237 ERROR TaskSetManager    - Task 0 in stage 65.0 failed 1 times; aborting job
04:10:40.244 ERROR URAlgorithm       - Spark computation failed for engine 3c90acb6-2c2f-42ac-8b53-7c955d0912ef with params <engine config JSON, embedded at [3]>

It's worth mentioning that I know this isn't the intended way to schedule training jobs.
I found the issue through a cron configuration mistake: I accidentally scheduled both training jobs for the same time instead of spacing them out.
However, this simple mistake caused the entire Harness server to crash, so I'm reporting the issue; it's the first time Harness has crashed on me.

Machine configuration

I'm running a Harness instance on an Ubuntu 18.04.2 LTS machine with 16 GB of RAM and 2 GB of swap.
At the moment all of the services are running on it, including Spark. Software versions:

The RAM is sufficient for Spark to execute: there are over 9 GB available, the Spark executor memory and driver memory are 4 GB each, and the dataset isn't too large for Spark.
Every training job I run completes successfully, except in the described case.
I've tested my deployment specifically to make sure that RAM isn't the issue.

Engine configuration

The affected engine configuration can be found at [3]. The other engine's configuration is exactly the same, except for a different engine ID.
There are around 700k-1000k events per engine, over 95% of which are buy and enter_product. The TTL is set to a year, but the data spans only the past 3.5 months.
The engines work fine - they properly store events and return recommendations.

Harness usage

Harness is serving recommendations and storing events for an e-commerce store. The training jobs are scheduled for 4 AM because traffic is lowest at that hour.

Links to the logs

[1]. Logs for the training: https://gist.github.com/asdfMaciej/eb4e13de903e4f35892af7b0f6a6c6f9
[2]. Logs for the Harness shutdown: https://gist.github.com/asdfMaciej/c903d82d3e97011cfe4c6e748953134c
[3]. Engine config: https://gist.github.com/asdfMaciej/dc116e20f6827d0a591ab7d68a05dfa8

Let me know if you need more information about this issue.
Thanks in advance for your help and have a great day :)

Best regards,
Maciej Kaszkowiak

qqmbr4k commented 3 years ago

Hi Maciej, thanks for your feedback and the very good description.

master: local for both engines means that two local SparkContexts were created; Spark does not support this. You can use standalone or clustered Spark to train engines in parallel: https://spark.apache.org/docs/2.3.3/spark-standalone.html
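
For example, the sparkConf block of each engine's JSON config could point at a standalone master instead of local. This is only a hedged sketch: the host and port are placeholders for your actual Spark master, and the key names should be checked against the config you posted at [3].

```json
"sparkConf": {
  "master": "spark://your-spark-master:7077",
  "spark.driver.memory": "4g",
  "spark.executor.memory": "4g"
}
```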