Status: closed (asdfMaciej closed this issue 3 years ago)
Hi Maciej, thanks for your feedback and the very detailed description.
`master: local` for both engines means that two local SparkContexts were created; Spark does not support running more than one SparkContext in a single JVM. To train engines in parallel, use a standalone or clustered Spark deployment: https://spark.apache.org/docs/2.3.3/spark-standalone.html
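For reference, a minimal sketch of bringing up a standalone Spark cluster on one machine, assuming `SPARK_HOME` points at a Spark 2.3.3 installation (`<master-host>` is a placeholder):

```
# Start the standalone master (master on :7077, web UI on :8080 by default)
$SPARK_HOME/sbin/start-master.sh

# Start a worker and register it with the master
$SPARK_HOME/sbin/start-slave.sh spark://<master-host>:7077
```

Each engine's config would then point its master at `spark://<master-host>:7077` instead of `local`, letting the cluster schedule both training jobs.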
Hello,
First of all, I would like to thank you for creating and maintaining this great open-source project :)
I really appreciate that you spend time reading and responding to the community feedback.
I've found a bug that can crash Harness simply by sending it two requests at the same time.
The bug
Sending two parallel "train engine A" and "train engine B" requests to Harness fails both jobs and, most of the time, crashes the whole instance.
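Concretely, I trigger both jobs at (nearly) the same moment. A sketch with hypothetical engine IDs, using the harness-cli train command:

```
harness-cli train engine-a &
harness-cli train engine-b &
wait
```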
I've attached the full logs for the training job fail at [1] and logs for the Harness crash at [2].
Steps I've used to reproduce the crash:
Warnings, in chronological order (copied from the logs for clarity and for search engine indexing):
Errors, in chronological order:
It's worth mentioning that I know this isn't the intended way to schedule training jobs.
I found the issue through a cron configuration mistake: I accidentally scheduled both training jobs for the same time instead of spacing them out.
However, this simple mistake caused the entire Harness instance to crash, which is why I'm reporting it: this is the first time Harness has crashed on me.
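For completeness, a crontab along the lines of what I should have used, spacing the jobs an hour apart (the engine IDs and CLI path are illustrative):

```
# Train engine A at 4:00 AM and engine B at 5:00 AM
0 4 * * * /path/to/harness-cli train engine-a
0 5 * * * /path/to/harness-cli train engine-b
```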
Machine configuration
I'm running a Harness instance on an Ubuntu 18.04.2 LTS machine with 16 GB of RAM and 2 GB of swap.
At the moment all of the services are running on it, including Spark. Software versions:
The RAM is sufficient for Spark to run: over 9 GB is available, the Spark executor and driver memory are 4 GB each, and the dataset isn't too large for Spark.
Every training job I run completes successfully, except in the case described above.
I've tested my deployment specifically to rule out RAM as the cause.
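For context, the 4 GB settings mentioned above would sit in the engine config's sparkConf block; a sketch showing only the memory keys (all other settings omitted):

```
"sparkConf": {
    "spark.executor.memory": "4g",
    "spark.driver.memory": "4g"
}
```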
Engine configuration
The affected engine configuration can be found at [3]. The other engine's configuration is exactly the same, except for a different engine ID.
There are around 700k-1000k events per engine, over 95% of which are buy and enter_product. The TTL is set to a year, but the data spans only the past 3.5 months.
The engines work fine - they properly store events and return recommendations.
Harness usage
Harness is serving recommendations and storing events for an e-commerce store. The training jobs were scheduled at 4 AM since traffic is lowest at that hour.
Links to the logs
[1]. Logs for the training: https://gist.github.com/asdfMaciej/eb4e13de903e4f35892af7b0f6a6c6f9
[2]. Logs for the Harness shutdown: https://gist.github.com/asdfMaciej/c903d82d3e97011cfe4c6e748953134c
[3]. Engine config: https://gist.github.com/asdfMaciej/dc116e20f6827d0a591ab7d68a05dfa8
Let me know if you need more information about this issue.
Thanks in advance for your help and have a great day :)
Best regards,
Maciej Kaszkowiak