Azure-Samples / jmeter-aci-terraform

Scalable cloud load/stress testing pipeline solution with Apache JMeter and Terraform to dynamically provision and destroy the required infrastructure on Azure.
MIT License

Controller Container Fails To Start, Failure is Hidden in Pipeline #79

Closed dagrooms52 closed 2 years ago

dagrooms52 commented 2 years ago

The jmeter-controller container intermittently fails to start when run from this pipeline.

When this happens, there are no container logs to pull during 'TEST: Wait Test Execution', so that step exits immediately with a success code and the failure is hidden. It may be better to fail the pipeline earlier by checking whether the container is running at the end of 'SETUP: Run Terraform Apply (target=all)'.
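
A minimal sketch of such a check, assuming the pipeline agent has the Azure CLI available and that the JSON printed by `az container show` exposes each container's state under `containers[].instanceView.currentState.state`. The resource group and container group names are placeholders, and this is not code from the repository:

```python
import json
import subprocess


def fetch_container_group(resource_group: str, name: str) -> dict:
    """Return the JSON that `az container show` prints for a container group."""
    result = subprocess.run(
        ["az", "container", "show",
         "--resource-group", resource_group,
         "--name", name,
         "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def container_states(group: dict) -> dict:
    """Map each container name to its current state, or "Unknown" if missing."""
    states = {}
    for container in group.get("containers", []):
        view = container.get("instanceView") or {}
        current = view.get("currentState") or {}
        states[container.get("name", "?")] = current.get("state", "Unknown")
    return states


def assert_running(group: dict) -> None:
    """Raise RuntimeError if any container in the group is not Running."""
    not_running = {
        name: state
        for name, state in container_states(group).items()
        if state != "Running"
    }
    if not_running:
        raise RuntimeError(f"containers not running: {not_running}")
```

A pipeline step could call `assert_running(fetch_container_group("<resource-group>", "<controller-group>"))` after Terraform apply. Note that a controller that has already terminated would also be flagged, so a real check might need to inspect the exit code as well.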

As far as I can tell, this is just an issue with JMeter's distributed worker reliability. There are no code changes between succeeding and failing runs except for a change to the WORKER_COUNT variable in the jmeter file.

dagrooms52 commented 2 years ago

I've seen this same error from the controller container multiple times during this failure. It fails to connect to the worker[0] instance, which always has IP 10.0.0.4.

START Running Jmeter on Mon Sep 27 16:31:37 UTC 2021
JVM_ARGS=-Xmn2908m -Xms11632m -Xmx11632m
jmeter args=-n -J server.rmi.ssl.disable=true -t sample.jmx -l results.jtl -e -o dashboard -R 10.0.0.5,10.0.0.4
Sep 27, 2021 4:31:39 PM java.util.prefs.FileSystemPreferences$1 run
INFO: Created user preferences directory.
Creating summariser <summary>
Created the tree successfully using sample.jmx
Configuring remote engine: 10.0.0.5
Configuring remote engine: 10.0.0.4
Connection refused to host: 10.0.0.4; nested exception is: 
    java.net.ConnectException: Connection refused (Connection refused)
Failed to configure 10.0.0.4
Stopping remote engines
Remote engines have been stopped
Error in NonGUIDriver java.lang.RuntimeException: Following remote engines could not be configured:[10.0.0.4]
END Running Jmeter on Mon Sep 27 16:31:40 UTC 2021
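
Since the controller prints a recognizable message when it cannot reach a worker, a pipeline step could scan the controller log for it and fail explicitly. The following is only a sketch: the function name is hypothetical, and the regular expression is based solely on the message format visible in the log above.

```python
import re
from typing import List

# Matches the summary line JMeter printed in the controller log above.
_FAILED_ENGINES = re.compile(
    r"Following remote engines could not be configured:\s*\[([^\]]*)\]"
)


def unconfigured_engines(log_text: str) -> List[str]:
    """Return the hosts JMeter reported as not configurable, or [] if none."""
    match = _FAILED_ENGINES.search(log_text)
    if not match:
        return []
    return [host.strip() for host in match.group(1).split(",") if host.strip()]
```

A step that pulls the controller log could exit non-zero whenever this returns a non-empty list, instead of reporting success.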

devlie commented 2 years ago

I'd bet this is the same problem as issue #78, which I just reported.

The actual cause is the worker container failing to start JMeter. Because the worker has no restart_policy specified, ACI tries to restart it a couple of times before giving up, which wipes the console log. I was able to catch it by setting restart_policy to Never, as on the controller, and watching the ACI console as it runs.

I have no idea how to resolve it yet though.

dagrooms52 commented 2 years ago

I'm not sure. I haven't seen these containers come up with a loopback address (127.0.0.1:37683 in your issue). They seem to start with the correct IP, yet the controller isn't able to contact them. Thanks for the tip on restart_policy though. I will set that to Never for the workers so I can grab logs and compare them to your info.

devlie commented 2 years ago

The worker container actually starts with the right IP, but for some reason the heuristics that JMeter employs aren't able to resolve it. You can also verify this by skipping the cleanup step on failure and manually restarting the worker container.

dagrooms52 commented 2 years ago

You're right. I was able to stop the containers and caught 3 of them reproducing this when deploying 20.

START Running Jmeter on Tue Sep 28 00:04:40 UTC 2021
JVM_ARGS=-Xmn1572m -Xms6288m -Xmx6288m
jmeter args=-s -J server.rmi.ssl.disable=true
Sep 28, 2021 12:04:42 AM java.util.prefs.FileSystemPreferences$1 run
INFO: Created user preferences directory.
Created remote object: UnicastServerRef2 [liveRef: [endpoint:[127.0.0.1:42101](local),objID:[10877288:17c29b7c631:-7fff, -8508851859447852716]]]
Server failed to start: java.rmi.RemoteException: Cannot start. SandboxHost-637683842593770648 is a loopback address.
An error occurred: Cannot start. SandboxHost-637683842593770648 is a loopback address.
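
To spot affected workers without reading every log by hand, the loopback failure can be detected from the worker output. This is a hedged sketch: the helper name is hypothetical, and the pattern relies only on the wording of the error shown in the log above.

```python
import re
from typing import Optional

# Matches the RMI startup failure printed by the worker in the log above.
_LOOPBACK = re.compile(r"Cannot start\. (\S+) is a loopback address\.")


def loopback_hostname(log_text: str) -> Optional[str]:
    """Return the hostname JMeter rejected as loopback, or None if not found."""
    match = _LOOPBACK.search(log_text)
    return match.group(1) if match else None
```

Running this over each worker's log would identify which of the deployed workers hit the problem.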

dagrooms52 commented 2 years ago

See #78 for a possible workaround.