cloud-bulldozer / benchmark-wrapper

Python Library to run benchmarks
https://benchmark-wrapper.readthedocs.io
Apache License 2.0

run_snafu does not protect against bad JSON data in certain failures #372

Open RobertKrawitz opened 2 years ago

RobertKrawitz commented 2 years ago

Running hammerdb-mssql on kata, we've seen occasional errors indicating parse failure on JSON output. The pod then errors out and there's no way to capture the bad JSON.

IMO run_snafu should catch the exception and print a more useful diagnostic, perhaps including the bad JSON data.

[root@perf-sm5039-3-1 benchmark-runner]# oc logs hammerdb-kata-workload-27917bd1--1-swjfq
2021-11-05T19:16:05Z - INFO     - MainProcess - run_snafu: logging level is INFO
2021-11-05T19:16:05Z - INFO     - MainProcess - _load_benchmarks: Successfully imported 3 benchmark modules: coremarkpro, systemd_analyze, uperf
2021-11-05T19:16:05Z - INFO     - MainProcess - _load_benchmarks: Failed to import 0 benchmark modules: 
2021-11-05T19:16:05Z - INFO     - MainProcess - run_snafu: Using elasticsearch server with host: http://10.1.184.179:9200
2021-11-05T19:16:05Z - INFO     - MainProcess - run_snafu: Using index prefix for ES: hammerdb-test-ci
2021-11-05T19:16:05Z - INFO     - MainProcess - run_snafu: Connected to the elasticsearch cluster with info as follows:
/usr/local/lib/python3.6/site-packages/elasticsearch/connection/base.py:208: ElasticsearchWarning: Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.14/security-minimal-setup.html to enable security.
  warnings.warn(message, category=ElasticsearchWarning)
2021-11-05T19:16:05Z - INFO     - MainProcess - run_snafu: {
    "name": "894e68c1af5d",
    "cluster_name": "docker-cluster",
    "cluster_uuid": "TVGTUkgJStG0xhcZNXGx3w",
    "version": {
        "number": "7.14.0",
        "build_flavor": "default",
        "build_type": "docker",
        "build_hash": "dd5a0a2acaa2045ff9624f3729fc8a6f40835aa1",
        "build_date": "2021-07-29T20:49:32.864135063Z",
        "build_snapshot": false,
        "lucene_version": "8.9.0",
        "minimum_wire_compatibility_version": "6.8.0",
        "minimum_index_compatibility_version": "6.0.0-beta1"
    },
    "tagline": "You Know, for Search"
}
2021-11-05T19:16:05Z - INFO     - MainProcess - py_es_bulk: Using streaming bulk indexer
2021-11-05T19:16:05Z - INFO     - MainProcess - wrapper_factory: identified hammerdb as the benchmark wrapper
2021-11-05T19:16:05Z - INFO     - MainProcess - trigger_hammerdb: Starting hammerdb run
2021-11-05T19:16:53Z - INFO     - MainProcess - trigger_hammerdb: Parsing stdout
2021-11-05T19:16:53Z - INFO     - MainProcess - trigger_hammerdb: generating json payload
Traceback (most recent call last):
  File "/usr/local/bin/run_snafu", line 33, in <module>
    sys.exit(load_entry_point('snafu', 'console_scripts', 'run_snafu')())
  File "/opt/snafu/snafu/run_snafu.py", line 142, in main
    es, process_generator(index_args, parser), parallel_setting
  File "/opt/snafu/snafu/utils/py_es_bulk.py", line 172, in streaming_bulk
    for ok, resp_payload in streaming_bulk_generator:
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 320, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 155, in _chunk_actions
    for action, data in actions:
  File "/opt/snafu/snafu/utils/py_es_bulk.py", line 118, in actions_tracking_closure
    for cl_action in cl_actions:
  File "/opt/snafu/snafu/run_snafu.py", line 199, in process_generator
    for action, index in data_object.emit_actions():
  File "/opt/snafu/snafu/hammerdb/trigger_hammerdb.py", line 284, in emit_actions
    timestamp,
  File "/opt/snafu/snafu/hammerdb/trigger_hammerdb.py", line 178, in _json_payload
    "worker": data[i][0],
IndexError: list index out of range
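
The traceback above ends in `_json_payload` indexing into parsed data that is shorter than expected. A minimal sketch of the kind of guard being asked for might look like the following. The function name `build_payload`, the exception `PayloadParseError`, and the `tpm` field are all hypothetical and are not taken from the actual wrapper code; only the `"worker": row[0]` access mirrors the traceback.

```python
import json


class PayloadParseError(Exception):
    """Hypothetical exception that carries the raw data that could not be converted."""


def build_payload(data, uuid, timestamp):
    # Assumption: `data` is a list of rows, one per worker, e.g. [["1", "12345"], ...].
    # The field layout is illustrative, not the real hammerdb wrapper schema.
    payload = []
    for i, row in enumerate(data):
        try:
            payload.append({
                "uuid": uuid,
                "timestamp": timestamp,
                "worker": row[0],
                "tpm": row[1],
            })
        except (IndexError, TypeError) as exc:
            # Include the offending row and the full parsed data in the message
            # so the bad input shows up in the pod log instead of being lost.
            raise PayloadParseError(
                f"row {i} of parsed benchmark output is malformed: {row!r}; "
                f"full parsed data: {json.dumps(data, default=str)}"
            ) from exc
    return payload
```

With something along these lines, the pod would still fail, but the log would show which row was short and what the parser actually produced.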

RobertKrawitz commented 2 years ago

@ebattat

RobertKrawitz commented 2 years ago

I have a bit more information on this. It appears that it was triggered when the backend mssql database failed. I copied strace into the workload pod and found this:

sh-4.4$ /var/tmp/strace -s 65536 -f -p 6
/var/tmp/strace: Process 6 attached
read(4, "\rError in Virtual User 1: Connection to DRIVER=ODBC Driver 17 for SQL Server;SERVER=tcp:mssql-deployment.mssql-db,1433;UID=SA;PWD=XXXXXXXXXXX could not be established : [Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired\n[Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: Error code 0x2749\n[Microsoft][ODBC Driver 17 for SQL Server]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.\n(connecting to database)\n", 3283) = 672
read(4, "\rVuser 1:FINISHED FAILED\n", 2611) = 25
read(4, "\rError in Virtual User 2: Connection to DRIVER=ODBC Driver 17 for SQL Server;SERVER=tcp:mssql-deployment.mssql-db,1433;UID=SA;PWD=XXXXXXXXXXX could not be established : [Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired\n[Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: Error code 0x2749\n[Microsoft][ODBC Driver 17 for SQL Server]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.\n(connecting to database)\n", 2586) = 672
read(4, "\rVuser 2:FINISHED FAILED\n", 1914) = 25
read(4, "\rError in Virtual User 3: Connection to DRIVER=ODBC Driver 17 for SQL Server;SERVER=tcp:mssql-deployment.mssql-db,1433;UID=SA;PWD=XXXXXXXXXXX could not be established : [Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired\n[Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: Error code 0x2749\n[Microsoft][ODBC Driver 17 for SQL Server]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.\n(connecting to database)\n", 1889) = 672

I can't find the resulting JSON in the strace log.
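
Since the strace output shows the benchmark reporting `Error in Virtual User N` and `Vuser N:FINISHED FAILED` on its stdout, one option would be to check for those lines before attempting to parse results. This is only a sketch: `find_vuser_failures` is a hypothetical helper, and the patterns are copied from the output above, so other HammerDB versions or failure modes may print something different.

```python
import re

# Patterns taken from the strace output above; real output may vary.
_FAILED = re.compile(r"Vuser (\d+):FINISHED FAILED")
_ERROR = re.compile(r"Error in Virtual User (\d+): (.*)")


def find_vuser_failures(stdout):
    """Return {vuser_number: first error line or None} for each failed virtual user."""
    failures = {}
    # The captured output prefixes lines with '\r', so split on both '\r' and '\n'.
    for line in re.split(r"[\r\n]+", stdout):
        m = _ERROR.search(line)
        if m:
            failures.setdefault(int(m.group(1)), m.group(2))
            continue
        m = _FAILED.search(line)
        if m:
            failures.setdefault(int(m.group(1)), None)
    return failures
```

A caller could log the returned dictionary and exit with a clear "database connection failed" message when it is non-empty, rather than going on to build a payload from output that contains no results.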