cloud-bulldozer / benchmark-operator

The Chuck Norris of cloud benchmarks
Apache License 2.0

fs-drift tests hanging #741

Closed rsevilla87 closed 2 years ago

rsevilla87 commented 2 years ago

fs-drift e2e tests are hanging with the following traceback:

2022-03-09T23:11:23Z - DEBUG    - MainProcess - run_snafu: Run ID is {'NA'}
2022-03-09T23:11:23Z - DEBUG    - MainProcess - run_snafu: document size is: 240
2022-03-09T23:11:23Z - DEBUG    - MainProcess - run_snafu: {
    "_index": "ripsaw-fs-drift-rsptimes",
    "_op_type": "create",
    "_source": {
        "op-count": 8369,
        "cluster_name": "test_ci",
        "uuid": "566d1083-c8f1-5930-8e48-a4b0b6a6c6db",
        "user": "homer_simpson",
        "sample": 1,
        "date": "2022-03-09T23:11:20.000Z",
        "file-ops-per-sec": 8369.0,
        "min": 0.000138,
        "max": 0.035008,
        "mean": 0.000589,
        "50%": 0.000257,
        "90%": 0.001502,
        "95%": 0.002365,
        "99%": 0.005343,
        "run_id": "NA"
    },
    "_id": "3977bfd47efa0216f63ab635ef4831a246d305bc96ee065525b17e5fd204c525",
    "run_id": "NA"
}
2022-03-09T23:11:23Z - INFO     - MainProcess - trigger_fs_drift: process 38 records from rates-over-time file counters.00.fs-drift-hostpath-benchmark-client-566d1083-1-6tck5.json 
Traceback (most recent call last):
  File "/usr/local/bin/run_snafu", line 33, in <module>
    sys.exit(load_entry_point('snafu', 'console_scripts', 'run_snafu')())
  File "/opt/snafu/snafu/run_snafu.py", line 142, in main
    es, process_generator(index_args, parser), parallel_setting
  File "/opt/snafu/snafu/utils/py_es_bulk.py", line 172, in streaming_bulk
    for ok, resp_payload in streaming_bulk_generator:
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 320, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 155, in _chunk_actions
    for action, data in actions:
  File "/opt/snafu/snafu/utils/py_es_bulk.py", line 118, in actions_tracking_closure
    for cl_action in cl_actions:
  File "/opt/snafu/snafu/run_snafu.py", line 199, in process_generator
    for action, index in data_object.emit_actions():
  File "/opt/snafu/snafu/fs_drift_wrapper/trigger_fs_drift.py", line 170, in emit_actions
    json_obj = json.loads(json_str)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 761 (char 760)
+ '[' 1 = 0 ']'
+ echo FAIL
FAIL

More info and logs at https://github.com/cloud-bulldozer/benchmark-operator/actions/runs/1960169819

Maybe a recent change in fs-drift @bengland2?

bengland2 commented 2 years ago

Yes, fs-drift has changed a lot. Sam Petrovic in Brno has merged his version of it with mine, so perhaps we broke benchmark-operator. I'll get right on this tomorrow AM.

spetrovi commented 2 years ago

Hey, yeah, fs-drift went through a lot of changes recently, so unfortunately we expect some things to break here and there. This bug is pretty peculiar, though. I can see the code is failing while trying to parse the counters JSON output by fs-drift, but honestly, I didn't touch that output generator much. When I run fs-drift on my machine, it produces valid JSON; however, when I parse it with the procedure from trigger_fs_drift.py, I get the same error. Copied from trigger_fs_drift.py:

import json
import os

rsptime_dir = '/tmp/mydir/network-shared/'
for fn in os.listdir(rsptime_dir):
    if fn.startswith("counters") and fn.endswith("json"):
        pathnm = os.path.join(rsptime_dir, fn)
        with open(pathnm, "r") as f:
            records = [line.strip() for line in f.readlines()]
        json_start = 0
        for index, record in enumerate(records):
            print(record)
            if record == "{":
                json_start = index
            if record == "}{" or record == "}":
                # extract next JSON string from counter logfile
                json_str = " ".join(records[json_start:index])
                json_str += " }"
                if record == "}{":
                    records[index] = "{"
                json_start = index
                json_obj = json.loads(json_str)

The culprit is in how json_str is built. Here is a valid JSON file produced by fs-drift:

[{
    "created": 1141,
    "deleted": 122,
    "softlinked": 14,
    "hardlinked": 102,
    "appended": 7132,
    "randomly_written": 1169,
    "sequentially_read": 2376,
    "randomly_read": 1234,
    "renamed": 1198,
    "truncated": 67,
    "remounted": 0,
    "readdir": 138,
    "read_requests": 2376,
    "read_bytes": 8789830,
    "randread_requests": 1233,
    "randread_bytes": 2431366,
    "write_requests": 8273,
    "write_bytes": 17014543,
    "randwrite_requests": 1169,
    "randwrite_bytes": 2233762,
    "randdiscard_requests": 0,
    "randdiscard_bytes": 0,
    "fsyncs": 1942,
    "fdatasyncs": 941,
    "dirs_created": 0,
    "e_already_exists": 4885,
    "e_file_not_found": 3519,
    "e_no_dir_space": 0,
    "e_no_inode_space": 0,
    "e_no_space": 0,
    "e_not_mounted": 0,
    "e_could_not_unmount": 0,
    "e_stale_fh": 0,
    "e_could_not_mount": 0,
    "e_dir_not_found": 0,
    "elapsed-time": "      5.0",
    "total-errors": "        0"
}
]

So what happens is: the innermost for loop cycles through the lines up to the second-to-last one (the closing curly bracket), which passes the conditional; the code then joins the lines to form json_str and appends another closing curly bracket, so the resulting json_str contains this string:

[{
    "created": 1141,
    "deleted": 122,
    "softlinked": 14,
    "hardlinked": 102,
    "appended": 7132,
    "randomly_written": 1169,
    "sequentially_read": 2376,
    "randomly_read": 1234,
    "renamed": 1198,
    "truncated": 67,
    "remounted": 0,
    "readdir": 138,
    "read_requests": 2376,
    "read_bytes": 8789830,
    "randread_requests": 1233,
    "randread_bytes": 2431366,
    "write_requests": 8273,
    "write_bytes": 17014543,
    "randwrite_requests": 1169,
    "randwrite_bytes": 2233762,
    "randdiscard_requests": 0,
    "randdiscard_bytes": 0,
    "fsyncs": 1942,
    "fdatasyncs": 941,
    "dirs_created": 0,
    "e_already_exists": 4885,
    "e_file_not_found": 3519,
    "e_no_dir_space": 0,
    "e_no_inode_space": 0,
    "e_no_space": 0,
    "e_not_mounted": 0,
    "e_could_not_unmount": 0,
    "e_stale_fh": 0,
    "e_could_not_mount": 0,
    "e_dir_not_found": 0,
    "elapsed-time": "      5.0",
    "total-errors": "        0"
}

This is invalid, because there is no closing square bracket, so the JSON decoder gets confused and expects another dict (separated by ','). Unfortunately, there is nothing to fix on the fs-drift side here. Instead, I'd recommend just passing the whole file contents to the JSON decoder and working with the resulting structure (a list of dicts).
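
For illustration, here is a minimal sketch of that approach, assuming the counters file holds a single JSON list of snapshot dicts as in the example above (the directory path and printed keys are just placeholders from this reproduction, not the actual trigger_fs_drift.py fix):

import json
import os

rsptime_dir = "/tmp/mydir/network-shared/"  # placeholder path
for fn in os.listdir(rsptime_dir):
    if fn.startswith("counters") and fn.endswith("json"):
        pathnm = os.path.join(rsptime_dir, fn)
        # parse the whole file at once instead of reassembling it line by line
        with open(pathnm, "r") as f:
            snapshots = json.load(f)  # list of dicts, one per counter snapshot
        for snapshot in snapshots:
            print(snapshot["created"], snapshot["elapsed-time"])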

Hope I helped, Sam

bengland2 commented 2 years ago

thank you for the analysis Sam, I'll take a look at the trigger module and see why this happened.

bengland2 commented 2 years ago

@spetrovi this is worse than I thought. I absolutely should load the counter data with json.load() and then generate separate docs for each counter snapshot; it's much easier that way. Coding that up now.
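
A rough sketch of what that could look like (hypothetical; the helper name, signature, and extra document fields are assumptions, not the actual benchmark-wrapper change):

import json

def emit_counter_docs(counters_path, uuid, sample):
    # Yield one Elasticsearch document per counter snapshot (hypothetical helper).
    with open(counters_path, "r") as f:
        snapshots = json.load(f)  # list of dicts, one per rates-over-time interval
    for interval, snapshot in enumerate(snapshots):
        doc = dict(snapshot)
        doc.update({"uuid": uuid, "sample": sample, "interval": interval})
        yield doc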

bengland2 commented 2 years ago

@rsevilla87 benchmark-wrapper PR 414 has merged; this should resolve the issue. Let me know if it doesn't. The quay.io build has been triggered but has not yet completed.

rsevilla87 commented 2 years ago

Hey @bengland2, the CI test failed again, though the error now seems unrelated to fs-drift. I've seen this error before; it was caused by an update of the elasticsearch Python client version (the latest versions are not compatible with AWS Elasticsearch). However, this version is pinned in setup.cfg, so it's something we'll have to investigate further. https://github.com/cloud-bulldozer/benchmark-wrapper/blob/master/setup.cfg#L24

[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/fs-drift] RUN STATUS DONE
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_default_ingress_controller
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_dns
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_kube_apiserver
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_kube_controllermanager
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_network_operator
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Unknown indexing error: The client noticed that the server is not a supported distribution of Elasticsearch
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Indexing exception found The client noticed that the server is not a supported distribution of Elasticsearch
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Closing Redis connection
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Attempting to close ES connection

bengland2 commented 2 years ago

maybe the CI test script is targeting the wrong ES server? Will investigate.

bengland2 commented 2 years ago

Note that even backpack fails to log data, so this really has nothing to do with fs-drift per se; it may be related to how the container image is built. If we are going to log this error, it would be nice if it also logged the client and server ES versions encountered.
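
For example, something along these lines could report both versions when indexing fails (a sketch only, not the actual backpack code; it assumes elasticsearch-py, which exposes the client version as elasticsearch.__versionstr__ and the server version via the info() call):

import logging

import elasticsearch

log = logging.getLogger("backpack")  # placeholder logger name

def log_es_versions(es):
    # Log the client and server Elasticsearch versions for easier debugging.
    client_version = elasticsearch.__versionstr__
    try:
        server_version = es.info()["version"]["number"]
    except Exception as err:
        server_version = "unknown ({})".format(err)
    log.info("ES client version: %s, ES server version: %s", client_version, server_version)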

[bengland@localhost ripsaw]$ oc logs -f fs-drift-cephfs-client-3b4ffc0f-1--1-qvmgn -c backpack
Input dictionary is empty for module ocp_default_ingress_controller
Input dictionary is empty for module ocp_dns
Input dictionary is empty for module ocp_kube_apiserver
Input dictionary is empty for module ocp_kube_controllermanager
Input dictionary is empty for module ocp_network_operator
Unknown indexing error: The client noticed that the server is not a supported distribution of Elasticsearch
Indexing exception found The client noticed that the server is not a supported distribution of Elasticsearch
Closing Redis connection
Attempting to close ES connection
uuid: 3b4ffc0f-7430-53c6-a683-760677e4b3c8

bengland2 commented 2 years ago

@rsevilla87 fs-drift itself worked! And here is the server ES version. I expect that with the fix you just added it should work for backpack also.

2022-03-21T14:35:43Z - INFO - MainProcess - run_snafu: Using index prefix for ES: ripsaw-fs-drift
2022-03-21T14:35:43Z - INFO - MainProcess - run_snafu: Connected to the elasticsearch cluster with info as follows:
2022-03-21T14:35:43Z - INFO - MainProcess - run_snafu: {
    "name": "5619b1dd33c821625f405feab5045dcd",
    "cluster_name": "415909267177:perfscale-dev",
    "cluster_uuid": "Xz2IU4etSieAeaO2j-QCUw",
    "version": {
        "number": "7.10.2",
        "build_flavor": "oss",
        "build_type": "tar",
        "build_hash": "unknown",
        "build_date": "2021-09-29T11:42:59.634166Z",
        "build_snapshot": false,
        "lucene_version": "8.7.0",
        "minimum_wire_compatibility_version": "6.8.0",
        "minimum_index_compatibility_version": "6.0.0-beta1"
    },
    "tagline": "You Know, for Search"
}
...
2022-03-21T14:40:51Z - INFO - MainProcess - trigger_fs_drift: process 149 intervals from rates-over-time file counters.01.fs-drift-cephfs-client-3b4ffc0f-1--1-qvmgn.json
2022-03-21T14:40:51Z - INFO - MainProcess - run_snafu: Indexed results - 450 success, 0 duplicates, 0 failures, with 0 retries.
2022-03-21T14:40:51Z - INFO - MainProcess - run_snafu: Duration of execution - 0:05:08, with total size of 108000 bytes

rsevilla87 commented 2 years ago

Oh right! I didn't initially realize the hanging container was backpack. I think we can close this issue :)

rsevilla87 commented 2 years ago

@bengland2, after re-running fs-drift tests in CI I've seen these messages

    fs_drift-hostpath2/2 in 114 sec ✗ fs_drift-hostpath [112839]
   (from function `kubectl_exec' in file ./helpers.bash, line 91,
    from function `die' in file ./helpers.bash, line 80,
    from function `check_es' in file ./helpers.bash, line 24,
    in test file ./017-fs_drift.bats, line 26)
     `check_es' failed
   benchmark.ripsaw.cloudbulldozer.io/fs-drift-hostpath-benchmark created
   Waiting for UUID from fs-drift-hostpath-benchmark
   Looking for documents with uuid: ddd501e7-b32d-55b6-8d62-d63d8a7672f7 in index ripsaw-fs-drift-results
   Looking for documents with uuid: ddd501e7-b32d-55b6-8d62-d63d8a7672f7 in index ripsaw-fs-drift-rsptimes

   Error message: 0 documents found in index ripsaw-fs-drift-rsptimes

Seems like no documents were found in the ripsaw-fs-drift-rsptimes index. Are we still using it?

bengland2 commented 2 years ago

@rsevilla87 yes, I'm still using it, but if the test isn't long enough then nothing gets put in there and check_es fails, my bad. I'll figure out what to do with that.

rsevilla87 commented 2 years ago

> @rsevilla87 yes, I'm still using it, but if the test isn't long enough then nothing gets put in there and check_es fails, my bad. I'll figure out what to do with that.

Maybe we can increase the duration in both test cases? https://github.com/cloud-bulldozer/benchmark-operator/tree/master/e2e/fs_drift

rsevilla87 commented 2 years ago

Hey @bengland2, the tests under the tests directory are not used by benchmark-operator anymore; they are actually used by benchmark-wrapper. I'll update the correct ones, which are under the e2e directory.

bengland2 commented 2 years ago

Oh, I had no idea that had changed, thanks @rsevilla87

bengland2 commented 2 years ago

Dang, I figured it out: if you run for less than 240 seconds, fs-drift decides to skip response-time processing because there aren't enough samples. That's why the CI test fails; no documents end up in ripsaw-fs-drift-rsptimes in that case. Changing my PR accordingly.