Yes, fs-drift has changed a lot. Sam Petrovic in Brno has merged his version of it with mine, so perhaps we broke benchmark-operator. I'll get right on this tomorrow AM.
Hey, yeah, fs-drift went through a lot of changes recently, so unfortunately we expect some stuff breaking here and there. However, this bug is pretty peculiar. I can see the code is failing when trying to parse the counters JSON output by fs-drift, but honestly, I didn't touch this output generator much. Also, when I run fs-drift on my machine, it produces valid JSON. However, when I try to parse it with the procedure from trigger_fs_drift.py, I get the same error. Copied from trigger_fs_drift.py:
import json
import os

rsptime_dir = '/tmp/mydir/network-shared/'
for fn in os.listdir(rsptime_dir):
    if fn.startswith("counters") and fn.endswith("json"):
        pathnm = os.path.join(rsptime_dir, fn)
        with open(pathnm, "r") as f:
            records = [line.strip() for line in f.readlines()]
        json_start = 0
        for index, record in enumerate(records):
            print(record)
            if record == "{":
                json_start = index
            if record == "}{" or record == "}":
                # extract next JSON string from counter logfile
                json_str = " ".join(records[json_start:index])
                json_str += " }"
                if record == "}{":
                    records[index] = "{"
                    json_start = index
                json_obj = json.loads(json_str)
The culprit is in the building of json_str. Here is a valid JSON file produced by fs-drift:
[{
"created": 1141,
"deleted": 122,
"softlinked": 14,
"hardlinked": 102,
"appended": 7132,
"randomly_written": 1169,
"sequentially_read": 2376,
"randomly_read": 1234,
"renamed": 1198,
"truncated": 67,
"remounted": 0,
"readdir": 138,
"read_requests": 2376,
"read_bytes": 8789830,
"randread_requests": 1233,
"randread_bytes": 2431366,
"write_requests": 8273,
"write_bytes": 17014543,
"randwrite_requests": 1169,
"randwrite_bytes": 2233762,
"randdiscard_requests": 0,
"randdiscard_bytes": 0,
"fsyncs": 1942,
"fdatasyncs": 941,
"dirs_created": 0,
"e_already_exists": 4885,
"e_file_not_found": 3519,
"e_no_dir_space": 0,
"e_no_inode_space": 0,
"e_no_space": 0,
"e_not_mounted": 0,
"e_could_not_unmount": 0,
"e_stale_fh": 0,
"e_could_not_mount": 0,
"e_dir_not_found": 0,
"elapsed-time": " 5.0",
"total-errors": " 0"
}
]
So what happens is: in the innermost for block, the code cycles through all the lines up to the second-to-last one (the closing curly bracket), which passes the conditional. It then joins the lines to form json_str and slaps a closing curly bracket on the end, so the resulting json_str contains this string:
[{
"created": 1141,
"deleted": 122,
"softlinked": 14,
"hardlinked": 102,
"appended": 7132,
"randomly_written": 1169,
"sequentially_read": 2376,
"randomly_read": 1234,
"renamed": 1198,
"truncated": 67,
"remounted": 0,
"readdir": 138,
"read_requests": 2376,
"read_bytes": 8789830,
"randread_requests": 1233,
"randread_bytes": 2431366,
"write_requests": 8273,
"write_bytes": 17014543,
"randwrite_requests": 1169,
"randwrite_bytes": 2233762,
"randdiscard_requests": 0,
"randdiscard_bytes": 0,
"fsyncs": 1942,
"fdatasyncs": 941,
"dirs_created": 0,
"e_already_exists": 4885,
"e_file_not_found": 3519,
"e_no_dir_space": 0,
"e_no_inode_space": 0,
"e_no_space": 0,
"e_not_mounted": 0,
"e_could_not_unmount": 0,
"e_stale_fh": 0,
"e_could_not_mount": 0,
"e_dir_not_found": 0,
"elapsed-time": " 5.0",
"total-errors": " 0"
}
This is invalid, because there is no closing square bracket, so the JSON decoder gets confused and expects another dict (separated by ','). Unfortunately, there is nothing to debug on the fs-drift side to remedy this. Instead, I'd recommend just passing the whole file contents to the JSON decoder and working with the resulting structure (a list of dicts).
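For illustration, a minimal sketch of that approach (the directory and filename pattern are copied from the snippet above, and the printed keys come from the sample counters file; this is not the actual trigger_fs_drift.py code):

import json
import os

rsptime_dir = '/tmp/mydir/network-shared/'
for fn in os.listdir(rsptime_dir):
    if fn.startswith("counters") and fn.endswith("json"):
        pathnm = os.path.join(rsptime_dir, fn)
        with open(pathnm, "r") as f:
            # the counters file is one valid JSON document:
            # a list of dicts, one dict per counter snapshot
            snapshots = json.load(f)
        for snapshot in snapshots:
            print(snapshot["elapsed-time"], snapshot["created"], snapshot["deleted"])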
Hope I helped, Sam
Thank you for the analysis, Sam. I'll take a look at the trigger module and see why this happened.
@spetrovi this is worse than I thought. I absolutely should load the counter data with json.load() and then generate separate docs for each counter snapshot; much easier that way. Coding that up now.
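A rough sketch of what that could look like (the function name and the extra metadata fields are assumptions for illustration, not the actual benchmark-wrapper schema):

import json

def counter_docs(counters_path, uuid):
    # load the whole counters file at once: a list of dicts, one per snapshot
    with open(counters_path, "r") as f:
        snapshots = json.load(f)
    # emit a separate document for each counter snapshot
    for interval, snapshot in enumerate(snapshots):
        doc = dict(snapshot)
        doc["uuid"] = uuid          # assumed metadata field
        doc["interval"] = interval  # assumed metadata field
        yield doc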
@rsevilla87 benchmark-wrapper PR 414 has merged; this should resolve this issue. Let me know if it doesn't. The quay.io build has been triggered but has not yet completed.
Hey @bengland2, the CI test failed again, though the error now seems unrelated to fs-drift. I've seen this error before, and it was caused by an update of the elasticsearch python client version (the latest versions are not compatible with AWS Elasticsearch). However, this version is pinned in setup.cfg, so it's something we'll have to investigate deeper. https://github.com/cloud-bulldozer/benchmark-wrapper/blob/master/setup.cfg#L24
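For what it's worth, one quick way to confirm which client version actually ends up inside the container image (as opposed to the pin in setup.cfg) is to check it at runtime; elasticsearch-py exposes its own version string:

import elasticsearch

# elasticsearch-py 7.14 and later verify the server product and refuse to talk
# to non-Elastic distributions such as the AWS-hosted service, which is what
# produces the "not a supported distribution of Elasticsearch" error below
print(elasticsearch.__versionstr__)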
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/fs-drift] RUN STATUS DONE
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_default_ingress_controller
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_dns
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_kube_apiserver
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_kube_controllermanager
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Input dictionary is empty for module ocp_network_operator
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Unknown indexing error: The client noticed that the server is not a supported distribution of Elasticsearch
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Indexing exception found The client noticed that the server is not a supported distribution of Elasticsearch
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Closing Redis connection
[pod/fs-drift-benchmark-client-3f5a0b88-1-5xlqb/backpack] Attempting to close ES connection
Maybe the CI test script is targeting the wrong ES server? Will investigate.
Note that even backpack fails to log data, so this really has nothing to do with fs-drift per se, but it may be related to how the container image is built. If we are going to log this error, it would be nice if it logged the client and server ES versions encountered.
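A small sketch of what logging both versions could look like (the es client handle and logger wiring are assumed here; this is not the existing backpack code):

import logging
import elasticsearch

log = logging.getLogger(__name__)

def log_es_versions(es):
    # client side: version of the installed elasticsearch-py library
    log.info("elasticsearch-py client version: %s", elasticsearch.__versionstr__)
    # server side: version reported by the cluster itself
    log.info("ES server version: %s", es.info()["version"]["number"])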
[bengland@localhost ripsaw]$ oc logs -f fs-drift-cephfs-client-3b4ffc0f-1--1-qvmgn -c backpack
Input dictionary is empty for module ocp_default_ingress_controller
Input dictionary is empty for module ocp_dns
Input dictionary is empty for module ocp_kube_apiserver
Input dictionary is empty for module ocp_kube_controllermanager
Input dictionary is empty for module ocp_network_operator
Unknown indexing error: The client noticed that the server is not a supported distribution of Elasticsearch
Indexing exception found The client noticed that the server is not a supported distribution of Elasticsearch
Closing Redis connection
Attempting to close ES connection
uuid: 3b4ffc0f-7430-53c6-a683-760677e4b3c8
@rsevilla87 fs-drift itself worked! And here is the server ES version. I expect that with the fix you just added it should work for backpack also.
2022-03-21T14:35:43Z - INFO - MainProcess - run_snafu: Using index prefix for ES: ripsaw-fs-drift
2022-03-21T14:35:43Z - INFO - MainProcess - run_snafu: Connected to the elasticsearch cluster with info as follows:
2022-03-21T14:35:43Z - INFO - MainProcess - run_snafu: {
  "name": "5619b1dd33c821625f405feab5045dcd",
  "cluster_name": "415909267177:perfscale-dev",
  "cluster_uuid": "Xz2IU4etSieAeaO2j-QCUw",
  "version": {
    "number": "7.10.2",
    "build_flavor": "oss",
    "build_type": "tar",
    "build_hash": "unknown",
    "build_date": "2021-09-29T11:42:59.634166Z",
    "build_snapshot": false,
    "lucene_version": "8.7.0",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}
...
2022-03-21T14:40:51Z - INFO - MainProcess - trigger_fs_drift: process 149 intervals from rates-over-time file counters.01.fs-drift-cephfs-client-3b4ffc0f-1--1-qvmgn.json
2022-03-21T14:40:51Z - INFO - MainProcess - run_snafu: Indexed results - 450 success, 0 duplicates, 0 failures, with 0 retries.
2022-03-21T14:40:51Z - INFO - MainProcess - run_snafu: Duration of execution - 0:05:08, with total size of 108000 bytes
Oh right, I didn't initially realize the hanging container was backpack. I think we can close this issue :)
@bengland2, after re-running fs-drift tests in CI I've seen these messages:
fs_drift-hostpath 2/2 in 114 sec
✗ fs_drift-hostpath [112839]
(from function `kubectl_exec' in file ./helpers.bash, line 91,
from function `die' in file ./helpers.bash, line 80,
from function `check_es' in file ./helpers.bash, line 24,
in test file ./017-fs_drift.bats, line 26)
`check_es' failed
benchmark.ripsaw.cloudbulldozer.io/fs-drift-hostpath-benchmark created
Waiting for UUID from fs-drift-hostpath-benchmark
Looking for documents with uuid: ddd501e7-b32d-55b6-8d62-d63d8a7672f7 in index ripsaw-fs-drift-results
Looking for documents with uuid: ddd501e7-b32d-55b6-8d62-d63d8a7672f7 in index ripsaw-fs-drift-rsptimes
Error message: 0 documents found in index ripsaw-fs-drift-rsptimes
Seems like no documents were found in the ripsaw-fs-drift-rsptimes index. Are we still using it?
@rsevilla87 yes, I'm still using it, but if the test isn't long enough, nothing gets put in there and the check_es check fails; my bad. I'll figure out what to do with that.
Maybe we can increase the duration in both test cases? https://github.com/cloud-bulldozer/benchmark-operator/tree/master/e2e/fs_drift
Hey @bengland2, the tests under the tests directory are not used in benchmark-operator anymore; they are actually used by benchmark-wrapper. I'll update the correct ones, which are under the e2e directory.
Oh, I had no idea that had changed, thanks @rsevilla87
Dang, I figured it out: if you run for less than 240 seconds, fs-drift decides to skip response-time processing because there are not enough samples. That's why the CI test fails; there will then be no documents in ripsaw-fs-drift-rsptimes. Changing my PR accordingly.
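Roughly, the behavior described above amounts to a guard like this hypothetical sketch (the names and the exact comparison are assumptions; only the 240-second threshold comes from the comment above):

# below this duration there are too few response-time samples to be worth processing
MIN_DURATION_FOR_RSPTIMES = 240  # seconds

def should_process_rsptimes(duration):
    # when this returns False, no documents are indexed into
    # ripsaw-fs-drift-rsptimes and check_es finds nothing
    return duration >= MIN_DURATION_FOR_RSPTIMES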
fs-drift e2e tests are hanging with the following traceback
More info and logs at https://github.com/cloud-bulldozer/benchmark-operator/actions/runs/1960169819
Maybe a recent change in fs-drift @bengland2?