criteo / hwbench

hwbench is a benchmark orchestration tool to automate the low-level testing of servers.
Apache License 2.0

[BUG] Memrate graph generation issue #58

Open ezekriSCW opened 2 weeks ago

ezekriSCW commented 2 weeks ago

Describe the bug: During memrate graph generation on an HPE server, an error prevents the process from finishing, so not all graphs are generated. Note that only one 32G DIMM is present in this server.

To Reproduce: Steps to reproduce the behavior (assuming it is caused by the presence of a single DIMM)

  1. Run hwbench with this command line (on a server with a single 32G DIMM): uv run hwbench -j configs/simple.conf -m monitoring.cfg
  2. Run hwgraph with this command line: uv run hwgraph graph --traces hwbench-out-20241107131337/results.json:DLxxx:BMC.Server --outdir DLxxx_graph
  3. hwgraph crashes with the error below: Fatal: DLxxx/memrate_116: unable to find metric write8/sum_speed

Expected behavior: Graph generation should run to completion with all graphs generated.

Benchmark configuration: The default files simple.conf and monitoring.cfg (with BMC credentials) were used.


anisse commented 1 week ago

What version of stress-ng was used in this case? Can you share the results.json? Or, if that is not possible, just a subset:

jq '.bench.memrate_116' < hwbench-out-20241107131337/results.json
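For anyone without jq at hand, the same lookup can be sketched in Python. This is only an illustration: the `bench` top-level key and the job name come from the jq filter above, but the layout inside a job is not shown in this thread, so the sketch searches the nested data for a key such as `write8` rather than assuming where it lives. The function names `find_key_paths` and `inspect_job` are hypothetical and not part of hwbench.

```python
import json


def find_key_paths(node, key, prefix=""):
    """Return the slash-separated path of every occurrence of `key` in nested JSON data."""
    paths = []
    if isinstance(node, dict):
        for name, child in node.items():
            path = f"{prefix}/{name}" if prefix else name
            if name == key:
                paths.append(path)
            # Keep descending: the key may also appear deeper in the tree.
            paths.extend(find_key_paths(child, key, path))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            path = f"{prefix}/{index}" if prefix else str(index)
            paths.extend(find_key_paths(child, key, path))
    return paths


def inspect_job(results_file, job_name, key):
    """Load results.json and report where `key` appears inside one job."""
    with open(results_file) as handle:
        results = json.load(handle)
    # Same lookup as the jq filter '.bench.<job_name>'.
    job = results["bench"][job_name]
    return find_key_paths(job, key)
```

Calling `inspect_job("hwbench-out-20241107131337/results.json", "memrate_116", "write8")` returns an empty list if no `write8` entry was recorded for that job, which would be consistent with the "unable to find metric" error.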

ezekriSCW commented 1 week ago

stress-ng version: V0.17.04. I have attached an extract from results.json: memrate.json

Thanks @anisse

anisse commented 1 week ago

I have analyzed the output data, and I'm not sure I understand what happened. We would need to solve #60 to have more complete output data. I tried a run on a server with the same CPU: I was not able to reproduce the problem.

If you re-run hwbench, does it always hit the same issue during graph generation?

Also, if you want to analyze the result anyway, it should be possible to remove the memrate_116 job from results.json and re-run hwgraph.
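A minimal sketch of that workaround, assuming jobs sit directly under the top-level `bench` key (as the jq filter `.bench.memrate_116` earlier in this thread suggests). The helper name `drop_job` and the output file name in the usage note are hypothetical. It writes a filtered copy rather than editing results.json in place, so the original data stays available for debugging.

```python
import json


def drop_job(results_file, job_name, output_file):
    """Write a copy of results.json without `job_name`; return True if the job was present."""
    with open(results_file) as handle:
        results = json.load(handle)
    # Assumption: jobs are stored as entries of the top-level "bench" object.
    removed = results.get("bench", {}).pop(job_name, None)
    with open(output_file, "w") as handle:
        json.dump(results, handle, indent=2)
    return removed is not None
```

For example, `drop_job("hwbench-out-20241107131337/results.json", "memrate_116", "hwbench-out-20241107131337/results-filtered.json")` produces a filtered file that can then be passed to hwgraph's `--traces` option in place of the original results.json.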

anisse commented 1 week ago

I tried removing only the memrate_116 job from results.json, and hwgraph can go to the end and generate all its graphs.

ezekriSCW commented 6 days ago

hwbench has been relaunched with 8x32G DIMMs instead of 1x32G DIMM, and all graphs have been generated as expected. Note that I haven't re-run hwbench with a single DIMM as in the initial run, so I cannot reproduce the problem for now.