[pbench-agent-0.71.0-130ga9672a901.noarch] pbench-trafficgen failed before start iteration

pradiptapks commented 3 years ago

Using the pbench test repo, I updated the pbench-agent package. However, a test run of pbench-trafficgen failed, reporting that the tools-default directory does not exist.

Platform: Red Hat Enterprise Linux release 8.2 (Ootpa)

Steps to reproduce:

1. Upgrade the pbench packages

# rpm -qa *pbench*
pbench-perl-JSON-XS-4.02-1.x86_64
pbench-sysstat-12.0.3-1.x86_64
pbench-agent-0.71.0-130ga9672a901.noarch
pbench-perl-common-sense-3.74-1.x86_64
pbench-perl-Types-Serialiser-1.0-1.noarch

2. Install the tools

# pbench-list-tools 
default: perf122.perf.lab.eng.bos.redhat.com ['iostat', 'mpstat', 'perf', 'pidstat', 'proc-interrupts', 'proc-vmstat', 'sar', 'turbostat']
default: 10.16.31.48 ['iostat', 'mpstat', 'perf', 'pidstat', 'proc-interrupts', 'proc-vmstat', 'sar', 'turbostat']
default: 192.168.24.8 ['iostat', 'mpstat', 'openvswitch', 'perf', 'pidstat', 'proc-interrupts', 'proc-vmstat', 'sar', 'turbostat']

3. Trafficgen command details for execution


PCI_INFO="06:00.0,06:00.1,08:00.0,08:00.1,85:00.0,85:00.1,83:00.0,83:00.1"
CONFIG="Trial"
PROFILE="/opt/trafficgen/trex-profiles/test/psahoo-profile-bs.json"
SAMPLE="1"
LOSS="0.0"
RATE="100"
UNIT="%"
LOG="/root/logs/${CONFIG}.log"
SEARCH_TIME="60"
VALIDATION_TIME="120"

time pbench-trafficgen --traffic-generator=trex-txrx-profile \
    --devices=$PCI_INFO --traffic-profile=$PROFILE --rate=$RATE --rate-unit=$UNIT \
    --samples=$SAMPLE --max-loss-pct=$LOSS --config=$CONFIG \
    --tool-period=binary-search --skip-git-pull \
    --search-runtime=$SEARCH_TIME --validation-runtime=$VALIDATION_TIME \
    -- --rate-tolerance-failure=fail --disable-upward-search \
    --loss-granularity=segment 2>&1 | tee $LOG

4. Error log

<..trim..> trex-server is ready
Total number of benchmark iterations: 1
Starting iteration[1-psahoo-profile-bs.json-0.0pct_drop] (1 of 1)
test sample 1 of 1
[pbench-tool-trigger] starting trigger processing of STDIN using tool group default triggers at /var/lib/pbench-agent/tools-v1-default/trigger
[pbench-tool-trigger] start-trigger:"Starting binary-search" stop-trigger:"Finished binary-search"
[2021-02-02 03:01:00.439716][BSO] Namespace(active_device_pairs='0:1,2:3,4:5,6:7', device_pairs='0:1,2:3,4:5,6:7', disable_upward_search=True, dst_ips='', dst_macs='', dst_ports='', duplicate_packet_failure_mode='quit', enable_flow_cache=True, enable_segment_monitor=False, enable_trex_profiler=True, encap_dst_ips='', encap_dst_macs='', encap_src_ips='', encap_src_macs='', frame_size='64', latency_device_pair='--', latency_rate=1000, loss_granularity='segment', max_loss_pct=0.0, max_retries=1, measure_latency=1, min_rate=0.0, negative_packet_loss_mode='quit', no_promisc=False, num_flows=1024, one_shot=0, output_dir='/var/lib/pbench-agent/trafficgen_Trial_tg:trex-profile_pf:psahoo-profile-bs.json_ml:0.0_tt:bs_2021-02-02T03:00:29/1-psahoo-profile-bs.json-0.0pct_drop/sample1', packet_protocol='UDP', pre_trial_cmd='', process_all_profiler_data=False, random_seed=0.3089222808406439, rate=0.0, rate_tolerance=3.0, rate_tolerance_failure='fail', rate_unit='mpps', repeat_final_validation=False, runtime_tolerance=5, search_granularity=0.1, search_runtime=60, send_teaching_measurement=False, send_teaching_warmup=False, sniff_runtime=30, src_ips='', src_macs='', src_ports='', stream_mode='continuous', teaching_measurement_interval=10.0, teaching_measurement_packet_rate=1000, teaching_measurement_packet_type='', teaching_warmup_packet_rate=1000, teaching_warmup_packet_type='', traffic_direction='bidirectional', traffic_generator='trex-txrx-profile', traffic_profile='/var/lib/pbench-agent/trafficgen_Trial_tg:trex-profile_pf:psahoo-profile-bs.json_ml:0.0_tt:bs_2021-02-02T03:00:29/1-psahoo-profile-bs.json-0.0pct_drop/psahoo-profile-bs.json', trex_host='localhost', trex_profiler_interval=3.0, trial_gap=0, use_device_stats=False, use_dst_ip_flows=1, use_dst_mac_flows=1, use_dst_port_flows=0, use_encap_dst_ip_flows=0, use_encap_dst_mac_flows=0, use_encap_src_ip_flows=0, use_encap_src_mac_flows=0, use_protocol_flows=0, use_src_ip_flows=1, use_src_mac_flows=1, use_src_port_flows=0, validation_runtime=120, vlan_ids='', vxlan_ids='', warmup_traffic_profile='', warmup_trial=False, warmup_trial_runtime=30)
[2021-02-02 03:01:00.439949][BSO] The trex-txrx-profile traffic generator does not support --rate-unit=mpps
[error][2021-02-02T03:01:00.458058888] iteration 1-psahoo-profile-bs.json-0.0pct_drop sample 1 returned non-zero exit code - 1
[error][2021-02-02T03:01:00.467098421] [pbench-stop-tools] expected tool output directory, "/var/lib/pbench-agent/trafficgen_Trial_tg:trex-profile_pf:psahoo-profile-bs.json_ml:0.0_tt:bs_2021-02-02T03:00:29/1-default/sample1/tools-default", does not exist
tool triggers did not fire for iteration/sample, '1-psahoo-profile-bs.json-0.0pct_drop/sample1'
[error][2021-02-02T03:01:00.470171817] Aborting benchmark
killing existing trex server

real 0m36.902s



5. Pbench log: 
http://pbench.perf.lab.eng.bos.redhat.com/results/perf122.perf.lab.eng.bos.redhat.com/nfv-osp16.1-rt-latency-trial/trafficgen_RHOS-16.1-RHEL-8-20201124.n.0-RT-OVS-OFFLOAD-1Q-PVP-NoLoss-BS_tg:trex-profile_pf:psahoo-profile-bs.json_ml:0.0_tt:bs_2021-02-01T14:53:32/

pradiptapks commented 3 years ago

@portante On perf122, I am downgrading pbench so that I can continue my binary-search test. Please let me know once a fix for this is available.

portante commented 3 years ago

Our first attempt at a fix for this was in PR #2071, but it was flawed: the change caused other problems.

Our second attempt is now in PR #2090, where we are working first against the b0.69 branch to methodically (via lots of small PRs) massage the code into a better state to address the issue. PR #2090 addresses the same problem as PR #2071 by back-porting the originally proposed fix, while working to ensure that nothing else in the code breaks.

Further work will be required in order to ensure the generated result.json files in the iteration hierarchies are valid and work with the dashboard code.

webbnh commented 2 years ago

> Our first attempt at a fix for this was in PR https://github.com/distributed-system-analysis/pbench/pull/2071

This report looks like an incompatibility between Pbench and Trafficgen on the invocation side, not on the output interpretation side (which is what I think #2071 addresses): it looks like the benchmark script is specifying --rate-unit=mpps to the traffic generator and it is failing as a result, which then causes knock-on effects, like the output directory not existing.

However, the logging indicates that pbench-trafficgen was invoked with --rate-unit=%, so I'm not sure how it ended up falling back on the default "mpps", but that's a second problem.
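One way to confirm this mismatch on a given run is to compare the unit passed on the command line with the `rate_unit` value that binary-search prints in its `[BSO] Namespace(...)` line. The following is only a minimal sketch, assuming the run's output was captured to a file (as with `tee $LOG` above) and that the file contains that Namespace line; the function names `effective_rate_unit` and `check_rate_unit` are hypothetical helpers, not part of pbench or trafficgen.

```shell
#!/bin/sh
# Sketch: compare the rate unit requested on the command line with the one
# reported in the "[BSO] Namespace(...)" line of a captured log.
# Both function names are hypothetical; they are not pbench commands.

# effective_rate_unit LOGFILE
# Prints the first rate_unit value found in the log (empty if none is found).
effective_rate_unit() {
    grep -o "rate_unit='[^']*'" "$1" | head -n 1 | cut -d "'" -f 2
}

# check_rate_unit REQUESTED LOGFILE
# Returns 0 when the requested and effective units match, 1 otherwise.
check_rate_unit() {
    requested=$1
    effective=$(effective_rate_unit "$2")
    if [ "$requested" = "$effective" ]; then
        echo "rate unit OK: $effective"
    else
        echo "rate unit mismatch: requested '$requested', effective '$effective'"
        return 1
    fi
}
```

With the variables from the report, `check_rate_unit "$UNIT" "$LOG"` would be expected to report a mismatch between `%` and `mpps`, which narrows the problem to how the option is passed along rather than to the tool output handling.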

portante commented 2 years ago

TrafficGen is no longer supported as of v0.71.

[pbench-agent-0.71.0-130ga9672a901.noarch] pbench-trafficgen failed before start iteration #2088