amzn / amzn-drivers

Official AWS drivers repository for Elastic Network Adapter (ENA) and Elastic Fabric Adapter (EFA)

ena_com_get_dev_basic_stats spin more than 1 sec #140

Closed hhaim closed 2 years ago

hhaim commented 4 years ago
1 0x5584cf3f0faa ./_t-rex-64(+0x194faa) [0x5584cf3f0faa]
2 0x7f0f006668a0 /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0) [0x7f0f006668a0]
3 0x7f0f00661f85 pthread_cond_timedwait + 821
4 0x5584cf80e393 ena_com_execute_admin_command + 435
5 0x5584cf810659 ena_com_get_dev_basic_stats + 57
6 0x5584cf808129 ./_t-rex-64(+0x5ac129) [0x5584cf808129]
7 0x5584cf6d8020 rte_eth_stats_get + 128
8 0x5584cf490f30 CTRexExtendedDriverBase::get_extended_stats_fixed(CPhyEthIF*, CPhyEthIFStats*, int, int) + 32
9 0x5584cf362922 CPhyEthIF::get_extended_stats() + 28
10 0x5584cf362b67 CPhyEthIF::update_counters() + 17
11 0x5584cf36c73e CGlobalTRex::update_stats() + 56
12 0x5584cf36d657 CGlobalTRex::sync_threads_stats() + 9
13 0x5584cf36e457 CGlobalTRex::port_stats_to_json(Json::Value&, unsigned char) + 17
14 0x5584cf5440e2 TrexRpcCmdGetPortStats::_run(Json::Value const&, Json::Value&) + 66
15 0x5584cf534911 TrexRpcCommand::run(Json::Value const&, Json::Value&) + 81
16 0x5584cf530474 JsonRpcMethod::_execute(Json::Value&) + 52
17 0x5584cf52dbd3 TrexJsonRpcV2ParsedObject::execute(Json::Value&) + 131
18 0x5584cf52c16d TrexRpcServerReqRes::process_request_raw(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) + 541
19 0x5584cf52c90e TrexRpcServerReqRes::process_zipped_request(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) + 462
20 0x5584cf52cce5 TrexRpcServerReqRes::handle_request(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 469
21 0x5584cf52d2fb TrexRpcServerReqRes::_rpc_thread_cb_int() + 1339
22 0x5584cf52da2b TrexRpcServerReqRes::_rpc_thread_cb() + 11
23 0x7f0efff6b27f so/x86_64/libstdc++.so.6(+0xba27f) [0x7f0efff6b27f]
24 0x7f0f0065b6db /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0f0065b6db]
25 0x7f0eff62ba3f clone + 63

This is a new crash in https://github.com/cisco-system-traffic-generator/trex-core. It means that ena_com_get_dev_basic_stats spins for more than 1 sec. Is that expected?

AWSNB commented 4 years ago

Can you share the instance type and instance-id ?

hhaim commented 4 years ago

@AWSNB I have forwarded the question. I think it is an EC2 C5 instance type.

hhaim commented 4 years ago

@AWSNB one more thing: it happens for some instances and not for others. It is a rather new thing, from the past week.

jaygmuru commented 4 years ago

happened on t3.xlarge i-089921d082eaf5bc6

AWSNB commented 4 years ago

We’ll check on our side

But in general, t3 is a burstable instance with CPU credits, and when the instance runs out of credits it drops down to baseline performance. I'm not saying this is the root cause yet, because we need to debug it.

May I ask you to run the same test on an m5.xlarge and compare?

AWSNB commented 4 years ago

Reference to T2/T3 burst behavior: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-credits-baseline-concepts.html

From: "Bshara, Nafea" nafea@amazon.com Date: Monday, August 24, 2020 at 11:54 PM To: amzn/amzn-drivers reply@reply.github.com, amzn/amzn-drivers amzn-drivers@noreply.github.com Cc: Mention mention@noreply.github.com Subject: Re: [amzn/amzn-drivers] ena_com_get_dev_basic_stats spin more than 1 sec (#140)

We’ll check on our side

But in general, t3 is an burstable instance, with cpu credits and when instance run out of credit it goes down to base performance. I’m not saying this is the root cause yet, because we need to debug it

May I ask to run same test on an m5.xlarge and compare ?

From: jaygmuru notifications@github.com Reply-To: amzn/amzn-drivers reply@reply.github.com Date: Monday, August 24, 2020 at 11:27 PM To: amzn/amzn-drivers amzn-drivers@noreply.github.com Cc: "Bshara, Nafea" nafea@amazon.com, Mention mention@noreply.github.com Subject: Re: [amzn/amzn-drivers] ena_com_get_dev_basic_stats spin more than 1 sec (#140)

happend on t3.xlarge i-089921d082eaf5bc6

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/amzn/amzn-drivers/issues/140#issuecomment-679827865, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFTRWCIFPPKC7LMBLUV63CLSCNKRNANCNFSM4QKIPKTQ.

hhaim commented 4 years ago

@AWSNB thanks for the quick answer. We can disable the watchdog to verify whether it's only a 1 sec spin and not more. @jaygmuru could you disable the watchdog by adding --no-watchdog to the invocation? In any case, I don't think you want to run on an instance that gets stuck for 1 sec for traffic-generation purposes.

mpastyl commented 3 years ago

(sorry for necrobumping) Were there any new insights on this? We see the same issue on a c5n.xlarge running VPP.

yastreb78 commented 3 years ago

@mpastyl Please share instance-id and time

mpastyl commented 3 years ago

@yastreb78 Thank you for your reply. Happened on i-00ffbaea3c3768628 at roughly Sept 16 09:15:24 UTC

I have appended the stack trace below. Basically, we kill the application automatically if it fails to reply within a timeout, hence the signal at frame 4. Please note one difference from the original issue: the calling function is ena_com_get_eni_stats, not ena_com_get_dev_basic_stats, but both spin in the same function afterwards.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ff0d20a0859 in __GI_abort () at abort.c:79
#2  0x000055beccd8afbe in os_exit (code=code@entry=1) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vpp/vnet/main.c:433
#3  0x00007ff0d23cb77c in unix_signal_handler (signum=11, si=<optimized out>, uc=<optimized out>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlib/unix/main.c:187
#4  <signal handler called>
#5  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fef904ddf70, clockid=<optimized out>, expected=0, futex_word=0xac0203538) at ../sysdeps/nptl/futex-internal.h:320
#6  __pthread_cond_wait_common (abstime=0x7fef904ddf70, clockid=<optimized out>, mutex=0xac0203540, cond=0xac0203510) at pthread_cond_wait.c:520
#7  __pthread_cond_timedwait (cond=0xac0203510, mutex=0xac0203540, abstime=0x7fef904ddf70) at pthread_cond_wait.c:656
#8  0x00007fefd0eb7d86 in ena_com_execute_admin_command () from /usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so
#9  0x00007fefd0eba0c5 in ena_com_get_eni_stats () from /usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so
#10 0x00007fefd0eb1d7b in ena_copy_eni_stats () from /usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so
#11 0x00007fefd0eb201c in ena_xstats_get () from /usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so
#12 0x00007fefd0d4dea9 in rte_eth_xstats_get () from /usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so
#13 0x00007fefd137ff75 in dpdk_get_xstats (xd=0x7fefd4bc1540) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/plugins/dpdk/device/dpdk_priv.h:66
#14 dpdk_update_counters (now=<optimized out>, xd=0x7fefd4bc1540) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/plugins/dpdk/device/dpdk_priv.h:116
#15 format_dpdk_device (s=0x7fefd4bd8ef0 "eth_Gp", ' ' <repeats 29 times>, "2     up   eth_Gp\n  Link speed: unknown\n  Ethernet address 06:79:e4:a7:a9:8f\n  ", args=<optimized out>)
    at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/plugins/dpdk/device/format.c:556
#16 0x00007ff0d22bb6ad in do_percent (va=0x7fef904de850, fmt=<optimized out>, _s=<synthetic pointer>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vppinfra/format.c:373
#17 va_format (s=0x7fefd4bd8ef0 "eth_Gp", ' ' <repeats 29 times>, "2     up   eth_Gp\n  Link speed: unknown\n  Ethernet address 06:79:e4:a7:a9:8f\n  ", fmt=<optimized out>, va=va@entry=0x7fef904de850)
    at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vppinfra/format.c:404
#18 0x00007ff0d22bca68 in format (s=<optimized out>, fmt=fmt@entry=0x7ff0d3363c4c "\n%U%U") at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vppinfra/format.c:428
#19 0x00007ff0d2da000a in format_vnet_hw_interface (s=<optimized out>, args=<optimized out>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vnet/interface_format.c:218
#20 0x00007ff0d22bb6ad in do_percent (va=0x7fef904dea78, fmt=<optimized out>, _s=<synthetic pointer>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vppinfra/format.c:373
#21 va_format (s=s@entry=0x0, fmt=<optimized out>, va=va@entry=0x7fef904dea78) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vppinfra/format.c:404
#22 0x00007ff0d2364b5d in vlib_cli_output (vm=vm@entry=0x7ff0d23e7680 <vlib_global_main>, fmt=fmt@entry=0x7ff0d3335463 "%U\n") at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlib/cli.c:707
#23 0x00007ff0d2d9c6e9 in show_or_clear_hw_interfaces (vm=0x7ff0d23e7680 <vlib_global_main>, input=<optimized out>, is_show=1, cmd=<optimized out>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vnet/interface_cli.c:131
#24 0x00007ff0d2365402 in vlib_cli_dispatch_sub_commands (vm=vm@entry=0x7ff0d23e7680 <vlib_global_main>, cm=cm@entry=0x7ff0d23e78d0 <vlib_global_main+592>, input=input@entry=0x7fef904dede0, parent_command_index=<optimized out>)
    at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlib/cli.c:574
#25 0x00007ff0d236550e in vlib_cli_dispatch_sub_commands (vm=vm@entry=0x7ff0d23e7680 <vlib_global_main>, cm=cm@entry=0x7ff0d23e78d0 <vlib_global_main+592>, input=input@entry=0x7fef904dede0, parent_command_index=parent_command_index@entry=0)
    at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlib/cli.c:529
#26 0x00007ff0d23659ad in vlib_cli_input (vm=vm@entry=0x7ff0d23e7680 <vlib_global_main>, input=input@entry=0x7fef904dede0, function=function@entry=0x55beccd95560 <inband_cli_output>, function_arg=function_arg@entry=140666894937520)
    at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlib/cli.c:678
#27 0x000055beccd95c26 in vl_api_cli_inband_t_handler (mp=0x7fefd4b8db00) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vpp/api/api.c:233
#28 0x00007ff0d3470c4c in msg_handler_internal (free_it=0, do_it=1, trace_it=<optimized out>, the_msg=0x7fefd4b8db00, am=0x7ff0d3483f40 <api_global_main>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlibapi/api_shared.c:505
#29 vl_msg_api_handler_no_free (the_msg=0x7fefd4b8db00) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlibapi/api_shared.c:710
#30 0x00007ff0d3462431 in vl_socket_process_api_msg (rp=<optimized out>, input_v=<optimized out>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlibmemory/socket_api.c:209
#31 0x00007ff0d3467dcb in vl_api_clnt_process (vm=<optimized out>, node=<optimized out>, f=<optimized out>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlibmemory/vlib_api.c:405
#32 0x00007ff0d237e48b in vlib_process_bootstrap (_a=<optimized out>) at /codebuild/output/src073226041/src/github.com/EMnify/vpp/src/vlib/main.c:1477
#33 0x00007ff0d22c20a4 in clib_calljmp () from /lib/x86_64-linux-gnu/libvppinfra.so.21.01+emnify8.3.0
#34 0x0000000000000000 in ?? ()

ndagan commented 3 years ago

We are looking into this - will keep this thread updated.

shaibran commented 3 years ago

@mpastyl Thanks for the info, we will review the instance data from our side

shaibran commented 3 years ago

Hey, we reviewed the instance logs from our side but we did not observe any errors on the HW side when the issue reproduced. @mpastyl @hhaim We would appreciate it if you could assist with two reproduction attempts:

  1. Check if this is resolved by increasing timeout on application to 4 seconds?
  2. Enable ena_com logs (build the driver with the RTE_LIBRTE_ENA_COM_DEBUG flag and then enable the logs by passing the EAL argument "--log-level=pmd.net.ena.com:8"; if the logging level is a problem on your side, you could decrease it from 8 down to 7 or 6, but lower than that will not provide us any useful information)

We would like to attempt to reproduce this in-house in parallel; we would appreciate it if you could share the following information:

  1. Confirm DPDK version
  2. Confirm instance size and type
  3. Frequency of this issue, if known
  4. What is the utilization of the resources when this happens (how many CPUs, network, etc.)
  5. Any commands, configuration or scenario which can assist us with reproduction.

Thanks in advance.

mpastyl commented 3 years ago

Hi @shaibran, thank you for taking the time to look at this. We have been trying to reproduce and isolate the issue ourselves.

Here are some updates:

To answer your previous questions:

Check if this is resolved by increasing timeout on application to 4 seconds?

This would indeed stop the application from timing out, because we see that the function gets unstuck after roughly 3 sec. However, a spurious 3 sec freeze is a problem for our use case and we would like to avoid it.

  1. Confirm DPDK version: Currently running DPDK v21.02.
  2. Confirm instance size and type: c5n.xlarge.
  3. Frequency of this issue, if known: The problem happens infrequently and at seemingly random intervals. In our current setup (VPP v21.06 -> DPDK v21.02) it happens only 1-2 times per day. We know that it is also related to how often our application queries device stats (currently every 30 sec); setting the query frequency to 1 sec increased the frequency of the problem to 2-3 times per hour.
  4. Utilization of the resources when this happens: There is no noticeable change in resource utilization when the problem happens. We also did some profiling with perf; when the problem happens, perf does not seem to collect any samples for the main thread (the one you can see in the stack trace).
  5. Commands, configuration or scenario which can assist with reproduction: other (hopefully useful) information:
    • The arguments VPP uses to initialize the DPDK EAL: -c e -n 4 --log-level pmd.net.ena.com:8 --in-memory --no-telemetry --file-prefix vpp -a 0000:00:06.0 -a 0000:00:07.0 -a 0000:00:08.0 --main-lcore 1
    • I have tried to reproduce the problem in standalone DPDK by continuously measuring the time it takes to query xstats from the devices, but I do not see any long delays (a minimal sketch of the kind of loop meant here is included below). Maybe the problem only manifests in conjunction with how VPP uses DPDK.
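
Roughly the kind of standalone measurement loop meant above, as an illustrative sketch only: port configure/start and error handling are omitted, the 100 ms reporting threshold is arbitrary, and the EAL arguments are whatever the test instance needs.

/* Illustrative sketch: time rte_eth_xstats_get() on port 0 and report slow calls.
 * Port configure/start and error handling are omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(int argc, char **argv)
{
    struct rte_eth_xstat xstats[256];

    if (rte_eal_init(argc, argv) < 0)   /* pass the usual EAL args, e.g. -a 0000:00:06.0 */
        return EXIT_FAILURE;

    for (;;) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        int n = rte_eth_xstats_get(0, xstats, 256);   /* triggers an ENA admin command */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (n < 0 || ms > 100.0)                      /* arbitrary "slow" threshold */
            printf("xstats_get ret=%d took %.1f ms\n", n, ms);

        struct timespec delay = { .tv_sec = 1, .tv_nsec = 0 };
        nanosleep(&delay, NULL);                      /* query once per second */
    }
    return 0;
}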

Thank you for your time.

shaibran commented 3 years ago

Thank you @mpastyl for the detailed response, we will review this and update

mejedi commented 2 years ago

Together with @mpastyl we've been looking further into this issue. I believe we've identified the root cause and fixed the bug.

We observe the main thread occasionally waiting on a condition variable in the ENA DPDK driver for 3 seconds. This was found with Linux perf by tracing sched_switch events. The backtrace:

        7fa3d1b917b1 __pthread_cond_timedwait
  futex-internal.h:320 (inlined)
        7fa2d08be484 ena_com_wait_and_process_admin_cq_interrupts
  ena_com.c:764 (inlined)
        7fa2d08be484 ena_com_wait_and_process_admin_cq
  ena_com.c:764 (inlined)
        7fa2d08be484 ena_com_execute_admin_command (/usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so)
  ena_com.c:764
        7fa2d08be7d9 ena_get_dev_stats (/usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so)
  ena_com.c:2206
        7fa2d08c038a ena_com_get_dev_basic_stats (/usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so)
  ena_com.c:2240
        7fa2d08b9f82 ena_stats_get (/usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so)
  ena_ethdev.c:970
        7fa2d0786523 rte_eth_stats_get (/usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so)
  rte_ethdev.c:2693
        7fa2d0786597 eth_basic_stats_get (/usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so)
  rte_ethdev.c:3003
        7fa2d0786fdf rte_eth_xstats_get (/usr/lib/x86_64-linux-gnu/vpp_plugins/dpdk_plugin.so)
  rte_ethdev.c:3179
        ...

Checking the source code:

static int ena_com_wait_and_process_admin_cq_interrupts(struct ena_comp_ctx *comp_ctx,
                            struct ena_com_admin_queue *admin_queue)
{
    unsigned long flags = 0;
    int ret;

    ENA_WAIT_EVENT_WAIT(comp_ctx->wait_event, // <-- ena_com.c:764
                admin_queue->completion_timeout);

    /* In case the command wasn't completed find out the root cause.
     * There might be 2 kinds of errors
     * 1) No completion (timeout reached)
     * 2) There is completion but the device didn't get any msi-x interrupt.
     */
    if (unlikely(comp_ctx->status == ENA_CMD_SUBMITTED)) {
        ENA_SPINLOCK_LOCK(admin_queue->q_lock, flags);
        ena_com_handle_admin_completion(admin_queue);
        admin_queue->stats.no_completion++;
        ENA_SPINLOCK_UNLOCK(admin_queue->q_lock, flags);

        if (comp_ctx->status == ENA_CMD_COMPLETED) {
            ena_trc_err("The ena device sent a completion but the driver didn't receive a MSI-X interrupt (cmd %d), autopolling mode is %s\n",
                    comp_ctx->cmd_opcode, admin_queue->auto_polling ? "ON" : "OFF");
            /* Check if fallback to polling is enabled */
            if (admin_queue->auto_polling)
                admin_queue->polling = true;
        } else {
            ena_trc_err("The ena device didn't send a completion for the admin cmd %d status %d\n",
                    comp_ctx->cmd_opcode, comp_ctx->status);
        }

Surprisingly, completion_timeout is exactly 3 s, so it looks like the wait on the condition variable simply times out.

Furthermore, our build had RTE_LIBRTE_ENA_COM_DEBUG enabled and we didn't see any errors, meaning the request actually succeeded. Is it possible that ENA_WAIT_EVENT_WAIT somehow missed the wake-up from ENA_WAIT_EVENT_SIGNAL?

#define q_waitqueue_t                   \
        struct {                        \
                pthread_cond_t cond;    \
                pthread_mutex_t mutex;  \
        }

#define ena_wait_queue_t q_waitqueue_t
#define ENA_WAIT_EVENT_WAIT(waitevent, timeout)                         \
        do {                                                            \
                struct timespec wait;                                   \
                struct timeval now;                                     \
                unsigned long timeout_us;                               \
                gettimeofday(&now, NULL);                               \
                wait.tv_sec = now.tv_sec + timeout / 1000000UL;         \
                timeout_us = timeout % 1000000UL;                       \
                wait.tv_nsec = (now.tv_usec + timeout_us) * 1000UL;     \
                pthread_mutex_lock(&waitevent.mutex);                   \
                pthread_cond_timedwait(&waitevent.cond,                 \
                                &waitevent.mutex, &wait);               \
                pthread_mutex_unlock(&waitevent.mutex);                 \
        } while (0)
#define ENA_WAIT_EVENT_SIGNAL(waitevent) pthread_cond_signal(&waitevent.cond)

This definitely doesn't follow the textbook condition-variable pattern!

We suspect that in extremely rare cases, ENA_WAIT_EVENT_SIGNAL is executed before ENA_WAIT_EVENT_WAIT (the device being fast, the main thread being slow). Since pthread_cond_signal has no effect when nobody is waiting yet, and the wait has no predicate to re-check, the wake-up is lost and the waiter blocks for the full completion_timeout.

We implemented a change similar to commit 072b9f2bbc2 in http://dpdk.org/git/dpdk, and have been running for 48 hours without issues.
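
For anyone hitting this on an older DPDK: as we understand it, the fix follows the textbook pattern, where the signaler records the event in a flag under the mutex and the waiter re-checks that flag in a loop, so a signal that arrives before the wait (or a spurious wakeup) is not lost. A minimal sketch of that pattern (illustrative only; the struct and function names here are ours, not the exact upstream code):

/* Illustrative sketch of the predicate-based wait/signal pattern.
 * Names are ours; see DPDK commit 072b9f2bbc2 for the real fix. */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

struct wait_event {
        pthread_cond_t  cond;
        pthread_mutex_t mutex;
        bool            signaled;   /* the predicate, set by the signaler */
};

static void wait_event_signal(struct wait_event *ev)
{
        pthread_mutex_lock(&ev->mutex);
        ev->signaled = true;              /* recorded even if nobody is waiting yet */
        pthread_cond_signal(&ev->cond);
        pthread_mutex_unlock(&ev->mutex);
}

static void wait_event_wait(struct wait_event *ev, unsigned long timeout_us)
{
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec  += timeout_us / 1000000UL;
        deadline.tv_nsec += (timeout_us % 1000000UL) * 1000UL;
        if (deadline.tv_nsec >= 1000000000L) {          /* normalize */
                deadline.tv_sec  += 1;
                deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&ev->mutex);
        while (!ev->signaled) {                         /* re-check the predicate */
                if (pthread_cond_timedwait(&ev->cond, &ev->mutex, &deadline) == ETIMEDOUT)
                        break;                          /* genuine timeout */
        }
        ev->signaled = false;                           /* consume the event */
        pthread_mutex_unlock(&ev->mutex);
}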

shaibran commented 2 years ago

Thanks @mejedi and @mpastyl for taking the time to assist with this investigation. We are reviewing this and will update shortly

shaibran commented 2 years ago

@mejedi @mpastyl @hhaim Thank you again. Indeed, commit 072b9f2bbc2402a8c86194fe9e11458c1605540a "net/ena: handle spurious wakeups in wait event" resolves the scenario where the interrupt arrives before ENA_WAIT_EVENT_WAIT, and it was not part of DPDK v21.02 (it was included only in 21.05). We are preparing a troubleshooting document that will also include a recommendation to update the ENA driver or cherry-pick this specific commit.

oicnysa commented 5 hours ago

Hi guys, we are running quite an old version of DPDK and experiencing the same problem. The question is: do we need to update only the DPDK ENA driver, the whole DPDK version, or both?