Azure / iot-edge-opc-proxy

OPC Proxy Module
34 stars 19 forks source link

Unhandled exception in proxyd from IOCP callback (Windows) #61

Closed JuergenKosel closed 7 years ago

JuergenKosel commented 7 years ago

Hello, proxyd started with just the -D Filename option crashes due to unhandled exception. This is triggered by the following line 123

        log_debug(log, "%d [%s(%p)] due:%u added:%u",
            index, next->name, next->context,
            next->deadline, next->queued);

of file prx_sched.c The local variabels have the following values:


        index   2   int
+       next    0x0000000ccd5ff660 {name=0xcccccccccccccccc <Fehler beim Lesen der Zeichen der Zeichenfolge.> queued=...}   prx_task_entry *
        next->context   0xcccccccccccccccc  void *
        next->deadline  -3689348814741910324    __int64
+       next->name  0xcccccccccccccccc <Fehler beim Lesen der Zeichen der Zeichenfolge.>    const char *
        next->queued    -3689348814741910324    __int64
marcschier commented 7 years ago

This is likely due to the default definition of LOG_VERBOSE on debug. I would recommend to comment the #define LOG_VERBOSE when debug is defined. I have already done that for 1.0.3 release as the debug spew was too much. Nevertheless, it should not crash even if it is defined. Investigating.

marcschier commented 7 years ago

I was able to reproduce this, but the key was to Ctrl-C the proxy at which point this regularly occurred.

I think it is the combination of LOG_VERBOSE and xlogging. Rather than commenting out LOG_VERBOSE (which is probably the right thing), could you try changing the code around line 397 (prx_sched.c) as follows:

    rw_lock_exit_w(scheduler->lock);
#ifdef LOG_VERBOSE
    prx_scheduler_log_queue(scheduler->log, &scheduler->now);
    prx_scheduler_log_queue(scheduler->log, &scheduler->later);
#endif

to

#ifdef LOG_VERBOSE
    prx_scheduler_log_queue(scheduler->log, &scheduler->now);
    prx_scheduler_log_queue(scheduler->log, &scheduler->later);
#endif
    rw_lock_exit_w(scheduler->lock);  // <--- move here

The queue should not be iterated over outside of the lock.

Testing this fixed the crash. Can you try the same - just to double check?

Thx.

JuergenKosel commented 7 years ago

Hello,

your patch fixes the crash. But I still see another issue (also if running the release variant), which leads to gracefull termination of proxyd:

Thu Jul 27 15:38:48 2017 (prx_scheduler_get_next:198) [DEBUG] Task pal_socket_try_close with latency: 15
Thu Jul 27 15:38:48 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from socket-inbound (pool-count: 21, free: 19)
Thu Jul 27 15:38:48 2017 (prx_buffer_pool_free_buffer:288) [DEBUG] Buffer returned to scheduler (pool-count: 128, free: 123)
Thu Jul 27 15:38:48 2017 (prx_scheduler_get_next:198) [DEBUG] Task xio_socket_deliver_inbound_results with latency: 297
Thu Jul 27 15:38:48 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 122)
Thu Jul 27 15:38:48 2017 (pal_os_last_net_error_as_prx_error:164) [INFO] Networking error code 10038 (Ein Vorgang bezog sich auf ein Objekt, das kein Socket ist.)
Thu Jul 27 15:38:48 2017 (xio_socket_on_end_receive:284) [ERROR] Error during pal receive (er_arg).

I have no idea, what triggers this. - But maybe this is really another issue.

marcschier commented 7 years ago

This happens when the IOCP is still in the process of receiving, but the (mqtt) socket was closed on the main scheduler thread as result of exiting. The receive ends and its error code is 10038/er_arg since the socket is not valid anymore. While normally CancelIoEx should have cancelled it I think, it appears that this is another outcome. IMO this should be ok/noise and could be filtered out.

JuergenKosel commented 7 years ago

If this error message is the result of exiting, then I need to find out, why the proxyd decides to terminate.

JuergenKosel commented 7 years ago

I have now the more relevant proxyd output:

=== Azure iot-edge-reverse-proxy 1.0.2. ===

Fri Jul 28 09:24:10 2017 (prx_fixed_pool_grow_no_lock:581) [DEBUG] Growing fixed pool scheduler by 128 (pool-count: 0, free: 0)

Fri Jul 28 09:24:10 2017 (prx_fixed_pool_grow_no_lock:581) [DEBUG] Growing fixed pool scheduler by 128 (pool-count: 0, free: 0)

Fri Jul 28 09:24:10 2017 (prx_fixed_pool_grow_no_lock:581) [DEBUG] Growing fixed pool scheduler by 128 (pool-count: 0, free: 0)

Fri Jul 28 09:24:10 2017 (prx_host_init:1029) [DEBUG] Proxy host created!

Fri Jul 28 09:24:10 2017 (prx_fixed_pool_grow_no_lock:581) [DEBUG] Growing fixed pool umqtt-receive by 10 (pool-count: 0, free: 0)

Fri Jul 28 09:24:10 2017 (prx_dynamic_pool_grow_no_lock:388) [DEBUG] Growing dynamic pool buffers by 11 (pool-count: 0, free: 0)

Fri Jul 28 09:24:10 2017 (prx_dynamic_pool_grow_no_lock:388) [DEBUG] Growing dynamic pool umqtt.connection.send by 11 (pool-count: 0, free: 0)

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 127)

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_mqtt_connection_reconnect(000002A4CCB2FBE0)] (io_mqtt_connection_connect:1679) due:64948546 added:64948546

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 126)

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_mqtt_connection_reconnect(000002A4CCB2FBE0)] (io_mqtt_connection_connect:1679) due:64948546 added:64948546

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 1 [io_iot_hub_umqtt_connection_send_alive_property(000002A4CCB2F350)] (io_iot_hub_umqtt_server_transport_create_connection:588) due:64948546 added:64948546

Fri Jul 28 09:24:10 2017 (pal_poll_event_loop:173) [DEBUG] Clearing event control channel.

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:242) [DEBUG] Entering idle for 4294967295 ms (64948562)

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:242) [DEBUG] Entering idle for 4294967295 ms (64948562)

Fri Jul 28 09:24:10 2017 (prx_scheduler_get_next:198) [DEBUG] Task io_mqtt_connection_reconnect with latency: 16

Fri Jul 28 09:24:10 2017 (prx_dynamic_pool_grow_no_lock:388) [DEBUG] Growing dynamic pool socket-outbound by 11 (pool-count: 0, free: 0)

Fri Jul 28 09:24:10 2017 (prx_dynamic_pool_grow_no_lock:388) [DEBUG] Growing dynamic pool socket-inbound by 11 (pool-count: 0, free: 0)

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 127)

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [pal_socket_open_by_name_begin(000002A4CCB43E40)] (pal_socket_open:1787) due:64948562 added:64948562

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:245) [DEBUG] Exit idle - idled for 0 ms

Fri Jul 28 09:24:10 2017 (prx_scheduler_get_next:198) [DEBUG] Task pal_socket_open_by_name_begin with latency: 0

Fri Jul 28 09:24:10 2017 (pal_getaddrinfo:1180) [INFO] Resolving softingconnectedfactoryf1605.azure-devices.net:8883 (family: 0)...

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 125)

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_iot_hub_umqtt_connection_send_alive_property(000002A4CCB2F350)] (io_iot_hub_umqtt_server_transport_create_connection:588) due:64948546 added:64948546

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_mqtt_connection_monitor(000002A4CCB2FBE0)] (io_mqtt_connection_reconnect:1094) due:64978562 added:64948562

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:230) [DEBUG] Task io_mqtt_connection_reconnect took 0 ms

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_free_buffer:288) [DEBUG] Buffer returned to scheduler (pool-count: 128, free: 126)

Fri Jul 28 09:24:10 2017 (prx_scheduler_get_next:198) [DEBUG] Task io_iot_hub_umqtt_connection_send_alive_property with latency: 16

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 125)

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_mqtt_connection_publish_message(000002A4CCB2FBE0)] (io_mqtt_connection_publish:1489) due:64948562 added:64948562

Fri Jul 28 09:24:10 2017 (pal_net_trace_resolved:36) [DEBUG]   ... softingconnectedfactoryf1605.azure-devices.net:8883 -> 13.95.15.251:8883 (ihsu-prod-am-002.cloudapp.net)

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_mqtt_connection_monitor(000002A4CCB2FBE0)] (io_mqtt_connection_reconnect:1094) due:64978562 added:64948562

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 126)

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 124)

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [pal_socket_open_next_begin(000002A4CCB43E40)] (pal_socket_open_by_name_begin:1723) due:64948625 added:64948625

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_mqtt_connection_publish_message(000002A4CCB2FBE0)] (io_mqtt_connection_publish:1489) due:64948562 added:64948562

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:230) [DEBUG] Task pal_socket_open_by_name_begin took 63 ms

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_iot_hub_umqtt_connection_send_alive_property(000002A4CCB2F350)] (io_iot_hub_umqtt_connection_send_alive_property:235) due:64958625 added:64948625

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_free_buffer:288) [DEBUG] Buffer returned to scheduler (pool-count: 128, free: 127)

Fri Jul 28 09:24:10 2017 (prx_scheduler_log_queue:121) [DEBUG] 1 [io_mqtt_connection_monitor(000002A4CCB2FBE0)] (io_mqtt_connection_reconnect:1094) due:64978562 added:64948562

Fri Jul 28 09:24:10 2017 (prx_scheduler_get_next:198) [DEBUG] Task pal_socket_open_next_begin with latency: 0

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:230) [DEBUG] Task io_iot_hub_umqtt_connection_send_alive_property took 63 ms

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_free_buffer:288) [DEBUG] Buffer returned to scheduler (pool-count: 128, free: 125)

Fri Jul 28 09:24:10 2017 (prx_scheduler_get_next:198) [DEBUG] Task io_mqtt_connection_publish_message with latency: 63

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:230) [DEBUG] Task io_mqtt_connection_publish_message took 0 ms

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_free_buffer:288) [DEBUG] Buffer returned to scheduler (pool-count: 128, free: 126)

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:242) [DEBUG] Entering idle for 10000 ms (64948625)

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:245) [DEBUG] Exit idle - idled for 0 ms

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:242) [DEBUG] Entering idle for 10000 ms (64948625)

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:230) [DEBUG] Task pal_socket_open_next_begin took 0 ms

Fri Jul 28 09:24:10 2017 (prx_buffer_pool_free_buffer:288) [DEBUG] Buffer returned to scheduler (pool-count: 128, free: 128)

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:242) [DEBUG] Entering idle for 4294967295 ms (64948625)

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:245) [DEBUG] Exit idle - idled for 0 ms

Fri Jul 28 09:24:10 2017 (prx_scheduler_work:242) [DEBUG] Entering idle for 4294967295 ms (64948625)

Fri Jul 28 09:24:17 2017 (pal_sdclient_create:857) [ERROR] Failed creating connection to service: -65563 (er_bad_state)

Fri Jul 28 09:24:17 2017 (prx_scheduler_release:476) [DEBUG] Releasing scheduler queue for 000002A4CCB2F520

Fri Jul 28 09:24:17 2017 (prx_scheduler_log_queue:121) [DEBUG] 0 [io_iot_hub_umqtt_connection_send_alive_property(000002A4CCB2F350)] (io_iot_hub_umqtt_connection_send_alive_property:235) due:64958625 added:64948625

Fri Jul 28 09:24:17 2017 (prx_scheduler_log_queue:121) [DEBUG] 1 [io_mqtt_connection_monitor(000002A4CCB2FBE0)] (io_mqtt_connection_reconnect:1094) due:64978562 added:64948562

Fri Jul 28 09:24:17 2017 (prx_buffer_pool_alloc_buffer:251) [DEBUG] Buffer allocated from scheduler (pool-count: 128, free: 125)
marcschier commented 7 years ago

Ah, now I understand, the host exiting is not deliberate.

Fri Jul 28 09:24:17 2017 (pal_sdclient_create:857) [ERROR] Failed creating connection to service: -65563 (er_bad_state)

seems to be the smoking gun. Which makes sense, as 99% of the time the Linux docker container is used which has Avahi embedded, and all remaining Windows testing is now done on machines that have Bonjour installed. Not having Bonjour would probably result in this being an issue. But I need to confirm.

Workaround that you can try while I look into this is to a) build with--use-dnssd No b) install bonjour service on your windows machine (which you would need for LDS-ME also).

JuergenKosel commented 7 years ago

With the build option --use-dnssd No the proxyd works now as it should.

marcschier commented 7 years ago

I need to revise my statement about iocp. There is likely an issue where a iocp callback arrives after the socket is closed and memory freed even though the ov was previously completed, and that worst case can result in a crash. It is likely related to er_arg also since bad memory FD might be passed into recv, not a closed handle. We shall investigate and release a fix. Probably a problem in the submit state machine.