awslabs / aws-crt-java

Java bindings for the AWS Common Runtime
Apache License 2.0

Fatal error when trying to connect and no connection available #215

Closed MMaiero closed 4 years ago

MMaiero commented 4 years ago

Hello, we are using the aws-crt library together with aws-iot-device-sdk-java-v2 in our Java-based connector for AWS IoT Core. We have noticed that if the device cannot reach the Internet and we try to connect to AWS IoT Core, the JVM is restarted. Looking at the reported errors, the following can be seen:

Fatal error condition occurred in /work/aws-common-runtime/aws-c-common/source/allocator.c:166: allocator != ((void *)0)
Exiting Application
################################################################################
Resolved stacktrace:
################################################################################
################################################################################
Raw stacktrace:
################################################################################

This happens with our current bundle (aws-crt-java 0.5.8 and aws-iot-device-sdk-java-v2 1.1.1) and also with the latest available combination (0.6.5 and 1.2.5).

Note that the JVM also reports this:

OpenJDK Client VM warning: You have loaded library /tmp/AWSCRT_15959522002727324888219554217087libaws-crt-jni.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
bretambrose commented 4 years ago

What platform is this on? Is it reproducible just by running a sample with the network adapter disabled?

MMaiero commented 4 years ago

I ran it on an ARM device. I have not had a chance to test a sample yet.

MMaiero commented 4 years ago

I can recreate the issue with the following code:

/**
 * Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 * SPDX-License-Identifier: Apache-2.0.
 */

package pubsub;

import software.amazon.awssdk.crt.CRT;
import software.amazon.awssdk.crt.CrtRuntimeException;
import software.amazon.awssdk.crt.auth.credentials.X509CredentialsProvider;
import software.amazon.awssdk.crt.http.HttpProxyOptions;
import software.amazon.awssdk.crt.io.ClientBootstrap;
import software.amazon.awssdk.crt.io.ClientTlsContext;
import software.amazon.awssdk.crt.io.EventLoopGroup;
import software.amazon.awssdk.crt.io.HostResolver;
import software.amazon.awssdk.crt.io.TlsContextOptions;
import software.amazon.awssdk.crt.mqtt.MqttClientConnection;
import software.amazon.awssdk.crt.mqtt.MqttClientConnectionEvents;
import software.amazon.awssdk.crt.mqtt.MqttMessage;
import software.amazon.awssdk.crt.mqtt.QualityOfService;
import software.amazon.awssdk.iot.AwsIotMqttConnectionBuilder;
import software.amazon.awssdk.iot.iotjobs.model.RejectedError;

import java.io.UnsupportedEncodingException;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class PubSub {
    static String clientId = "test-" + UUID.randomUUID().toString();
    static String rootCaPath;
    static String certPath;
    static String keyPath;
    static String endpoint;
    static String topic = "test/topic";
    static String message = "Hello World!";
    static int    messagesToPublish = 10;
    static boolean showHelp = false;
    static int port = 8883;

    static String proxyHost;
    static int proxyPort;
    static String region = "us-east-1";
    static boolean useWebsockets = false;
    static boolean useX509Credentials = false;
    static String x509RoleAlias;
    static String x509Endpoint;
    static String x509Thing;
    static String x509CertPath;
    static String x509KeyPath;
    static String x509RootCaPath;

    static void printUsage() {
        System.out.println(
                "Usage:\n"+
                "  --help            This message\n"+
                "  --clientId        Client ID to use when connecting (optional)\n"+
                "  -e|--endpoint     AWS IoT service endpoint hostname\n"+
                "  -p|--port         Port to connect to on the endpoint\n"+
                "  -r|--rootca       Path to the root certificate\n"+
                "  -c|--cert         Path to the IoT thing certificate\n"+
                "  -k|--key          Path to the IoT thing private key\n"+
                "  -t|--topic        Topic to subscribe/publish to (optional)\n"+
                "  -m|--message      Message to publish (optional)\n"+
                "  -n|--count        Number of messages to publish (optional)\n" +
                "  -w|--websockets   Use websockets\n" +
                "  --proxyhost       Websocket proxy host to use\n" +
                "  --proxyport       Websocket proxy port to use\n" +
                "  --region          Websocket signing region to use\n" +
                "  --x509            Use the x509 credentials provider while using websockets\n" +
                "  --x509rolealias   Role alias to use with the x509 credentials provider\n" +
                "  --x509endpoint    Endpoint to fetch x509 credentials from\n" +
                "  --x509thing       Thing name to fetch x509 credentials on behalf of\n" +
                "  --x509cert        Path to the IoT thing certificate used in fetching x509 credentials\n" +
                "  --x509key         Path to the IoT thing private key used in fetching x509 credentials\n" +
                "  --x509rootca      Path to the root certificate used in fetching x509 credentials\n"
        );
    }

    static void parseCommandLine(String[] args) {
        for (int idx = 0; idx < args.length; ++idx) {
            switch (args[idx]) {
                case "--help":
                    showHelp = true;
                    break;
                case "--clientId":
                    if (idx + 1 < args.length) {
                        clientId = args[++idx];
                    }
                    break;
                case "-e":
                case "--endpoint":
                    if (idx + 1 < args.length) {
                        endpoint = args[++idx];
                    }
                    break;
                case "-p":
                case "--port":
                    if (idx + 1 < args.length) {
                        port = Integer.parseInt(args[++idx]);
                    }
                    break;
                case "-r":
                case "--rootca":
                    if (idx + 1 < args.length) {
                        rootCaPath = args[++idx];
                    }
                    break;
                case "-c":
                case "--cert":
                    if (idx + 1 < args.length) {
                        certPath = args[++idx];
                    }
                    break;
                case "-k":
                case "--key":
                    if (idx + 1 < args.length) {
                        keyPath = args[++idx];
                    }
                    break;
                case "-t":
                case "--topic":
                    if (idx + 1 < args.length) {
                        topic = args[++idx];
                    }
                    break;
                case "-m":
                case "--message":
                    if (idx + 1 < args.length) {
                        message = args[++idx];
                    }
                    break;
                case "-n":
                case "--count":
                    if (idx + 1 < args.length) {
                        messagesToPublish = Integer.parseInt(args[++idx]);
                    }
                    break;
                case "-w":
                    useWebsockets = true;
                    break;
                case "--x509":
                    useX509Credentials = true;
                    useWebsockets = true;
                    break;
                case "--x509rolealias":
                    if (idx + 1 < args.length) {
                        x509RoleAlias = args[++idx];
                    }
                    break;
                case "--x509endpoint":
                    if (idx + 1 < args.length) {
                        x509Endpoint = args[++idx];
                    }
                    break;
                case "--x509thing":
                    if (idx + 1 < args.length) {
                        x509Thing = args[++idx];
                    }
                    break;
                case "--x509cert":
                    if (idx + 1 < args.length) {
                        x509CertPath = args[++idx];
                    }
                    break;
                case "--x509key":
                    if (idx + 1 < args.length) {
                        x509KeyPath = args[++idx];
                    }
                    break;
                case "--x509rootca":
                    if (idx + 1 < args.length) {
                        x509RootCaPath = args[++idx];
                    }
                    break;
                case "--proxyhost":
                    if (idx + 1 < args.length) {
                        proxyHost = args[++idx];
                    }
                    break;
                case "--proxyport":
                    if (idx + 1 < args.length) {
                        proxyPort = Integer.parseInt(args[++idx]);
                    }
                    break;
                case "--region":
                    if (idx + 1 < args.length) {
                        region = args[++idx];
                    }
                    break;
                default:
                    System.out.println("Unrecognized argument: " + args[idx]);
            }
        }
    }

    static void onRejectedError(RejectedError error) {
        System.out.println("Request rejected: " + error.code.toString() + ": " + error.message);
    }

    public static void main(String[] args) {
        parseCommandLine(args);
        if (showHelp || endpoint == null) {
            printUsage();
            return;
        }

        if (!useWebsockets) {
            if (certPath == null || keyPath == null) {
                printUsage();
                return;
            }
        } else if (useX509Credentials) {
            if (x509RoleAlias == null || x509Endpoint == null || x509Thing == null || x509CertPath == null || x509KeyPath == null) {
                printUsage();
                return;
            }
        }

        int retries = 100;
        int i = 0;
        while (i < retries){
            System.out.println("Retry number: "+ i);

            i++;

            MqttClientConnectionEvents callbacks = new MqttClientConnectionEvents() {
                @Override
                public void onConnectionInterrupted(int errorCode) {
                    if (errorCode != 0) {
                        System.out.println("Connection interrupted: " + errorCode + ": " + CRT.awsErrorString(errorCode));
                    }
                }

                @Override
                public void onConnectionResumed(boolean sessionPresent) {
                    System.out.println("Connection resumed: " + (sessionPresent ? "existing session" : "clean session"));
                }
            };

            try(EventLoopGroup eventLoopGroup = new EventLoopGroup(1);
                HostResolver resolver = new HostResolver(eventLoopGroup);
                ClientBootstrap clientBootstrap = new ClientBootstrap(eventLoopGroup, resolver);
                AwsIotMqttConnectionBuilder builder = AwsIotMqttConnectionBuilder.newMtlsBuilderFromPath(certPath, keyPath)) {

                if (rootCaPath != null) {
                    builder.withCertificateAuthorityFromPath(null, rootCaPath);
                }

                builder.withBootstrap(clientBootstrap)
                    .withConnectionEventCallbacks(callbacks)
                    .withClientId(clientId)
                    .withEndpoint(endpoint)
                    .withCleanSession(true);

                if (useWebsockets) {
                    builder.withWebsockets(true);
                    builder.withWebsocketSigningRegion(region);

                    HttpProxyOptions proxyOptions = null;
                    if (proxyHost != null && proxyPort > 0) {
                        proxyOptions = new HttpProxyOptions();
                        proxyOptions.setHost(proxyHost);
                        proxyOptions.setPort(proxyPort);

                        builder.withWebsocketProxyOptions(proxyOptions);
                    }

                    if (useX509Credentials) {
                        try (TlsContextOptions x509TlsOptions = TlsContextOptions.createWithMtlsFromPath(x509CertPath, x509KeyPath)) {
                            if (x509RootCaPath != null) {
                                x509TlsOptions.withCertificateAuthorityFromPath(null, x509RootCaPath);
                            }

                            try (ClientTlsContext x509TlsContext = new ClientTlsContext(x509TlsOptions)) {
                                X509CredentialsProvider.X509CredentialsProviderBuilder x509builder = new X509CredentialsProvider.X509CredentialsProviderBuilder()
                                        .withClientBootstrap(clientBootstrap)
                                        .withTlsContext(x509TlsContext)
                                        .withEndpoint(x509Endpoint)
                                        .withRoleAlias(x509RoleAlias)
                                        .withThingName(x509Thing)
                                        .withProxyOptions(proxyOptions);
                                try (X509CredentialsProvider provider = x509builder.build()) {
                                    builder.withWebsocketCredentialsProvider(provider);
                                }
                            }
                        }
                    }
                }

                try(MqttClientConnection connection = builder.build()) {

                    CompletableFuture<Boolean> connected = connection.connect();
                    try {
                        boolean sessionPresent = connected.get();
                        System.out.println("Connected to " + (!sessionPresent ? "new" : "existing") + " session!");
                    } catch (Exception ex) {
                        System.out.println("Exception occurred during connect: " + ex.getMessage());
                        continue;
                    }

                    CompletableFuture<Integer> subscribed = connection.subscribe(topic, QualityOfService.AT_LEAST_ONCE, (message) -> {
                        try {
                            String payload = new String(message.getPayload(), "UTF-8");
                            System.out.println("MESSAGE: " + payload);
                        } catch (UnsupportedEncodingException ex) {
                            System.out.println("Unable to decode payload: " + ex.getMessage());
                        }
                    });

                    subscribed.get();

                    int count = 0;
                    while (count++ < messagesToPublish) {
                        CompletableFuture<Integer> published = connection.publish(new MqttMessage(topic, message.getBytes()), QualityOfService.AT_LEAST_ONCE, false);
                        published.get();
                        Thread.sleep(1000);
                    }

                    CompletableFuture<Void> disconnected = connection.disconnect();
                    disconnected.get();
                }
            } catch (CrtRuntimeException | InterruptedException | ExecutionException ex) {
                System.out.println("Exception encountered: " + ex.toString());
            }
        }

        System.out.println("Complete!");
    }
}
MMaiero commented 4 years ago

Here is the standard output:

Retry number: 0
OpenJDK Client VM warning: You have loaded library /tmp/AWSCRT_15960183608311196483729881794089libaws-crt-jni.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 1
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 2
Fatal error condition occurred in /work/aws-common-runtime/aws-c-common/source/allocator.c:166: allocator != ((void *)0)
Exiting Application
################################################################################
Resolved stacktrace:
################################################################################
################################################################################
Raw stacktrace:
################################################################################
Aborted
bretambrose commented 4 years ago

Can you attach a trace log of a failing run? You can get a log by adding

-Daws.crt.log.level=Trace -Daws.crt.log.destination=File -Daws.crt.log.filename=/tmp/log.txt

to the command line of the sample run.
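
For anyone launching the sample from another Java process rather than from a shell, the flags above could be assembled roughly as in this minimal sketch. The class and method names (CrtTraceLaunch, command) are hypothetical and not part of the CRT or the SDK; only the three -D properties come from this thread. The detail it illustrates is ordering: JVM options have to appear before the main class, otherwise they are handed to the application as ordinary arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of aws-crt-java: builds a `java` command line
// that enables CRT trace logging to a file.
final class CrtTraceLaunch {
    static List<String> command(String classpath, String mainClass,
                                String logFile, List<String> appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        // JVM options go first, before -cp and the main class.
        cmd.add("-Daws.crt.log.level=Trace");
        cmd.add("-Daws.crt.log.destination=File");
        cmd.add("-Daws.crt.log.filename=" + logFile);
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        // Everything after the main class is passed to the application.
        cmd.addAll(appArgs);
        return cmd;
    }
}
```

The resulting list can be passed to new ProcessBuilder(cmd).inheritIO().start() to run the sample as a child process.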

Given arm, I'm guessing this is linux-based; what distribution?

MMaiero commented 4 years ago

It's a Yocto-based Linux distribution. I have attached the log as requested: log.txt

Interestingly, the failure happens only when the reported exception is: Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.

If instead the reported error is: Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: socket connect failure, no route to host., the code works without issues.
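
Since the crash shows up only after a DNS resolution failure, one possible stopgap is to check with the JDK's own resolver whether the endpoint resolves before handing it to the CRT. This is a sketch under assumptions, not a fix: the class name DnsPreCheck is hypothetical, it has not been tested against this crash, and the JDK resolver and the CRT host resolver are independent, so a lookup can still fail inside the CRT right after this check succeeds.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical pre-check, not part of the SDK: asks the JDK resolver whether
// a host name currently resolves.
final class DnsPreCheck {
    static boolean canResolve(String host) {
        try {
            // Literal IP addresses are only validated; names trigger a lookup.
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException ex) {
            return false;
        }
    }
}
```

In the reproduction loop, the connect attempt could be skipped (with a sleep) while DnsPreCheck.canResolve(endpoint) returns false.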

bretambrose commented 4 years ago

I have not been able to repro it so far; neither disabling the adapter nor unplugging it from the wall let me trigger a DNS resolution failure.

I think the best bet to track it down is with a debug build of the CRT, which I can help you get set up, and then we run your test application via gdb and hopefully get a nice stack trace of the crash. Normally we would be the ones doing this but given my current failure to repro and the unusual distribution/hardware combo, that is probably our best bet. Is that something you're interested/willing to work through?

MMaiero commented 4 years ago

sure, no problem.

bretambrose commented 4 years ago

The first step would be to get a debug build of the CRT.

Clone https://github.com/awslabs/aws-crt-java and follow the build instructions for linux, but before doing so, make the following change to the top-level pom.xml file:

Change line 36 from

    <cmake.buildtype>RelWithDebInfo</cmake.buildtype>

to

    <cmake.buildtype>Debug</cmake.buildtype>
MMaiero commented 4 years ago

I first tried to build on a Raspberry Pi, but I get the following error:

CMake Error at /usr/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find LibCrypto (missing: LibCrypto_LIBRARY LibCrypto_INCLUDE_DIR)
Call Stack (most recent call first):
  /usr/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  aws-common-runtime/s2n/cmake/modules/FindLibCrypto.cmake:61 (find_package_handle_standard_args)
  aws-common-runtime/s2n/CMakeLists.txt:226 (find_package)

Any clue?

bretambrose commented 4 years ago

Install libssl-dev; that should give you a static libcrypto.

MMaiero commented 4 years ago

Ok, so, this is the execution result from standard output:

Retry number: 0
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 1
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 2
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000012, pid=9508, tid=0xa29e3460
#
# JRE version: OpenJDK Runtime Environment (8.0_202-b152) (build 1.8.0_202-b152)
# Java VM: OpenJDK Client VM (25.202-b152 mixed mode linux-aarch32 )
# Problematic frame:
# C  0x00000012
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid9508.log
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 3
[thread -1583917984 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted

And the log files produced: hs_err_pid9508.log log.txt
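
When comparing several runs, the fault line in the JVM banner can be extracted mechanically. The sketch below is only an illustration: the class name HsErrHeader is hypothetical, and the pattern is an assumption derived from the single line format shown in the output above, so other JVM builds may print it differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls signal, pc and pid out of a line such as
// "#  SIGSEGV (0xb) at pc=0x00000012, pid=9508, tid=0xa29e3460".
final class HsErrHeader {
    final String signal;
    final String pc;
    final long pid;

    private HsErrHeader(String signal, String pc, long pid) {
        this.signal = signal;
        this.pc = pc;
        this.pid = pid;
    }

    // Assumed format, taken from the crash banner quoted in this thread.
    private static final Pattern FAULT = Pattern.compile(
            "^#\\s+(SIG[A-Z0-9]+) \\(0x[0-9a-fA-F]+\\) at pc=(0x[0-9a-fA-F]+), pid=(\\d+), tid=\\S+");

    static Optional<HsErrHeader> parse(String line) {
        Matcher m = FAULT.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new HsErrHeader(m.group(1), m.group(2), Long.parseLong(m.group(3))));
    }
}
```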

bretambrose commented 4 years ago

Just to make sure: did you maven install the resulting crt and point the v2 sdk at the installed version (1.0.0-SNAPSHOT)? You would also need to rebuild the v2 sdk, install it, and point your sample at the locally built sdk (also 1.0.0-SNAPSHOT). Otherwise you'll still be using the maven packages in release mode.

Full steps

In the crt directory:

  1. git clone
  2. git submodule update --init
  3. edit the pom.xml to build Debug when running cmake
  4. mvn compile
  5. mvn install -DskipTests=true

In the sdk directory:

  1. git clone
  2. edit the sdk's pom.xml to use crt version 1.0.0-SNAPSHOT
  3. mvn compile
  4. mvn install

In your application's directory:

  1. edit the pom.xml to use sdk version 1.0.0-SNAPSHOT
  2. mvn compile

Assuming you've switched over everything properly, the next step is to run your app from gdb and wait for the crash.

Approximate instructions

  1. gdb
  2. target exec [program-name]
  3. set args [arguments]
  4. run

You may also need to run this at the gdb prompt:

handle SIGSEGV nostop noprint

as the jvm will generate a lot of spurious seg fault signals during the normal course of execution.

I don't think this will work if you're running via maven. In that case, you'll need to figure out the low-level 'java' command line that it invokes for you first.
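
One way to recover that low-level command line is to have the running JVM report it. The following is a rough sketch using only the standard java.lang.management API; the class name LaunchInfo is hypothetical, the sun.java.command property is HotSpot-specific and may be absent on other JVMs, and arguments containing spaces are not re-quoted, so treat the output as an approximation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

// Hypothetical helper: approximates the `java` command line of the current JVM.
final class LaunchInfo {
    static String reconstructCommandLine() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        StringBuilder cmd = new StringBuilder("java");
        // JVM options such as -D and -X flags.
        for (String jvmArg : runtime.getInputArguments()) {
            cmd.append(' ').append(jvmArg);
        }
        cmd.append(" -cp ").append(runtime.getClassPath());
        // Main class plus application arguments; HotSpot-specific, may be null.
        String mainAndArgs = System.getProperty("sun.java.command");
        if (mainAndArgs != null) {
            cmd.append(' ').append(mainAndArgs);
        }
        return cmd.toString();
    }

    // Commonly of the form "pid@hostname", but the format is not guaranteed.
    static String jvmName() {
        return ManagementFactory.getRuntimeMXBean().getName();
    }
}
```

Printing LaunchInfo.reconstructCommandLine() at the start of main while running under maven gives a command to reuse from gdb, and LaunchInfo.jvmName() usually reveals the process id to attach to.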

MMaiero commented 4 years ago

I was running my example like this:

java -cp BasicPubSub-1.0-SNAPSHOT.jar:aws-crt.jar:aws-iot-device-sdk.jar pubsub/PubSub -e endpoint.iot.us-east-1.amazonaws.com -p 8883 -r /tmp/awsRootCA.cert -c /tmp/device-certificate.pem.crt -k /tmp/device-Key.pem

where aws-crt is the one built with debug enabled and aws-iot-device-sdk was at version 1.2.5.

I'll redo the check following your suggestions.

MMaiero commented 4 years ago

Hello, I've tried to run on the current master for both the sdk and the crt lib and I see the same effect but with a slightly different exception:

GNU gdb (GDB) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "arm-poky-linux-gnueabi".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 23322
[New LWP 23323]
[New LWP 23324]
[New LWP 23325]
[New LWP 23326]
[New LWP 23327]
[New LWP 23328]
[New LWP 23329]
[New LWP 23330]

warning: File "/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
    add-auto-load-safe-path /lib/libthread_db-1.0.so
line to your configuration file "/home/root/.gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/home/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path"

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: File "/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
0xb6f48954 in ?? () from /lib/libpthread.so.0
(gdb) handle SIGSEGV nostop noprint pass
Signal        Stop  Print   Pass to program Description
SIGSEGV       No    No  Yes     Segmentation fault
(gdb) cont
Continuing.

Thread 2 "java" received signal SIGILL, Illegal instruction.
[Switching to LWP 23323]
0xa3b44728 in _armv7_tick () from /tmp/AWSCRT_15965367363287590319426680121059libaws-crt-jni.so
(gdb) backtrace
#0  0xa3b44728 in _armv7_tick () from /tmp/AWSCRT_15965367363287590319426680121059libaws-crt-jni.so
#1  0xa399a18c in OPENSSL_cpuid_setup () from /tmp/AWSCRT_15965367363287590319426680121059libaws-crt-jni.so
#2  0xb6f79c68 in ?? () from /lib/ld-linux-armhf.so.3
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) continue
Continuing.
[New LWP 473]
[New LWP 474]
[New LWP 475]
[New LWP 476]
[LWP 473 exited]
[LWP 476 exited]
[New LWP 477]
[New LWP 478]
[New LWP 479]
[LWP 479 exited]
[LWP 475 exited]

Thread 11 "java" received signal SIGABRT, Aborted.
[Switching to LWP 474]
0xb6df83b8 in raise () from /lib/libc.so.6
(gdb) backtrace
#0  0xb6df83b8 in raise () from /lib/libc.so.6
#1  0xb6de4208 in abort () from /lib/libc.so.6
#2  0xb6bc53e8 in ?? () from /usr/bin/zulu8.36.0.152-sa-jre1.8.0_202-linux_aarch32hf/lib/aarch32/client/libjvm.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) quit
A debugging session is active.

    Inferior 1 [process 23322] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/bin/zulu8.36.0.152-sa-jre1.8.0_202-linux_aarch32hf/bin/java, process 23322
[Inferior 1 (process 23322) detached]

and the following associated standard output:

Retry number: 0
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 1
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000012, pid=23322, tid=0xa3119460
#
# JRE version: OpenJDK Runtime Environment (8.0_202-b152) (build 1.8.0_202-b152)
# Java VM: OpenJDK Client VM (25.202-b152 mixed mode linux-aarch32 )
# Problematic frame:
# C  0x00000012
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid23322.log
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 2
[thread -1575914400 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted

JVM launched with the following command: java -cp BasicPubSub-1.0-SNAPSHOT.jar:aws-crt-1.0.0-SNAPSHOT.jar:aws-iot-device-sdk-1.0.0-SNAPSHOT.jar pubsub/PubSub -e endpoint.amazonaws.com -p 8883 -r /tmp/awsRootCA.cert -c /tmp/certificate.pem.crt -k /tmp/outKey.pem

Running the same test with the crt version we currently use (0.5.8), this is what I get:

GNU gdb (GDB) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "arm-poky-linux-gnueabi".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 3072
[New LWP 3073]
[New LWP 3105]
[New LWP 3107]
[New LWP 3108]
[New LWP 3128]
[New LWP 3130]
[New LWP 3136]
[New LWP 3137]

warning: File "/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
    add-auto-load-safe-path /lib/libthread_db-1.0.so
line to your configuration file "/home/root/.gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/home/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path"

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: File "/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
0xb6f43954 in ?? () from /lib/libpthread.so.0
(gdb) handle SIGSEGV nostop noprint pass
Signal        Stop  Print   Pass to program Description
SIGSEGV       No    No  Yes     Segmentation fault
(gdb) cont
Continuing.
[New LWP 3156]
[New LWP 3157]
[LWP 3157 exited]
[New LWP 3158]
[New LWP 3159]
[LWP 3156 exited]
[LWP 3158 exited]
[New LWP 3160]
[New LWP 3161]
[New LWP 3162]
[LWP 3162 exited]
[LWP 3159 exited]
[New LWP 3163]
[LWP 3163 exited]
[New LWP 3164]
[New LWP 3165]
[LWP 3161 exited]
[LWP 3164 exited]

Thread 14 "java" received signal SIGABRT, Aborted.
[Switching to LWP 3160]
0xb6df33b8 in raise () from /lib/libc.so.6
(gdb) backtrace
#0  0xb6df33b8 in raise () from /lib/libc.so.6
#1  0xb6ddf208 in abort () from /lib/libc.so.6
#2  0xa3c07b00 in aws_fatal_assert () from /tmp/AWSCRT_15965427092165300894669209383845libaws-crt-jni.so
#3  0xa3c07434 in aws_mem_release () from /tmp/AWSCRT_15965427092165300894669209383845libaws-crt-jni.so
#4  0xa3a65a84 in resolver_thread_fn () from /tmp/AWSCRT_15965427092165300894669209383845libaws-crt-jni.so
#5  0xa3c11338 in thread_fn () from /tmp/AWSCRT_15965427092165300894669209383845libaws-crt-jni.so
#6  0xb6f423c0 in ?? () from /lib/libpthread.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

with the following standard out:

Retry number: 0
OpenJDK Client VM warning: You have loaded library /tmp/AWSCRT_15965427092165300894669209383845libaws-crt-jni.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 1
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
Retry number: 2
Fatal error condition occurred in /work/aws-common-runtime/aws-c-common/source/allocator.c:166: allocator != ((void *)0)
Exiting Application
Exception occurred during connect: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
################################################################################
Resolved stacktrace:
################################################################################
################################################################################
Raw stacktrace:
################################################################################
Retry number: 3

Hope this helps

bretambrose commented 4 years ago

The second crash looks promising. Can you do a 'frame 4' (resolver_thread_fn) and then a 'list' to get the line?

MMaiero commented 4 years ago
(gdb) backtrace
#0  0xb6de83b8 in raise () from /lib/libc.so.6
#1  0xb6dd4208 in abort () from /lib/libc.so.6
#2  0xa3c07b00 in aws_fatal_assert () from /tmp/AWSCRT_15966106053523618899268721628563libaws-crt-jni.so
#3  0xa3c07434 in aws_mem_release () from /tmp/AWSCRT_15966106053523618899268721628563libaws-crt-jni.so
#4  0xa3a65a84 in resolver_thread_fn () from /tmp/AWSCRT_15966106053523618899268721628563libaws-crt-jni.so
#5  0xa3c11338 in thread_fn () from /tmp/AWSCRT_15966106053523618899268721628563libaws-crt-jni.so
#6  0xb6f373c0 in ?? () from /lib/libpthread.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) frame 4
#4  0xa3a65a84 in resolver_thread_fn () from /tmp/AWSCRT_15966106053523618899268721628563libaws-crt-jni.so
(gdb) list
1   crypto/asn1/a_strex.c: No such file or directory.
(gdb) info f 4
Stack frame at 0xa318ee40:
 pc = 0xa3a65a84 in resolver_thread_fn; saved pc = 0xa3c11338
 called by frame at 0xa318ee70, caller of frame at 0xa318ed28
 Arglist at 0xa318ed28, args:
 Locals at 0xa318ed28, Previous frame's sp is 0xa318ee40
 Saved registers:
  r4 at 0xa318ee1c, r5 at 0xa318ee20, r6 at 0xa318ee24, r7 at 0xa318ee28, r8 at 0xa318ee2c, r9 at 0xa318ee30, r10 at 0xa318ee34, r11 at 0xa318ee38, lr at 0xa318ee3c

bretambrose commented 4 years ago

I was hoping this would give us a good clue, but to be honest, it doesn't really make sense. That being said, I am aware of some potential race conditions with host resolver shutdown and so conceivably this may be an instance of that (but the fact that it happens every single time for you on this platform is kind of weird). I am in the process of reworking some of our networking primitives to have a cleaner, safer (once debugged) asynchronous shutdown flow, and the host resolver (which is the primary culprit here) is a big part of that. I don't have an estimate when it will be done (possibly a week at best) and of course I also am just speculating that it might fix this crash. In the absence of root-causing this crash, I will update this issue when the new ref count and asynchronous shutdown framework is live so that you can re-test if still interested.
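
To illustrate the general idea behind a ref-counted, asynchronous shutdown flow like the one mentioned above, here is a minimal Java sketch. It is not the CRT's implementation (which is native C code); the class name `RefCountedResolver` and its methods are hypothetical and exist only for this example. The point it shows is that cleanup runs only when the last holder releases its reference, so a background thread that is still working cannot have its resources freed underneath it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; not part of aws-crt-java. */
final class RefCountedResolver {
    // Starts at 1: the reference held by the owner that created the object.
    private final AtomicInteger refCount = new AtomicInteger(1);
    // Completed once the last reference is released and cleanup has run.
    private final CompletableFuture<Void> shutdownComplete = new CompletableFuture<>();

    /** Takes a reference; returns false if the object is already fully released. */
    boolean acquire() {
        while (true) {
            int current = refCount.get();
            if (current == 0) {
                return false; // too late, cleanup has already started
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Drops a reference; the caller that drops the last one triggers cleanup. */
    void release() {
        while (true) {
            int current = refCount.get();
            if (current == 0) {
                throw new IllegalStateException("released more often than acquired");
            }
            if (refCount.compareAndSet(current, current - 1)) {
                if (current == 1) {
                    // Real code would free resources here, exactly once.
                    shutdownComplete.complete(null);
                }
                return;
            }
        }
    }

    /** Lets callers wait for shutdown without blocking the releasing thread. */
    CompletableFuture<Void> shutdownComplete() {
        return shutdownComplete;
    }

    public static void main(String[] args) throws Exception {
        RefCountedResolver resolver = new RefCountedResolver();
        resolver.acquire(); // reference handed to the background worker
        Thread worker = new Thread(() -> {
            try {
                // Simulated in-flight lookup would happen here.
            } finally {
                resolver.release(); // worker is done with the resolver
            }
        });
        resolver.release(); // owner requests shutdown; worker still holds a reference
        worker.start();
        resolver.shutdownComplete().get(5, TimeUnit.SECONDS);
        worker.join();
        System.out.println("shutdown complete");
    }
}
```

In this sketch the owner's `release()` does not tear anything down by itself; teardown is deferred until the worker also releases, which is the property that avoids a use-after-release race during shutdown.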

JonathanHenson commented 4 years ago

One thing you can do to try and reproduce the crash deterministically is to have the host resolver return an empty set of addresses. That would come close to reproducing the conditions described here.

MMaiero commented 4 years ago

Hi, thank you for the feedback. Please let me know when you have something to give a try. I'll be happy to test it in my environment.

MMaiero commented 4 years ago

Hi @bretambrose, do you have any update on this issue?

bretambrose commented 4 years ago

Still in progress. I need to finish updating and testing the JS and Python CRTs and then get everything reviewed. It is theoretically possible to try out the changes yourself by using the 'ref' branch of aws-crt-java, but it might be best to wait until things are finalized.

bretambrose commented 4 years ago

V1.2.8 of the SDK has been published to Maven and should be visible shortly. Please give it a try and let us know whether the stability issues have improved.

MMaiero commented 4 years ago

Hi, thanks for the update. I'm going to test it today and I'll let you know.

MMaiero commented 4 years ago

I can confirm that it seems to fix the issue reported. Thanks for the fix!