aws / s2n-tls

An implementation of the TLS/SSL protocols
https://aws.github.io/s2n-tls/usage-guide/
Apache License 2.0
4.53k stars 709 forks source link

error open urandom #3445

Open RoyZhang2022 opened 2 years ago

RoyZhang2022 commented 2 years ago

Security issue notifications

If you discover a potential security issue in s2n we ask that you notify AWS Security via our vulnerability reporting page. Please do not create a public github issue.

Problem:

I am using AWS SDK to develop an ODBC driver. Our code is running on github. The test works fine locally on Ubuntu 20.04. It also could run successfully until Aug 9. But from yesterday the test on github starts to fail like below. It is from Aws::InitAPI(). As it is AWS SDK code we have no idea what is wrong. Just from the error info it seems /dev/urandom could not be accessed.

One more thing we noticed is the aws-c-io which calls s2n is updated from 0.10.7 to 0.10.22. Maybe that caused this problem.

--error info 2022-08-12T01:51:39.3781517Z s2n_init() failed: 402653198 (error opening urandom) 2022-08-12T01:51:39.3782986Z Fatal error condition occurred in /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/vcpkg/buildtrees/aws-c-io/src/f2a46b436e-18f356d7b4.clean/source/s2n/s2n_tls_channel_handler.c:187: 0 && "s2n_init() failed"

--call stack 2022-08-12T01:51:39.3864476Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(aws_backtrace_print+0xba) [0x7f0212002fb8] 2022-08-12T01:51:39.3865476Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(aws_fatal_assert+0x5f) [0x7f0211feea7e] 2022-08-12T01:51:39.3866464Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(aws_tls_init_static_state+0x115) [0x7f0211e4d981] 2022-08-12T01:51:39.3867438Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(aws_io_library_init+0x61) [0x7f0211e40f1a] 2022-08-12T01:51:39.3868405Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(aws_mqtt_library_init+0x31) [0x7f0211dc91e3] 2022-08-12T01:51:39.3869332Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(+0xdbce06) [0x7f0211dabe06] 2022-08-12T01:51:39.3870707Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(_ZN3Aws3Crt9ApiHandleC1EP13aws_allocator+0x4e) [0x7f0211dabea6] 2022-08-12T01:51:39.3871805Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(_ZN3Aws3NewINS_3Crt9ApiHandleEJP13aws_allocatorEEEPTPKcDpOT0+0x5a) [0x7f021198d787] 2022-08-12T01:51:39.3872982Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(_ZN3Aws13InitializeCrtEv+0x37) [0x7f021198d3df] 2022-08-12T01:51:39.3873996Z /home/runner/work/amazon-timestream-odbc-driver/amazon-timestream-odbc-driver/src/../build/odbc/lib/libtimestream-odbc.so.2.13.0.58917(_ZN3Aws7InitAPIERKNS_10SDKOptionsE+0x2b) [0x7f0211989470]

A short description of what the problem is and why we need to fix it. Add reproduction steps if necessary. This is an error from aws-c-io and s2n. It is not controlled by outside. Seems a bug here. Please check.

Solution:

A description of the possible solution in terms of S2N architecture. Highlight and explain any potentially controversial design decisions taken.

Requirements / Acceptance Criteria:

What must a solution address in order to solve the problem? How do we know the solution is complete?

Out of scope:

Is there anything the solution will intentionally NOT address?

RoyZhang2022 commented 2 years ago

There is still this problem after I restore aws-c-io to 0.10.7.

franklee26 commented 2 years ago

Hey Roy, were there any other changes to your system that may have affected /dev/urandom? If the error was caused by a code change in s2n-tls, we would expect rolling back to your old version to resolve the error.

Also, if you're using s2n-tls 0.10.7, that's a pre-release version before our 1.0.0 release. Are you using an old version of aws-c-io? Is it possible to upgrade to a version that uses a more recent s2n-tls version (see https://github.com/aws/s2n-tls/releases)?

RoyZhang2022 commented 2 years ago

Hi Frank,

Thanks for your reply! I have tried this example code to github host and /dev/urandom could be accessied without a problem. We are developing application level code and do not touch low level like /dev/urandom. Currently we are unsure how it is introduced. Seems an environment problem, but the urandom test code could run successfully. The aws-c-io is 0.6.20. We also restored it to 0.6.9. But that still could not solve the problem. We have not changed anything. Just from 3 days ago this error emerged and became a blocking issue. Even we restored aws-c-io and s2n-tls to the previous successful one.

--test code https://gist.github.com/Keith-S-Thompson/4704720

Currently the error S2N_ERR_OPEN_RANDOM could be thrown from two functions. One is s2n_rand_init_impl() and the other one is s2n_rand_init(). As the above test code could succeed, we are suspecting the error is from s2n_rand_init(). Do you have a way to identity which line generates this error?

352 static int s2n_rand_init_impl(void)
353 {
354   OPEN:
355     entropy_fd = open(ENTROPY_SOURCE, O_RDONLY);
356     if (entropy_fd == -1) {
357         if (errno == EINTR) {
358             goto OPEN;
359         }
360         POSIX_BAIL(S2N_ERR_OPEN_RANDOM);
361     }
362
363     if (s2n_cpu_supports_rdrand()) {
364        s2n_rand_mix_cb = s2n_rand_rdrand_impl;
365     }
366
367     return S2N_SUCCESS;
368 }
369
370 S2N_RESULT s2n_rand_init(void)
371 {
372     RESULT_GUARD_POSIX(s2n_rand_init_cb());
373
374     RESULT_GUARD(s2n_ensure_initialized_drbgs());
375
376 #if S2N_LIBCRYPTO_SUPPORTS_CUSTOM_RAND
377     /* Create an engine */
378     ENGINE *e = ENGINE_new();
379
380     RESULT_ENSURE(e != NULL, S2N_ERR_OPEN_RANDOM);
381     RESULT_GUARD_OSSL(ENGINE_set_id(e, S2N_RAND_ENGINE_ID), S2N_ERR_OPEN_RANDOM);
382     RESULT_GUARD_OSSL(ENGINE_set_name(e, "s2n entropy generator"), S2N_ERR_OPEN_RANDOM);
383     RESULT_GUARD_OSSL(ENGINE_set_flags(e, ENGINE_FLAGS_NO_REGISTER_ALL), S2N_ERR_OPEN_RANDOM);
384     RESULT_GUARD_OSSL(ENGINE_set_init_function(e, s2n_openssl_compat_init), S2N_ERR_OPEN_RANDOM);
385     RESULT_GUARD_OSSL(ENGINE_set_RAND(e, &s2n_openssl_rand_method), S2N_ERR_OPEN_RANDOM);
386     RESULT_GUARD_OSSL(ENGINE_add(e), S2N_ERR_OPEN_RANDOM);
387     RESULT_GUARD_OSSL(ENGINE_free(e) , S2N_ERR_OPEN_RANDOM);
388
389     /* Use that engine for rand() */
390     e = ENGINE_by_id(S2N_RAND_ENGINE_ID);
391     RESULT_ENSURE(e != NULL, S2N_ERR_OPEN_RANDOM);
392     RESULT_GUARD_OSSL(ENGINE_init(e), S2N_ERR_OPEN_RANDOM);
393     RESULT_GUARD_OSSL(ENGINE_set_default(e, ENGINE_METHOD_RAND), S2N_ERR_OPEN_RANDOM);
394     RESULT_GUARD_OSSL(ENGINE_free(e), S2N_ERR_OPEN_RANDOM);
395 #endif
396
397     return S2N_RESULT_OK;
398 }
RoyZhang2022 commented 2 years ago

I also tried the newest s2n 1.3.19. Still same issue. Seems an environment problem. But I could not understand the test code could succeed, that means /dev/urandom could be accessed and works normally.

lrstewart commented 2 years ago

We have an API that can report the line of code that generated the error: https://github.com/aws/s2n-tls/blob/main/api/s2n.h#L384-L394 Are you able to call that?

And are you calling s2n_init() directly in your code? If so, is it possible you're calling s2n_init() more than once?

RoyZhang2022 commented 2 years ago

We are not calling s2n_init() directly. We just called AWS SDK initAPI as below. All these are done implicitly.

  Aws::InitAPI(options_);
RoyZhang2022 commented 2 years ago

We have two executables which calls Aws::InitAPI(). They are executed sequentially. We have called Aws::ShutdownAPI() before each executable exits. The problem is always seen for the 2nd executable even if I disabled the first executable. Could not understand why. But today I saw it ran successfully. It is the first time for the recent 3 days. But it is only once.

lrstewart commented 2 years ago

Interesting. That could suggest an issue with cleanup.

Just looking at our code, we close urandom in s2n_rand_cleanup, which should get triggered by s2n_cleanup unless s2n_disable_atexit is called.

We should probably check whether the SDK calls s2n_disable_atexit. We probably also need a new s2n-tls test: it looks like the current "repeatedly call s2n_init and s2n_cleanup" test calls s2n_disable_atexit.

camshaft commented 2 years ago

The SDK does call s2n_disable_atexit. In fact, that call actually was added to support the SDK: #3011.

RoyZhang2022 commented 2 years ago

I think it is not related with close urandom as the problem could be seen even when we ran the 2nd executable alone. So for this case there is no urandom use before s2n_init() is called. I think the error may be from the s2n_rand_init() which uses openssl Engine_new().

352 static int s2n_rand_init_impl(void) 353 { 354 OPEN: 355 entropy_fd = open(ENTROPY_SOURCE, O_RDONLY); 356 if (entropy_fd == -1) { 357 if (errno == EINTR) { 358 goto OPEN; 359 } 360 POSIX_BAIL(S2N_ERR_OPEN_RANDOM); 361 } 362 363 if (s2n_cpu_supports_rdrand()) { 364 s2n_rand_mix_cb = s2n_rand_rdrand_impl; 365 } 366 367 return S2N_SUCCESS; 368 } 369 370 S2N_RESULT s2n_rand_init(void) 371 { 372 RESULT_GUARD_POSIX(s2n_rand_init_cb()); 373 374 RESULT_GUARD(s2n_ensure_initialized_drbgs()); 375 376 #if S2N_LIBCRYPTO_SUPPORTS_CUSTOM_RAND 377 / Create an engine / 378 ENGINE e = ENGINE_new(); 379 380 RESULT_ENSURE(e != NULL, S2N_ERR_OPEN_RANDOM); 381 RESULT_GUARD_OSSL(ENGINE_set_id(e, S2N_RAND_ENGINE_ID), S2N_ERR_OPEN_RANDOM); 382 RESULT_GUARD_OSSL(ENGINE_set_name(e, "s2n entropy generator"), S2N_ERR_OPEN_RANDOM); 383 RESULT_GUARD_OSSL(ENGINE_set_flags(e, ENGINE_FLAGS_NO_REGISTER_ALL), S2N_ERR_OPEN_RANDOM); 384 RESULT_GUARD_OSSL(ENGINE_set_init_function(e, s2n_openssl_compat_init), S2N_ERR_OPEN_RANDOM); 385 RESULT_GUARD_OSSL(ENGINE_set_RAND(e, &s2n_openssl_rand_method), S2N_ERR_OPEN_RANDOM); 386 RESULT_GUARD_OSSL(ENGINE_add(e), S2N_ERR_OPEN_RANDOM); 387 RESULT_GUARD_OSSL(ENGINE_free(e) , S2N_ERR_OPEN_RANDOM); 388 389 / Use that engine for rand() */ 390 e = ENGINE_by_id(S2N_RAND_ENGINE_ID); 391 RESULT_ENSURE(e != NULL, S2N_ERR_OPEN_RANDOM); 392 RESULT_GUARD_OSSL(ENGINE_init(e), S2N_ERR_OPEN_RANDOM); 393 RESULT_GUARD_OSSL(ENGINE_set_default(e, ENGINE_METHOD_RAND), S2N_ERR_OPEN_RANDOM); 394 RESULT_GUARD_OSSL(ENGINE_free(e), S2N_ERR_OPEN_RANDOM); 395 #endif 396 397 return S2N_RESULT_OK; 398 }

RoyZhang2022 commented 2 years ago

Hi,

I have identified the error is generated by ENGINE_add(e) from s2n_rand_init(). Is there a way to get the details from ENGINE_add(e)?

370 S2N_RESULT s2n_rand_init(void) 371 { 372 RESULT_GUARD_POSIX(s2n_rand_init_cb()); 373 374 RESULT_GUARD(s2n_ensure_initialized_drbgs()); 375 376 #if S2N_LIBCRYPTO_SUPPORTS_CUSTOM_RAND 377 / Create an engine / 378 ENGINE e = ENGINE_new(); 379 380 RESULT_ENSURE(e != NULL, S2N_ERR_OPEN_RANDOM); 381 RESULT_GUARD_OSSL(ENGINE_set_id(e, S2N_RAND_ENGINE_ID), S2N_ERR_OPEN_RANDOM); 382 RESULT_GUARD_OSSL(ENGINE_set_name(e, "s2n entropy generator"), S2N_ERR_OPEN_RANDOM); 383 RESULT_GUARD_OSSL(ENGINE_set_flags(e, ENGINE_FLAGS_NO_REGISTER_ALL), S2N_ERR_OPEN_RANDOM); 384 RESULT_GUARD_OSSL(ENGINE_set_init_function(e, s2n_openssl_compat_init), S2N_ERR_OPEN_RANDOM); 385 RESULT_GUARD_OSSL(ENGINE_set_RAND(e, &s2n_openssl_rand_method), S2N_ERR_OPEN_RANDOM); 386 RESULT_GUARD_OSSL(ENGINE_add(e), S2N_ERR_OPEN_RANDOM); 387 RESULT_GUARD_OSSL(ENGINE_free(e) , S2N_ERR_OPEN_RANDOM);

RoyZhang2022 commented 2 years ago

From openssl ENGINE_add() code, the error is possible from CRYPTO_THREAD_write_lock(global_engine_lock) as the e is not NULL and it has id and name set. Also it is not added more than once. The CRYPTO_THREAD_write_lock(global_engine_lock) failure seems be caused by global_engine_lock is 0.

int ENGINE_add(ENGINE *e)
{
    int to_return = 1;
    if (e == NULL) {
        ERR_raise(ERR_LIB_ENGINE, ERR_R_PASSED_NULL_PARAMETER);
        return 0;
    }
    if ((e->id == NULL) || (e->name == NULL)) {
        ERR_raise(ERR_LIB_ENGINE, ENGINE_R_ID_OR_NAME_MISSING);
        return 0;
    }
    **if (!CRYPTO_THREAD_write_lock(global_engine_lock))**
        return 0;
    if (!engine_list_add(e)) {
        ERR_raise(ERR_LIB_ENGINE, ENGINE_R_INTERNAL_LIST_ERROR);
        to_return = 0;
    }
    CRYPTO_THREAD_unlock(global_engine_lock);
    return to_return;
}
lrstewart commented 2 years ago

If the error is coming from openssl, can you use the openssl error methods to examine it? Something like printf("%s", ERR_error_string(ERR_get_error(), NULL)) right after the failure.

RoyZhang2022 commented 2 years ago

I got this error

error:13000067:engine routines::conflicting engine id

RoyZhang2022 commented 2 years ago

Is it possible s2n_rand_init() is executed in multiple threads?

lrstewart commented 2 years ago

Are you calling InitAPI in multiple threads? s2n_init can only be called once across all threads, so that would be an issue if the SDK doesn't have extra protection. It looks like they have something, but I'm not sure to what extent.

RoyZhang2022 commented 2 years ago

It may be in multiple threads. But we have a protection to run InitAPI only once.

  // The AWS SDK for C++ must be initialized by calling Aws::InitAPI.
  // It should only be initialized only once during the application running
  // All Connections in different thread must wait before the InitAPI is finished.
  if (!awsSDKReady_) {
    // Use awsSDKReady_ and mutex_ to guarantee InitAPI is executed before all 
    // Connection objects start to run.
    std::lock_guard< std::mutex > lock(mutex_);
    if (!awsSDKReady_) {
      options_.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Warn;

      Aws::InitAPI(options_);
      awsSDKReady_ = true;
    }
  }
lrstewart commented 2 years ago

Do you know what version of Openssl you're using as libcrypto? If not, can you add printf("%s", SSLeay_version(SSLEAY_VERSION)) to your code? And is this now building with the latest s2n-tls version? There are some fun locking issues with Openssl-1.0.2 we should have fixed in a recent change.

Edit: Although that actually shouldn't affect s2n_rand_init, because we initialize the locking callbacks after we call s2n_rand_init. That's probably a potential bug we need to fix, even if it isn't related to this issue.

RoyZhang2022 commented 2 years ago

I added the below code. I see ENGINE_add() is called multiple times. The openssl version is 3.0.5.

387 fprintf(stderr, "#### ENGINE_add() will be called\n"); 388 fprintf(stderr, "%s\n", SSLeay_version(SSLEAY_VERSION));

ENGINE_add() will be called

OpenSSL 3.0.5 5 Jul 2022

ENGINE_add() will be called

OpenSSL 3.0.5 5 Jul 2022

lrstewart commented 2 years ago

That's an unexpected development. s2n-tls doesn't actually support openssl-3.0 yet-- I'm curious how you're building with it. The known build failures are documented here, and we have a developer working on support here.

lrstewart commented 2 years ago

Ah but that's probably to do with how you're building s2n-tls vs how you're building your code and where you're calling the method to get the version.

There is this issue with the SDK: https://github.com/awslabs/aws-c-io/pull/508

RoyZhang2022 commented 2 years ago

We do not use s2n-tls or aws-c-io directly. We just use AWS SDK and download all dependencies by vcpkg. Not sure why openssl-3.0 is used.

For the fprintf info I put them to s2n_rand_init().

lrstewart commented 2 years ago

All your debug info points to s2n_init being called twice, which we know (and document) is illegal and will fail. It's likely that something in your application code or one of the libraries consuming s2n-tls is leading to multiple calls to s2n_init.

At this point I'm not sure I can offer any advice specific to s2n-tls, but if I were debugging your code I would:

  1. Rule out multiple threads as the issue. Run your code with only one thread.
  2. Compile (probably your code + aws-c-io to start) with debug symbols, run it with gdb, and use "info stack" to trace where the second s2n_init call is coming from. If you can't compile with debug symbols you could probably continue to use print statements to trace the duplicate calls up the stack.

You could also try reaching out to aws-c-io / the SDK.

RoyZhang2022 commented 2 years ago

Thanks for your suggestion. Let me try.

RoyZhang2022 commented 2 years ago

I think I have identified the root cause. We did call Aws::InitAPI multiple times. But sequentially, not parallelly. We called the Aws::InitAPI in our constructor with mutex protection for multi-threads. We called the Aws::ShutdownAPI in our destructor when no such objects are in use. We used a static int as a counter. But our tests are executed one by one and each one creates an object and destructs it when the test is finished. So we have a lot of Aws::InitAPI--> Aws::ShutdownAPI execution. Seems openssl ENGINE is not released in a sync mode, but in an async mode. The ShutdownAPI returns but the resource is not released. So we could see this problem.

I think you could write a test to run s2n_init and free in a loop to reproduce the problem. Sorry I could not build s2n-tls successfully on my local side because gnutls configure always fails due to Libnettle3.6 not found issue no matter what I do.

lrstewart commented 2 years ago

I think you could write a test to run s2n_init and free in a loop to reproduce the problem.

We have a test like that: https://github.com/aws/s2n-tls/blob/main/tests/unit/s2n_init_test.c

RoyZhang2022 commented 2 years ago

By adding debug info to function that calls ENGINE_remove(), I see there is one ENGINE_remove is not executed which causes this problem. It is related with our multi-threads execution but we have already protection for only once Aws::InitAPI() and Aws::ShutdownAPI() execution. Need to investigate the root cause behind that. But at least we have a workround.