ETLCPP / etl

Embedded Template Library
https://www.etlcpp.com
MIT License
2.2k stars 391 forks source link

undefined behaviour and memory issues. #536

Closed melg8 closed 2 years ago

melg8 commented 2 years ago

What happened

Hi, you've created a great library! But i discovered that source code of tests and maybe source code of library itself contains multiple instances of at least two kinds of problems:

  1. undefined behavior
  2. memory issues.

How to reproduce

It can be easily reproduced by addition of "-fsanitize=address,undefined" to CMAKE_CXX_FLAGS on every supported compiler. As for example:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address,undefined")

Or for single type of check:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=undefined")

You can checkout concrete example for gcc/clang from this commit

You can see failing results in github actions ci.

Reports example

Address 0x7ffe36665b90 is located in stack of thread T0 at offset 928 in frame

0 0x55db03f744a5 in (anonymous namespace)::Suitetest_callback_timer_interrupt::Testcallback_timer_interrupt_log_timer_calls::RunImpl() const (/home/runner/work/etl/etl/test/etl_tests+0xadcf4a5)



# Why it matters
Brief analysis of issues in repository suggested, that some tools already warned at least about undefined behavior elements in test code: [array index issue](https://github.com/ETLCPP/etl/issues/248) but were explicitly [ignored](https://github.com/ETLCPP/etl/blob/5468eb659c8b4ecdb3d08e8b8f0442c5a2549a48/test/test_array_wrapper.cpp#L35).
Even though this particular linked issue isn't a big deal by itself. But it can have negative effects on final code quality:

   1. New versions of compilers have optimization strategies based on assumption, that program isn't ill-formed and doesn't have undefined behavior code. Under that assumption, the optimizer can optimize away some checks inside tests, so they will never have a chance to report when some regression bug happens.
   2. Ignoring a whole category of ub bugs in the assumption that it only contains "silly" stuff from "too pedantic" check inside of test code - can hide some real bugs in code.
   3. stl/boost and other well established libraries on market all can pass its tests without ubsan/asan errors. Some organizations have policies to have this checks on and passing. So some of potential users couldn't use this library.
   4. Any memory related bugs can prohibit usage of fuzzing techniques because they usually use sanitizers too.
   5. I've started poking around with these flags after I encountered, that tests doesn't pass/or hang under my machine using gcc10.3 and gcc 11.2 and clang 13.0.1. And they have different behavior under different compilers. So my guess, that some issues found by asan/ubsan - are real problems(in tests themselves or in library code).

ping @jwellbelove
jwellbelove commented 2 years ago

Thanks, I'll take a look at the issues you've raised.

jwellbelove commented 2 years ago

I get some really odd errors that I'm not sure how resolve, such as...

test_shared_message.cpp:279:92: runtime error: member call on address 0x7fffc1004180 which does not point 
to an object of type 'reference_counted_message'
0x7fffc1004180: note: object is of type 'etl::ireference_counted_message'

reference_counted_message is derived from etl::ireference_counted_message. ireference_counted_message is a pure virtual base class.

The error is generated from just constructing a reference_counted_message.

melg8 commented 2 years ago

@jwellbelove can you provide on which branch/commit and with which compiler you get this particular error (and with which flag exactly)? So i can try reproduce it locally and further investigate?

So far i find similar issue triggered by test_indirect_vector.cpp:

include/etl/indirect_vector.h:457:30: runtime error: member call on address 0x7ffe516756f0 which does not point to an object of type 'etl::ivector<TestDataDC<std::__cxx11::basic_string<char> > *>'
0x7ffe516756f0: note: object is of type 'etl::vector_base'
 fe 7f 00 00  d0 3b 5f 00 00 00 00 00  0c 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  18 57 67 51
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'etl::vector_base'

will use it for reference.

jwellbelove commented 2 years ago

The branch I am using is derived from master. I am compiling on Ubuntu 20.04 with GCC 10.3.0

TEST(test_reference_counted_pool_exceptions) in test_shared_message.cpp

jwellbelove commented 2 years ago

"So far i find similar issue triggered by test_indirect_vector.cpp:"

I have also seen that, and seems to be a similar problem, as etl::ivector is derived from etl::vector_base.

jwellbelove commented 2 years ago

I have the GCC flags -fsanitize=address,undefined

melg8 commented 2 years ago

@jwellbelove I try to minimize reproducible example, so far minimal reproducible looks like this:

#include "unit_test_framework.h"

#include "etl/indirect_vector.h"

template <typename T>
class IIndirectVector
{
protected:
  IIndirectVector(etl::ivector<T*>& lookup_, etl::ipool& storage_)
    : lookup(lookup_)
    , storage(storage_)
  {
    //    lookup_.clear();  // Triggers.
  }

#if defined(ETL_POLYMORPHIC_INDIRECT_VECTOR) || defined(ETL_POLYMORPHIC_CONTAINERS)
public:
  virtual
#else
protected:
#endif
    ~IIndirectVector()
  {
    //    lookup.clear();  // Triggers.
    storage.available();  // Doesn't trigger.
  }

  etl::ivector<T*>& lookup;
  etl::ipool&       storage;
};

template <typename T, const size_t MAX_SIZE>
class TestVector : public IIndirectVector<T>
{
public:
  TestVector()
    : IIndirectVector<T>(lookup_vector, storage_pool), test(lookup_vector)
  {
    test.clear();  // Doesn't trigger.
  }

  etl::vector<T*, 1UL> lookup_vector;
  etl::ivector<T*>&    test;
  etl::pool<T, 1UL>    storage_pool;
};

namespace
{
  SUITE(test_indirect_vector)
  {
    struct SetupFixture
    {
    };

    TEST_FIXTURE(SetupFixture, test_default_constructor)
    {
      etl::vector<int*, 10UL> lookup_vector;
      etl::ivector<int*>&     lookup = lookup_vector;
      lookup.clear();  // Doesn't trigger.

      TestVector<int, 1UL> data;
    }
  };
}  // namespace
melg8 commented 2 years ago

@jwellbelove I find that ETL_DECLARE_DEBUG_COUNT in base class of vector_base causes problem. Idk why so far. Second version of minimization:

#include "unit_test_framework.h"

#include "etl/debug_count.h"

namespace etl
{
  class vector_base1
  {
  protected:
    vector_base1(size_t max_size_)
    {
    }

    virtual ~vector_base1()
    {
    }

    ETL_DECLARE_DEBUG_COUNT  // This triggers!
  };

  template <typename T>
  class ivector1 : public etl::vector_base1
  {
  protected:
    ivector1(T* p_buffer_, size_t MAX_SIZE)
      : vector_base1(MAX_SIZE)
    {
    }

    ~ivector1()
    {
    }

  public:
    void clear()
    {
    }
  };

  template <typename T, const size_t MAX_SIZE_>
  class vector1 : public etl::ivector1<T>
  {
  public:
    vector1()
      : etl::ivector1<T>(nullptr /*reinterpret_cast<T*>(&buffer)*/, MAX_SIZE_)
    {
    }
  };

}  // namespace etl

template <typename T>
class IIndirectVector
{
protected:
  IIndirectVector(etl::vector1<T*, 1>& lookup_)
    : lookup(lookup_)
  {
    //    lookup_.clear();  // Triggers.
  }

#if defined(ETL_POLYMORPHIC_INDIRECT_VECTOR) || defined(ETL_POLYMORPHIC_CONTAINERS)
public:
  virtual
#else
protected:
#endif
    ~IIndirectVector()
  {
    lookup.clear();  // Triggers.
  }

  etl::vector1<T*, 1>& lookup;
};

template <typename T, const size_t MAX_SIZE>
class TestVector : public IIndirectVector<T>
{
public:
  TestVector()
    : IIndirectVector<T>(lookup_vector)
  {
  }

  etl::vector1<T*, 1UL>
    lookup_vector;
};

SUITE(test_indirect_vector)
{
  struct SetupFixture
  {
  };

  TEST_FIXTURE(SetupFixture, test_default_constructor)
  {
    TestVector<int, 1UL> data;
  }
};
jwellbelove commented 2 years ago

Thanks for the investigation work. That's really narrowed the issue down.

melg8 commented 2 years ago

@jwellbelove what is interesting, if i replace debug_count with copy-paste definition, and just rename it to debug_count1 - issue disappear. Does main library build into lib and linked to tests, or it's just header only representation used in tests?

Edit: maybe i was wrong, will post final edition some time later.

melg8 commented 2 years ago

@jwellbelove here is reproducible and you can check different ideas with different compilers. https://godbolt.org/z/aKqxWTj8K

melg8 commented 2 years ago

@jwellbelove okay, i think, i found it. Not working: https://godbolt.org/z/q1jqKov4M Working: https://godbolt.org/z/vhnrGrM3q

So basically problem setup is here: https://github.com/ETLCPP/etl/blob/5468eb659c8b4ecdb3d08e8b8f0442c5a2549a48/include/etl/indirect_vector.h#L1422-L1426

And problem itself is here: https://github.com/ETLCPP/etl/blob/5468eb659c8b4ecdb3d08e8b8f0442c5a2549a48/include/etl/indirect_vector.h#L1321-L1324

And problem is in order of field initialization:

Initialization order

The order of member initializes in the list is irrelevant: the actual order of initialization is as follows:
1) If the constructor is for the most-derived class, virtual bases are initialized in the order in which they appear in depth-first left-to-right traversal of the base class declarations (left-to-right refers to the appearance in base-specifier lists)
2) Then, direct bases are initialized in left-to-right order as they appear in this class's base-specifier list
3) Then, non-static data member are initialized in order of declaration in the class definition.
4) Finally, the body of the constructor is executed

From here

So what happens is when initializing indirect_vector we at step 2) initialize etl::iindirect_vector base class, but we initialize them with values of lookup_vector and storage_pool which will be initialized only at step 3) And that's an undefined behavior.

People already mentioned this kind of problem.

I think you can try to adopt some variant of existing solution for this problem from boost - or alternatively change constructors somehow to initialize them in proper order.

Btw, would really appreciate if you star my repo cit (: slowly try to make it more popular, and gather people's opinions about how to improve it.

Hope my investigation helps and you will find best fitting solution for this problem!

jwellbelove commented 2 years ago

_So what happens is when initializing indirect_vector we at step 2) initialize etl::iindirect_vector base class, but we initialize them with values of lookup_vector and storagepool which will be initialized only at step 3) And that's an undefined behavior.

The base class is initialised with references to lookup_vector and storage_pool (which have had their storage defined at that point), which initialise local references in the base class. The base class constructor does not use them at all, so there is no undefined behaviour from calling un-constructed objects.

This is a technique (of calling the base class constructors with references to derived member resources) that is used throughout the ETL's containers and which the sanitizer appears to has no issues with in those cases.

I think the issue with reference counted messages and indirect vector is maybe a little more complex (though possibly related).

It may also be a false positive.

jwellbelove commented 2 years ago

I am also finding the sanitizer errors in TEST(test_reference_counted_pool_exceptions) hard to comprehend as the base class of etl::reference_counted_message (which is etl::ireference_counted_message) is pure virtual and is not passed any resources from the derived class.

The sanitizer gets upset at calling the overridden virtual function release() (which uses *this) after the etl::reference_counted_message has been constructed.

The sanitizer has no issues with calling the overridden virtual functions that do not use *this.

etl::reference_counted_message<Message1, etl::atomic_int> temp(message1, message_pool);
temp.get_reference_counter().get_reference_count(); // OK
temp.release(); // Sanitizer runtime error
melg8 commented 2 years ago

@jwellbelove can you upload whole setup, which reproduce (for example in ci) test_reference_counted_pool_exceptions in some separate branch, i was getting error with vector, but when i enable compilation of only test_shared_message.cpp with clang 13 and -fsanitize=undefined from master branch i get it passing:

Success: 5 tests passed.
Test time: 0.00 seconds.
jwellbelove commented 2 years ago

I'm using -fsanitize=address,undefined

melg8 commented 2 years ago

@jwellbelove i've tried with both flags enabled, in this case still, clang thinks all ok:

Success: 5 tests passed.
Test time: 0.00 seconds.

Btw i have ETL_NO_STL off. should i change it to ON?

On the bright side - even if this is somehow a false positive - we could make a reproducible out of it and give it to clang/gcc developers - because, at least for vector case - it has persisted across different compilers (you've used gcc, and i used clang).

jwellbelove commented 2 years ago

STL on or off makes no difference for me.

melg8 commented 2 years ago

@jwellbelove okay, so that is what you were talking about?

I have some issues with my system (can't compile with gcc using gust this flag - get linking error about dlsym) but at least in ci - i can see that error now.

Already interesting that only clang 9 on ubuntu triggers.

Btw there maybe issue with your pipeline. It is named Clang 9 linux but from what i see uses gcc:

  build-clang-9-linux-no-stl:
    name: Clang-9 Linux - No STL
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-20.04]

    steps:
    - uses: actions/checkout@v2

    - name: Build
      run: |
        git fetch --tags
        cmake -DBUILD_TESTS=ON -DNO_STL=ON -DETL_USE_TYPE_TRAITS_BUILTINS=OFF -DETL_USER_DEFINED_TYPE_TRAITS=OFF -DETL_FORCE_TEST_CPP03=OFF ./
        gcc --version
        make
melg8 commented 2 years ago

@jwellbelove I've reduced test_reference_counted_pool_exceptions so far to this: https://godbolt.org/z/zG1fMzTb8 And version without any templates: https://godbolt.org/z/a1h7hY113

jwellbelove commented 2 years ago

The non-template is a good minimal example of the issue.

I've been checking using clang 10.0.0 and I get a completely different set of issues. It seems to have an intermittent 4/8 byte alignment error in the etl::unordered_multiset tests.

melg8 commented 2 years ago

@jwellbelove does that example speak to you in some way? https://godbolt.org/z/rPMs1xPPd I've manually added placement new from local buffer, and problem is gone.

So in case of placement new - object is de-allocated using release method, which calls destructor manually. But in case of stack based object - we get destructor called twice? first - with release method, second - at the end of scope? which mess-up vtable for object, so ubsan detects it as wrong kind of object? Illustration of this effect As mentioned here (stackoverflow answer) calling destructor twice is ub.

So i've tried this code:

#include "unit_test_framework.h"

#include "etl/fixed_sized_memory_block_allocator.h"
#include "etl/reference_counted_message_pool.h"
#include "etl/shared_message.h"

namespace {
constexpr etl::message_id_t MessageId1 = 1U;
constexpr etl::message_id_t MessageId2 = 2U;

struct Message1 : public etl::message<MessageId1> {
  Message1(int i_) : i(i_) {}

  ~Message1() {}

  int i;
};

struct Message2 : public etl::message<MessageId2> {
  ~Message2() {}
};

SUITE(test_shared_message) {
  using pool_message_parameters =
      etl::atomic_counted_message_pool::pool_message_parameters<Message1,
                                                                Message2>;

  etl::fixed_sized_memory_block_allocator<
      pool_message_parameters::max_size, pool_message_parameters::max_alignment,
      4U>
      memory_allocator;

  etl::atomic_counted_message_pool message_pool(memory_allocator);

  TEST(test_reference_counted_pool_exceptions) {
    using pool_message_parameters =
        etl::atomic_counted_message_pool::pool_message_parameters<Message1,
                                                                  Message2>;

    etl::fixed_sized_memory_block_allocator<
        pool_message_parameters::max_size,
        pool_message_parameters::max_alignment, 4U>
        memory_allocator;

    etl::atomic_counted_message_pool message_pool(memory_allocator);

    etl::reference_counted_message<Message1, etl::atomic_int>* prcm;
    CHECK_NO_THROW(prcm = message_pool.allocate<Message1>(6));

    using rcm = etl::reference_counted_message<Message1, etl::atomic_int>;
    alignas(rcm) unsigned char buf[sizeof(rcm)];

    Message1 message1(6);

    rcm* temp = new (buf) rcm(message1, message_pool);

    try {
      message_pool.release(*temp);
    } catch (etl::exception e) {
      CHECK_EQUAL(std::string("reference_counted_message_pool:release failure"),
                  std::string(e.what()));
    }
  }
}
}  // namespace

And it passes tests.

jwellbelove commented 2 years ago

The issue with the reference counted message test was that the release function would always destruct the object, regardless of whether the allocator was the owner or not.

jwellbelove commented 2 years ago

Fixed in 20.28.0