boostorg / serialization

Boost.org serialization module
http://boost.org/libs/serialization
119 stars 139 forks source link

Serialization failure with MSVC and version 1.84 #309

Closed johnwason closed 1 month ago

johnwason commented 6 months ago

We are seeing a new issue with serialization using MSVC and version 1.84. Unit tests are failing, and it looks like the serialization is being corrupted with binary archives, and some tests are failing with XML archives. The serialization has worked with previous versions of serialization. I did a matrix test using version 1.82, 1.83, and 1.84 to confirm. 1.82 and 1.83 passed, while 1.84 failed. Here is the workflow run:

https://github.com/johnwason/tesseract/actions/runs/8242279639

The upstream source repository is here: https://github.com/tesseract-robotics/tesseract

Both tesseract and boost-serialization are large projects so I am not sure where the problem is. Were there any major changes in boost-serialization that would explain the problem?

@Levi-Armstrong

robertramey commented 6 months ago

I looked at the links cited. I'm not sure how to understand the build script and how/if it points to the boost serialization library.

johnwason commented 6 months ago

This is the test that is failing: https://github.com/tesseract-robotics/tesseract/blob/master/tesseract_environment/test/tesseract_environment_serialization_unit.cpp

This contains the test utility functions that implement serialization: https://github.com/tesseract-robotics/tesseract/blob/master/tesseract_common/include/tesseract_common/unit_test_utils.h

LowLevelMahn commented 5 months ago

im currently upgrading to 1.84 - so just commenting here to be informed by github if there is a real issue

johnwason commented 5 months ago

I have narrowed down the problem to the "command" class tests, and tests that use commands.

An example of the test failure is EnvironmentCommandsSerializeUnit.AddLinkCommand in the test function testSerializationDerivedClass() located here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/test/tesseract_environment_serialization_unit.cpp#L145

Within testSerializationDerivedClass() the failure is here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_common/include/tesseract_common/unit_test_utils.h#L145

As far as I can tell this looks like a failure related to polymorphic types. This error was not occurring in previous boost serialization versions.

The base class Command is defined here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/include/tesseract_environment/command.h#L81

And the example subclass AddLinkCommand is located here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/include/tesseract_environment/commands/add_link_command.h#L45

The implementation for the commands is located here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/src/command.cpp#L35

https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/src/commands/add_link_command.cpp#L41

There are many other tests failing that all use the Command base class.

@Levi-Armstrong

johnwason commented 5 months ago

Note that the serialization to string and XML works correctly so I don't think the issue is in the tesseract library.

johnwason commented 5 months ago

I did more tinkering and was able to find this difference between the failing unit test and an isolated example that seems to work correctly. ChangeJointPositionLimitsCommand2.binary is from the failing test, and ChangeJointPositionLimitsCommand3.binary is from the working test. The only difference is in like 60:

$ xxd ChangeJointPositionLimitsCommand2.binary
00000000: 1600 0000 0000 0000 7365 7269 616c 697a  ........serializ
00000010: 6174 696f 6e3a 3a61 7263 6869 7665 1400  ation::archive..
00000020: 0404 0408 0100 0000 0001 0000 0002 0020  ............... 
00000030: 0000 0000 0000 0043 6861 6e67 654a 6f69  .......ChangeJoi
00000040: 6e74 506f 7369 7469 6f6e 4c69 6d69 7473  ntPositionLimits
00000050: 436f 6d6d 616e 6401 0000 0000 0000 0000  Command.........
00000060: 0000 0000 0001 0000 000c 0000 0000 0000  ................
00000070: 0000 0100 0000 0000 0000 0800 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0006 0000 0000  ................
00000090: 0000 006a 6f69 6e74 3600 0000 0000 182d  ...joint6......-
000000a0: 4454 fb21 09c0 182d 4454 fb21 0940       DT.!...-DT.!.@

$ xxd ChangeJointPositionLimitsCommand3.binary
00000000: 1600 0000 0000 0000 7365 7269 616c 697a  ........serializ
00000010: 6174 696f 6e3a 3a61 7263 6869 7665 1400  ation::archive..
00000020: 0404 0408 0100 0000 0001 0000 0002 0020  ...............
00000030: 0000 0000 0000 0043 6861 6e67 654a 6f69  .......ChangeJoi
00000040: 6e74 506f 7369 7469 6f6e 4c69 6d69 7473  ntPositionLimits
00000050: 436f 6d6d 616e 6401 0000 0000 0000 0000  Command.........
00000060: 0100 0000 0001 0000 000c 0000 0000 0000  ................
00000070: 0000 0100 0000 0000 0000 0800 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0006 0000 0000  ................
00000090: 0000 006a 6f69 6e74 3600 0000 0000 182d  ...joint6......-
000000a0: 4454 fb21 09c0 182d 4454 fb21 0940       DT.!...-DT.!.@

The serialization step is here: https://github.com/johnwason/tesseract/blob/51a62b5bf3d844f8605b76879ac0f4c38b7d059b/tesseract_environment/src/commands/change_joint_position_limits_command.cpp#L81 and the class is defined here: https://github.com/johnwason/tesseract/blob/51a62b5bf3d844f8605b76879ac0f4c38b7d059b/tesseract_environment/include/tesseract_environment/commands/change_joint_position_limits_command.h#L45

Maybe someone with more knowledge of the serialization format can explain what could cause this difference.

johnwason commented 4 months ago

I have been able to isolate and reproduce the problem.

Here is the repo with the test code: https://github.com/johnwason/boost_serialization_1_84_bug

And here is the action run: https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991

What I have discovered is that if the classes being serialized is in a shared library, then the serialization will fail. The failure only occurs for binary archives. The action run tests version 1.83 and version 1.82 of boost serialization, and it appears to work correctly.

Here is the cmake file. The tests for test_no_lib_prog works, while test_lib_prog fails.

cmake_minimum_required(VERSION 3.15.0)
project(boost_serialization_1_84_bug LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)

find_package(Boost REQUIRED COMPONENTS system filesystem serialization)
find_package(GTest REQUIRED)

add_library(test_lib SHARED cmd.cpp cmd2.cpp)
target_link_libraries(test_lib Boost::boost
Boost::system
Boost::filesystem
Boost::serialization
GTest::GTest GTest::Main)
set_target_properties(test_lib PROPERTIES WINDOWS_EXPORT_ALL_SYMBOLS TRUE)

add_executable(test_lib_prog test.cpp)
target_link_libraries(test_lib_prog test_lib Boost::boost
Boost::system
Boost::filesystem
Boost::serialization
GTest::GTest GTest::Main)

add_executable(test_no_lib_prog test.cpp cmd.cpp cmd2.cpp)
target_link_libraries(test_no_lib_prog Boost::boost
Boost::system
Boost::filesystem
Boost::serialization
GTest::GTest GTest::Main)

enable_testing()

add_test(NAME test_no_lib COMMAND test_no_lib_prog)
add_test(NAME test_lib COMMAND test_lib_prog)

Here is the ctest log:

test 1
    Start 1: test_no_lib

1: Test command: D:\a\boost_serialization_1_84_bug\boost_serialization_1_84_bug\build\Release\test_no_lib_prog.exe
1: Working Directory: D:/a/boost_serialization_1_84_bug/boost_serialization_1_84_bug/build
1: Test timeout computed to be: 10000000
1: [==========] Running 6 tests from 2 test suites.
1: [----------] Global test environment set-up.
1: [----------] 3 tests from serialization_direct_test
1: [ RUN      ] serialization_direct_test.xml
1: [       OK ] serialization_direct_test.xml ([20](https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991#step:6:21) ms)
1: [ RUN      ] serialization_direct_test.binary
1: [       OK ] serialization_direct_test.binary (0 ms)
1: [ RUN      ] serialization_direct_test.xml_string
1: [       OK ] serialization_direct_test.xml_string (0 ms)
1: [----------] 3 tests from serialization_direct_test ([21](https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991#step:6:22) ms total)
1: 
1: [----------] 3 tests from serialization_derived_test
1: [ RUN      ] serialization_derived_test.xml
1: [       OK ] serialization_derived_test.xml (1 ms)
1: [ RUN      ] serialization_derived_test.binary
1: [       OK ] serialization_derived_test.binary (0 ms)
1: [ RUN      ] serialization_derived_test.xml_string
1: [       OK ] serialization_derived_test.xml_string (0 ms)
1: [----------] 3 tests from serialization_derived_test (2 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 6 tests from 2 test suites ran. ([24](https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991#step:6:25) ms total)
1: [  PASSED  ] 6 tests.
1/2 Test #1: test_no_lib ......................   Passed    0.08 sec
test 2
    Start 2: test_lib

2: Test command: D:\a\boost_serialization_1_84_bug\boost_serialization_1_84_bug\build\Release\test_lib_prog.exe
2: Working Directory: D:/a/boost_serialization_1_84_bug/boost_serialization_1_84_bug/build
2: Test timeout computed to be: 10000000
2: [==========] Running 6 tests from 2 test suites.
2: [----------] Global test environment set-up.
2: [----------] 3 tests from serialization_direct_test
2: [ RUN      ] serialization_direct_test.xml
2: [       OK ] serialization_direct_test.xml (1 ms)
2: [ RUN      ] serialization_direct_test.binary
2: D:\a\boost_serialization_1_84_bug\boost_serialization_1_84_bug\unit_test_utils.h(73): error: Value of: *object_derived != *nobject_derived
2:   Actual: true
2: Expected: false
2: 
2: [  FAILED  ] serialization_direct_test.binary (0 ms)
2: [ RUN      ] serialization_direct_test.xml_string
2: [       OK ] serialization_direct_test.xml_string (0 ms)
2: [----------] 3 tests from serialization_direct_test (2 ms total)
2: 
2: [----------] 3 tests from serialization_derived_test
2: [ RUN      ] serialization_derived_test.xml
2: [       OK ] serialization_derived_test.xml (1 ms)
2: [ RUN      ] serialization_derived_test.binary
2: D:\a\boost_serialization_1_84_bug\boost_serialization_1_84_bug\unit_test_utils.h(73): error: Value of: *object_derived != *nobject_derived
2:   Actual: true
2: Expected: false
2: 
2: [  FAILED  ] serialization_derived_test.binary (1 ms)
2: [ RUN      ] serialization_derived_test.xml_string
2: [       OK ] serialization_derived_test.xml_string (0 ms)
2: [----------] 3 tests from serialization_derived_test (3 ms total)
2: 
2: [----------] Global test environment tear-down
2: [==========] 6 tests from 2 test suites ran. (6 ms total)
2: [  PASSED  ] 4 tests.
2: [  FAILED  ] 2 tests, listed below:
2: [  FAILED  ] serialization_direct_test.binary
2: [  FAILED  ] serialization_derived_test.binary
2: 
2:  2 FAILED TESTS
2/2 Test #2: test_lib .........................***Failed    0.02 sec

[50](https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991#step:6:51)% tests passed, 1 tests failed out of 2

Total Test time (real) =   0.10 sec

The following tests FAILED:
      2 - test_lib (Failed)
Errors while running CTest
johnwason commented 3 months ago

I am still seeing this failure in boost 1.85.

johnwason commented 3 months ago

Are there any suggestions where I can begin troubleshooting?

correaa commented 3 months ago

I can’t help you, I don’t have experience with Windows. but I am curious what you do mean by “classes being serialized is in a shared library”? do you mean that especial members are linked in a dynamic library? dll?

robertramey commented 3 months ago

Sorry I haven't been more responsive. As far as Boost is concerned, I've been bogged down in build/test issues which have nothing to do with the serialization library itself.

robertramey commented 3 months ago

You've done some real work here. I would agree that your experiments with different archives suggest that it's an issue with binary archive suggest that the problem lies there. That narrows things down a lot.

Could you send the source code to your tests - both the failing and passing one?

johnwason commented 3 months ago

The source code for the isolated test is here: https://github.com/johnwason/boost_serialization_1_84_bug . It includes an actions workflow that does a parametric test of different boost versions.

The original problem was discovered in the Tesseract robot motion planner: https://github.com/tesseract-robotics/tesseract . The isolated test was extracted from this source.

correaa commented 3 months ago

The source code for the isolated test is here: https://github.com/johnwason/boost_serialization_1_84_bug . It includes an actions workflow that does a parametric test of different boost versions.

Excuse my ignorance, but what is the meaning of "&=" for an Archive in cmd.cpp (line 15 for example) ?

robertramey commented 3 months ago

LOL - good question! I'd be curious to hear the answer as well.

Levi-Armstrong commented 3 months ago

It looks like it is a typo. There are 656 lines of serialization code within Tesseract and only two lines of code has the syntax ar&= which is clearly not correct.

Levi-Armstrong commented 3 months ago

@robertramey and @correaa Any thoughts why this syntax works at all?

robertramey commented 3 months ago

I'm mystified by this. I did look into it but didn't see wow it would not produce a syntax error. I'm going to run your test case with the debugger. I expect this will address the error you found and also shed light on this.

johnwason commented 3 months ago

I changed &= to & and it appears to have worked?? https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/9454574752/job/26042427949

I am getting the warning "warning C4552: '&': result of expression not used" now. Can this be ignored, or is there something else we are missing?

I am going to test the main tesseract library to see if the test pass.

correaa commented 3 months ago

I guess that &= was converting either side of the expression to int. I am unaware that such conversions are possible, but it is worth checking. Also compiling with all conversion warnings on.

johnwason commented 3 months ago

My isolated test case is fixed, but I am still getting crashes and failures in the main library. I am only seeing this issue for the binary archive so I don't think it is something in the client code.

correaa commented 3 months ago

Isolate the error again?

robertramey commented 3 months ago
  1. create test case. You've done that. I'm not familiar with GTEST so I might change the test case again.
  2. verify that the test case builds and still fails on my development system.
  3. use the debugger in my IDE to step through the test case. Since you've got a passing AND a failing instance, it should be fairly easily to compare the two and isolate the error. It is tedious though. But it does work.
  4. I would do the same with the &= issue. This should permit one to trace into any conversions which occur. when you find something, looking at the stack in the debugger is very useful.

As I've said, I don't see the operator &=defined for an archive type - so I would expect a syntax error.

johnwason commented 3 months ago

It looks like I nerfed the test rather than fixing the error. Here is the corrected run showing that it is still failing:

https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/9458596594

johnwason commented 3 months ago

One possibility I noticed is that there is a singleton template that is being initialized with separate static values in the different libraries. This would mean that the base class is seeing a different singleton than the derived class when trying to serialize. Was something involving singletons changed in the more recent versions?

Levi-Armstrong commented 1 month ago

@robertramey and @correaa This is still failing after the fix.

Levi-Armstrong commented 1 month ago

I found the other issue. The cpp was missing BOOST_CLASS_EXPORT_IMPLEMENT(tesseract_environment1::Command).

johnwason commented 1 month ago

The fix by @Levi-Armstrong fixed the test case. I am working on the full library test now.

... Why did this work before for other boost versions and different OS??

johnwason commented 1 month ago

This appears to be fixed with https://github.com/tesseract-robotics/tesseract/pull/1035

Closing the issue.