autowarefoundation / autoware.universe

https://autowarefoundation.github.io/autoware.universe/
Apache License 2.0
1.01k stars 650 forks source link

On abusive use of `reinterpret_cast` UB #3215

Closed VRichardJP closed 1 year ago

VRichardJP commented 1 year ago

Checklist

Description

As pointed here, there are places in Autoware where reinterpret_cast is abusively used to "convert" data types.

There are endless talks over the internet about how misleading reinterpret_cast is. Basically, dereferencing a reinterpreted pointer p with *reinterpret_cast<T*>(p) is only valid if p originally points to data of type T (e.g. p = new T()), or a const/signed/unsigned variant, or if T==char (or unsigned char or std::byte). If p originally points to any other type, even if "it looks" the same, is straight UB.

For example, copying PointXYIZ from uint8_t[] like here is UB:

https://github.com/autowarefoundation/autoware.universe/blob/cb103ec9c8149fffcf2a665ca37b69fe465e0a61/sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp#L123

To create a valid PointXYZI from the buffer, it would be necessary at least to cast each field separately (e.g. like p.x = *reinterpret_cast<float*>(&data[p_offset + x_field_offset])).

Edit: according to this excellent writeup, reinterpret casting from a buffer violates strict aliasing rules (but I guess it is ok in the scope of an immediate copy?), and could run into misaligned memory issues (but all the data we manipulate is always aligned, right?...). The author suggests the only safe way to manipulate data is to memcpy (yew!). Hopefully, C++ makes it possible to wrap things and avoid the memcpy boilerplate (for example the const_unaligned_pointer described here: http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html)

Similarly, even though PointXYZIRADRT and PointXYZI "start the same", it is UB to use reinterpret_cast to copy between the 2 structs. For example:

PointXYZIRADRT p1{};
PointXYZI p2 = *reinterpret_cast<PointXYZI*>(&p1); // UB

It may "work" now... until the compiler decides otherwise.

Purpose

Remove all UB reinterpret_cast in Autoware code base.

Possible approaches

  1. grep!
  2. fix

Definition of done

No more yolo reinterpret_cast UB.

VRichardJP commented 1 year ago

Taxonomy of reinterpret_cast usage in autoware.universe

1. Passing pointer to an API

For example:

sockaddr_in client;
int new_sock = accept(sock, reinterpret_cast<sockaddr *>(&client), &len);

Valid. Exactly what reinterpret_cast has been designed for. Note that it is recommended to use static_cast in the case it is only necessary to cast from/to void*.

2. Reading/Writing an object byte representation

For example:

file.write(reinterpret_cast<const char *>(plan_->data()), plan_->size());

Valid. It is possible to cast into char*, unsigned char*, std::byte* and their const variants. Writing the object bytes is legal, but not portable (e.g. because endianess, data type size, etc).

3. Reading/Writing from/into a buffer as if is was an object

For example:

reinterpret_cast<float32_t *>(v.data())[i] = dis(gen)

This is UB and dangerous. I know everyone does that and "it works" because most compilers are kind enough, but it's a tic-toc bomb you leave behind. First, why is that not valid C++ code? C++ has many complex rules regarding what makes a valid "object" and using reinterpret_cast<T*> on a buffer is not one of them. reinterpret_cast is basically a way to say to the compiler "trust me, someone else has created a valid object here before".

In practice however, this UB is known to work perfectly fine under the following conditions:

  1. T is trivially copiable (such as int, float, struct { int, char })
  2. the buffer meet the alignment requirement of T (e.g. float must be aligned on 4 bytes)
  3. you are not messing around with aliasing (because you violate strict aliasing rule)

Condition 1 is not a problem in autoware code base, as I have ever only seen reinterpret_cast with simple types and structs. Condition 3 is normally not a problem in this context. To make the compiler break the program because of strict aliasing, you really have to want it. Condition 2 is the important one. If the buffer is not aligned properly, the compiler may definitely break the program. Luckily (or unluckily depending how you see it) modern x86 and ARM64 can access badly aligned data (just slightly slower). So the program may still run fine when optimization level is low. However if you start to push the compiler for more advanced optimizations (e.g. -msse, -march=native), then your program will either compute rubbish and simply fail. A good illustration here.

If 1, 2 and 3 are met, the reinterpret_cast is officially an undefined behavior but works in practice. However, if you can't be 100% sure the data is properly aligned (for example if the buffer is created by another library), then it is better to use std::memcpy. With -O2 or -O3 the memcpy has no performance impact over the unsafe reinterpret_cast.

4. Reading/Writing data from different types

This is just a worse version of 3. Even if the types looks the same, there is no reason not to copy the data properly, which can't break the program and will be optimized out by the compiler.

Grep time

Here is a list of all reinterpret_cast use in universe, categorized by pattern (from 1 to 4). For type 3, I make the difference between A and B: A when the cast is "safe" (typically alignment is correct) and B when it smells:

perception/traffic_light_ssd_fine_detector/src/nodelet.cpp
240:        reinterpret_cast<const float *>(pixels.datastart), // Type 3A: cv::Mat data is aligned on 16-bytes
241:        reinterpret_cast<const float *>(pixels.dataend));  // Type 3A: cv::Mat data is aligned on 16-bytes

perception/traffic_light_ssd_fine_detector/lib/src/trt_ssd.cpp
164:  file.write(reinterpret_cast<const char *>(plan_->data()), plan_->size()); // Type 2

perception/lidar_apollo_instance_segmentation/src/detector.cpp
72:    outfile.write(reinterpret_cast<char *>(plan->data()), plan->size()); // Type 2

perception/lidar_apollo_instance_segmentation/lib/include/Utils.hpp
122:  *reinterpret_cast<T *>(buffer) = val;        // Type 3B: unaligned write -> should use memcpy instead
129:  val = *reinterpret_cast<const T *>(buffer);  // Type 3B: unaligned read -> should use memcpy instead

perception/traffic_light_ssd_fine_detector/lib/include/cuda_utils.hpp
60:  CHECK_CUDA_ERROR(::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(U) * n));  // Type 1
70:  CHECK_CUDA_ERROR(::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(T)));      // Type 1

perception/traffic_light_classifier/utils/trt_common.cpp
99:    reinterpret_cast<const void *>(engine_str.data()), engine_str.size()));      // Type 1

perception/traffic_light_classifier/utils/trt_common.hpp
94:  ::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(U) * n); // Type 1
104:  ::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(T));    // Type 1

perception/bytetrack/lib/include/lapjv.h
52:  if ((x = reinterpret_cast<t *>(malloc(sizeof(t) * (n)))) == 0) { \  // Speciale case. Looks like a C code (t*)malloc(...) rewritten for C++. static_cast<T*>(malloc()) is a valid C++ object construct, but reinterpret_cast is not. I doubt any compiler would break this, but it could simplified to (x = new t[(n)]). No need for any cast

perception/lidar_apollo_segmentation_tvm/test/main.cpp
55:    reinterpret_cast<float32_t *>(v.data())[i] = dis(gen); // Type 3B: v.data() is not aligned to float32_t 

perception/compare_map_segmentation/include/compare_map_segmentation/multi_voxel_grid_map_update.hpp
175:          const std::uint8_t * pt_data = reinterpret_cast<const std::uint8_t *>(&(*input_)[index]);  // Type 2. If you want to be pedantic uint8_t is not official way to access object internal bytes (but in practice uint8_t == unsigned char).

perception/tensorrt_yolo/lib/src/trt_yolo.cpp
276:  file.write(reinterpret_cast<const char *>(plan_->data()), plan_->size());  // Type 2

perception/tensorrt_yolo/lib/include/cuda_utils.hpp
80:  CHECK_CUDA_ERROR(::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(U) * n));  // Type 1
88:  CHECK_CUDA_ERROR(::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(T)));      // Type 1
113:  T * ptr = reinterpret_cast<T *>(workspace);  // Type 3A: workspace is aligned for type T
114:  workspace = reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(workspace) + size); // Type 1 and some legal pointer arithmetic

perception/tensorrt_yolo/lib/include/calibrator.hpp
187:    output.write(reinterpret_cast<const char *>(cache), length); // Type 2

perception/tensorrt_yolo/lib/src/plugins/nms.cu
149:    int * num_selected = reinterpret_cast<int *>(indices_sorted); // Unecessary cast, indices_sorted is already of type int*

perception/tensorrt_yolo/lib/src/plugins/yolo_layer_plugin.cpp
90:  *reinterpret_cast<T *>(buffer) = val;  // Type 3B: unaligned write -> should use memcpy instead
97:  val = *reinterpret_cast<const T *>(buffer); // Type 3B: unaligned write -> should use memcpy instead
160:  char * d = reinterpret_cast<char *>(buffer); // Type 1

perception/tensorrt_yolo/lib/src/plugins/nms_plugin.cpp
65:  *reinterpret_cast<T *>(buffer) = val;  // Type 3B: unaligned write -> should use memcpy instead
72:  val = *reinterpret_cast<const T *>(buffer);  // Type 3B: unaligned write -> should use memcpy instead

system/bluetooth_monitor/src/bluetooth_monitor.cpp
75:  ret = connect(socket_, reinterpret_cast<sockaddr *>(&address), sizeof(address));  // Type 1

system/bluetooth_monitor/service/l2ping_service.cpp
49:    socket_, SOL_SOCKET, SO_REUSEADDR, reinterpret_cast<char *>(&opt), (socklen_t)sizeof(opt)); // Type 1
86:    int new_sock = accept(socket_, reinterpret_cast<sockaddr *>(&client), &len); // Type 1

system/system_monitor/src/net_monitor/nl80211.cpp
37:  auto * rate = reinterpret_cast<float *>(arg); // Type 1.
124:  ret = nl_cb_set(cb_, NL_CB_VALID, NL_CB_CUSTOM, callback, reinterpret_cast<void *>(&bitrate_)); // Type 1

system/system_monitor/reader/msr_reader/msr_reader.cpp
104:    sock, SOL_SOCKET, SO_REUSEADDR, reinterpret_cast<char *>(&opt), (socklen_t)sizeof(opt)); // Type 1
137:    int new_sock = accept(sock, reinterpret_cast<sockaddr *>(&client), &len); // Type 1

system/system_monitor/reader/hdd_reader/hdd_reader.cpp
214:  info->serial_ = std::string(reinterpret_cast<char *>(data) + 20, 20); // Type 2
219:  info->model_ = std::string(reinterpret_cast<char *>(data) + 54, 40); // Type 2
388:  info->total_data_written_ = *(reinterpret_cast<uint64_t *>(&data[48])); // Type 3B: data is not aligned
392:  info->power_on_hours_ = *(reinterpret_cast<uint64_t *>(&data[128]));  // Type 3B: data is not aligned
536:    sock, SOL_SOCKET, SO_REUSEADDR, reinterpret_cast<char *>(&opt), (socklen_t)sizeof(opt)); // Type 1
568:    int new_sock = accept(sock, reinterpret_cast<sockaddr *>(&client), &len); // Type 1

system/system_monitor/test/src/hdd_monitor/test_hdd_monitor.cpp
185:  ThreadTestMode * mode = reinterpret_cast<ThreadTestMode *>(args); // Type 1
197:    sock, SOL_SOCKET, SO_REUSEADDR, reinterpret_cast<char *>(&opt), (socklen_t)sizeof(opt)); // Type 1
226:  int new_sock = accept(sock, reinterpret_cast<sockaddr *>(&client), &len); // Type 1

system/system_monitor/test/src/cpu_monitor/test_intel_cpu_monitor.cpp
179:  ThreadTestMode * mode = reinterpret_cast<ThreadTestMode *>(args); // is originally a ThreadTestMode
191:    sock, SOL_SOCKET, SO_REUSEADDR, reinterpret_cast<char *>(&opt), (socklen_t)sizeof(opt)); // Type 1
220:  int new_sock = accept(sock, reinterpret_cast<sockaddr *>(&client), &len); // Type 1

perception/lidar_centerpoint/lib/network/tensorrt_wrapper.cpp
151:  file.write(reinterpret_cast<const char *>(plan_->data()), plan_->size()); // Type 2
162:    reinterpret_cast<const void *>(engine_str.data()), engine_str.size())); // Type 1

perception/lidar_centerpoint/include/lidar_centerpoint/cuda_utils.hpp
80:  CHECK_CUDA_ERROR(::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(U) * n)); // Type 1
88:  CHECK_CUDA_ERROR(::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(T))); // Type 1
113:  T * ptr = reinterpret_cast<T *>(workspace); // Type 3A: workspace is aligned for type T
114:  workspace = reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(workspace) + size);  // Type 1 and some legal pointer arithmetic

common/cuda_utils/include/cuda_utils/cuda_unique_ptr.hpp
43:  CHECK_CUDA_ERROR(::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(U) * n)); // Type 1
51:  CHECK_CUDA_ERROR(::cudaMalloc(reinterpret_cast<void **>(&p), sizeof(T))); // Type 1
68:  CHECK_CUDA_ERROR(::cudaHostAlloc(reinterpret_cast<void **>(&p), sizeof(U) * n, flag)); // Type 1
76:  CHECK_CUDA_ERROR(::cudaHostAlloc(reinterpret_cast<void **>(&p), sizeof(T), flag)); // Type 1

common/tensorrt_common/src/tensorrt_common.cpp
99:    reinterpret_cast<const void *>(engine_str.data()), engine_str.size())); // Type 1
194:  file.write(reinterpret_cast<const char *>(data->data()), data->size()); // Type 2
196:  file.write(reinterpret_cast<const char *>(plan->data()), plan->size()); // Type 2

sensing/pointcloud_preprocessor/src/crop_box_filter/crop_box_filter_nodelet.cpp
131:      *reinterpret_cast<const float *>(&input->data[global_offset + x_offset]), // Type 3B: unknown data alignment
132:      *reinterpret_cast<const float *>(&input->data[global_offset + y_offset]), // Type 3B: unknown data alignment
133:      *reinterpret_cast<const float *>(&input->data[global_offset + z_offset]), 1); // Type 3B: unknown data alignment
152:        *reinterpret_cast<float *>(&output.data[output_size + x_offset]) = point[0]; // Type 3B: unknown data alignment
153:        *reinterpret_cast<float *>(&output.data[output_size + y_offset]) = point[1]; // Type 3B: unknown data alignment
154:        *reinterpret_cast<float *>(&output.data[output_size + z_offset]) = point[2]; // Type 3B: unknown data alignment

sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp
61:    input_ring_map[*reinterpret_cast<uint16_t *>(&input_ptr->data[idx + ring_offset])].push_back( // Type 3B: unknown data alignment
87:        *reinterpret_cast<float *>(&input_ptr->data[current_idx + azimuth_offset]);  // Type 3B: unknown data alignment
89:        *reinterpret_cast<float *>(&input_ptr->data[next_idx + azimuth_offset]); // Type 3B: unknown data alignment
94:        *reinterpret_cast<float *>(&input_ptr->data[current_idx + distance_offset]); // Type 3B: unknown data alignment
96:        *reinterpret_cast<float *>(&input_ptr->data[next_idx + distance_offset]); // Type 3B: unknown data alignment
107:            std::move(*reinterpret_cast<PointXYZI *>(&input_ptr->data[tmp_idx]))); // Type 4: unknown data alignment + most likely incompatible data type
118:          std::move(*reinterpret_cast<PointXYZI *>(&input_ptr->data[tmp_idx]))); // Type 4: unknown data alignment + most likely incompatible data type

sensing/pointcloud_preprocessor/include/pointcloud_preprocessor/outlier_filter/ring_outlier_filter_nodelet.hpp
51:    PointXYZI * front_pt = reinterpret_cast<PointXYZI *>(&input_ptr->data[tmp_indices.front()]); // Type 4: unknown data alignment + most likely incompatible data type
52:    PointXYZI * back_pt = reinterpret_cast<PointXYZI *>(&input_ptr->data[tmp_indices.back()]); // Type 4: unknown data alignment + most likely incompatible data type

sensing/pointcloud_preprocessor/include/pointcloud_preprocessor/passthrough_filter/passthrough_uint16.hpp
430:        reinterpret_cast<const std::uint8_t *>(&input_->points[(*indices_)[iii]]); // Type 2. If you want to be pedantic uint8_t is not official way to access object internal bytes (but in practice uint8_t == unsigned char).

sensing/tier4_pcl_extensions/include/tier4_pcl_extensions/voxel_grid_nearest_centroid_impl.hpp
175:      const std::uint8_t * pt_data = reinterpret_cast<const std::uint8_t *>(&input_->points[cp]); // Type 2. If you want to be pedantic uint8_t is not official way to access object internal bytes (but in practice uint8_t == unsigned char).
221:            &rgb, reinterpret_cast<const char *>(&input_->points[cp]) + rgba_index, sizeof(int)); // Type 2
272:            &rgb, reinterpret_cast<const char *>(&input_->points[cp]) + rgba_index, sizeof(int)); // Type 2

As explained, pattern 1 and 2 are valid, 3A is "UB but safe" so acceptable, however I think code with pattern 3B and 4 should be fixed. Even if the code "works" today, it may not be the same if new platforms, compiler or optimization flags are used.

ambroise-arm commented 1 year ago

Very interesting, thanks for the report.

With -O2 or -O3 the memcpy has no performance impact over the unsafe reinterpret_cast.

Do you have resources on this point? It is not obvious to me that it is true.

VRichardJP commented 1 year ago

It is the common consensus I have found on the internet: For example: https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8#how-do-we-type-pun-correctly

At a sufficient optimization level any decent modern compiler generates identical code to the previously mentioned reinterpret_cast method or union method for type punning.

The article give this example+%3B%0A%0Avoid+func1(+double+d+)%0A%7B%0A++++std::int64_t+n%3B%0A++++std::memcpy(%26n,+%26d,+sizeof+d)%3B+//+OK%0A++++printf(+%22%25%22+PRId64+%22%5Cn%22,+n+)+%3B%0A%7D%0A%0Avoid+func2(+double+d+)%0A%7B%0A++++union+u1%0A++++%7B%0A++++++std::int64_t+n%3B%0A++++++double+d+%3B%0A++++%7D+%3B%0A++%0A++++u1+u+%3B%0A++++u.d+%3D+d+%3B%0A++++printf(+%22%25%22+PRId64+%22%5Cn%22,+u.n+)+%3B%0A%7D%0A%0Avoid+func3(+double+d+)%0A%7B%0A++++std::int64_t+n+%3B%0A++++n+%3D+reinterpret_cast%3Cstd::int64_t+%3E(%26d)+%3B%0A++++%0A++++printf(+%22%25%22+PRId64+%22%5Cn%22,+n+)+%3B%0A%7D%0A%0Aint+main()%0A%7B%0A++double+d+%3D+0.1%3B%0A++%0A++func1(+d+)+%3B%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:clang_trunk,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'1',trim:'1'),lang:c%2B%2B,libs:!(),options:'-std%3Dc%2B%2B17+-O2',source:1),l:'5',n:'0',o:'x86-64+clang+(trunk)+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:50,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4)

This other article give a more advanced use case: http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html

These days, however, it doesn’t have to be implemented as a procedure call at all; the compiler can treat it as an intrinsic, and replace with a memory-to-register load. The GCC definitely does. It compiles the following code into the code similar to the original SSE one, except it uses movdqu instead of movdqa. This instruction allows misaligned data; its performance varies. On some processors it is much slower than movdqa even when the data is actually aligned. On others it runs at virtually the same speed.

Edit: In this PR, I have replaced a few reinterpret_cast by memcpy. I have disassembled reported the hottest spot in the function. This is a memcpy that have been translated to x86 movups instructions (move unpacked 512 bits). I would not be surprised if the reinterpret_cast i changed was initially compiled with a movaps (move aligned).

I don't think we are in need of performance to the point we have to care about such platform specific details. Program correctness is way more important than micro performance improvement.

ambroise-arm commented 1 year ago

Thanks!

I don't think we are in need of performance to the point we have to care about such platform specific details. Program correctness is way more important than micro performance improvement.

Absolutely, I wasn't trying to argue the rational for the change.