ApolloAuto / apollo

An open autonomous driving platform
Apache License 2.0
25.07k stars 9.68k forks source link

read buffer overflow and drop message when cyber_recorder record #15282

Open xhzzhang opened 8 months ago

xhzzhang commented 8 months ago

Describe the bug Hey, I found when launch camera(1920x1080 yuv), one of serval record files will has drop 2~12 seconds.

To Reproduce Steps to reproduce the behavior:

  1. use orin 32G and installed ubuntu 20.04
  2. use apollo 8.0 source code, and install docker dev-aarch64-20.04-20231024_1054
  3. cyber_launch start module/drivers/camera/launch/front_6mm or other camera launch file.
  4. see data/log/cyberrecorder***

Expected behavior record works fine and completed.

Screenshots please see the log message.

Desktop (please complete the following information):

OS: [e.g. iOS] ubuntu 20.04
Browser [e.g. chrome, safari] firefox
Version [e.g. 22]

Additional context Below is log message:

Log file created at: 2024/01/07 15:40:50 Running on machine: in-dev-docker Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg I0107 15:40:50.759305 11829 global_data.cc:153] host ip: 192.168.1.156 W0107 15:40:50.760537 11829 scheduler_factory.cc:63] Scheduler conf named /apollo/cyber/conf/cyber_recorder_11829.conf not found, use default. I0107 15:40:50.760813 11831 processor.cc:42] processor_tid: 11831 I0107 15:40:50.777197 11832 processor.cc:42] processor_tid: 11832 I0107 15:40:50.777417 11833 processor.cc:42] processor_tid: 11833 I0107 15:40:50.781282 11834 processor.cc:42] processor_tid: 11834 I0107 15:40:50.785274 11835 processor.cc:42] processor_tid: 11835 I0107 15:40:50.811200 11836 processor.cc:42] processor_tid: 11836 I0107 15:40:50.811486 11837 processor.cc:42] processor_tid: 11837 I0107 15:40:50.813271 11838 processor.cc:42] processor_tid: 11838 I0107 15:40:50.829242 11839 processor.cc:42] processor_tid: 11839 I0107 15:40:50.829638 11840 processor.cc:42] processor_tid: 11840 I0107 15:40:50.829968 11841 processor.cc:42] processor_tid: 11841 I0107 15:40:50.830096 11842 processor.cc:42] processor_tid: 11842 I0107 15:40:50.830209 11843 processor.cc:42] processor_tid: 11843 I0107 15:40:50.830271 11844 processor.cc:42] processor_tid: 11844 I0107 15:40:50.830346 11845 processor.cc:42] processor_tid: 11845 I0107 15:40:50.840992 11846 processor.cc:42] processor_tid: 11846 I0107 15:40:50.841241 11829 init.cc:113] Register exit handle succ. I0107 15:40:51.147177 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/planning I0107 15:40:51.156020 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/livox/lidar_n/CustomMsg I0107 15:40:51.161494 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/monitor I0107 15:40:51.179752 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/tf I0107 15:40:51.184409 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/gnss_status I0107 15:40:51.188345 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/ins_status I0107 15:40:51.192723 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/best_pose I0107 15:40:51.196694 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/corrected_imu I0107 15:40:51.199668 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/ins_stat I0107 15:40:51.202936 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/rtk_eph I0107 15:40:51.207324 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/rtk_obs I0107 15:40:51.212024 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/heading I0107 15:40:51.216632 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/imu I0107 15:40:51.220008 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/odometry I0107 15:40:51.224087 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/stream_status I0107 15:40:51.227181 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/raw_data I0107 15:40:51.229889 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/rtcm_data I0107 15:40:51.233539 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/canbus/chassis_detail I0107 15:40:51.237504 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/canbus/chassis I0107 15:40:51.242409 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/localization/pose I0107 15:40:51.271842 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/perception/obstacles I0107 15:40:51.274420 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/gnss/cust/imu I0107 15:40:51.280233 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/control I0107 15:40:51.283149 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/camera/front_6mm/image I0107 15:40:51.291419 11852 scheduler.cc:55] create croutine: cyber_recorder_record11829/apollo/sensor/livox/lidar_360/CustomMsg W0107 15:41:41.821874 11846 channel_buffer.h:70] channel[/apollo/control] read buffer overflow, drop_message[101] pre_index[3669] current_index[3770] W0107 15:42:21.469127 11837 channel_buffer.h:70] channel[/apollo/sensor/camera/front_6mm/image] read buffer overflow, drop_message[138] pre_index[718] current_index[856] W0107 15:42:21.469300 11831 channel_buffer.h:70] channel[/apollo/sensor/gnss/cust/imu] read buffer overflow, drop_message[2904] pre_index[15135] current_index[18039] W0107 15:42:21.469463 11832 channel_buffer.h:70] channel[/apollo/control] read buffer overflow, drop_message[562] pre_index[6707] current_index[7269] W0107 15:42:21.469538 11842 channel_buffer.h:70] channel[/apollo/canbus/chassis] read buffer overflow, drop_message[1451] pre_index[7571] current_index[9022] W0107 15:42:21.469605 11844 channel_buffer.h:70] channel[/apollo/localization/pose] read buffer overflow, drop_message[89] pre_index[463] current_index[552] W0107 15:42:21.469666 11846 channel_buffer.h:70] channel[/apollo/perception/obstacles] read buffer overflow, drop_message[144] pre_index[758] current_index[902] W0107 15:42:21.469736 11833 channel_buffer.h:70] channel[/apollo/planning] read buffer overflow, drop_message[144] pre_index[759] current_index[903] W0107 15:42:21.469765 11836 channel_buffer.h:70] channel[/apollo/sensor/gnss/raw_data] read buffer overflow, drop_message[157] pre_index[831] current_index[988] W0107 15:42:21.469818 11843 channel_buffer.h:70] channel[/apollo/sensor/gnss/odometry] read buffer overflow, drop_message[143] pre_index[759] current_index[902] W0107 15:42:21.469856 11838 channel_buffer.h:70] channel[/tf] read buffer overflow, drop_message[143] pre_index[759] current_index[902] W0107 15:42:21.469930 11845 channel_buffer.h:70] channel[/apollo/sensor/gnss/ins_stat] read buffer overflow, drop_message[157] pre_index[831] current_index[988] W0107 15:42:21.470189 11840 channel_buffer.h:70] channel[/apollo/sensor/livox/lidar_360/CustomMsg] read buffer overflow, drop_message[110] pre_index[628] current_index[738] W0107 15:42:21.473907 11839 channel_buffer.h:70] channel[/apollo/sensor/livox/lidar_n/CustomMsg] read buffer overflow, drop_message[121] pre_index[657] current_index[778] W0107 15:42:47.005381 11844 channel_buffer.h:70] channel[/apollo/control] read buffer overflow, drop_message[137] pre_index[9987] current_index[10124] W0107 15:42:47.005589 11834 channel_buffer.h:70] channel[/apollo/sensor/gnss/cust/imu] read buffer overflow, drop_message[55] pre_index[23091] current_index[23146]

IvDmNe commented 8 months ago

Maybe you have reached maximum writing speed on your disk. As I remember, writing 1920x1080 images at 20 FPS generates around 5 x 20 = 100 MB/s. I'd advise you to check speed limits on your disk. Also, you can record compressed images instead of raw ones

xhzzhang commented 8 months ago

Thanks for your reply, I write to local 32G eMMC, I test speed has 500MB/s: nvidia@nvidia-desktop:~$ dd if=/dev/zero of=1.iso bs=4096 count=102400 102400+0 records in 102400+0 records out 419430400 bytes (419 MB, 400 MiB) copied, 0.810827 s, 517 MB/s

kaoya722 commented 8 months ago

How about your msg frequency? Is there abnormal with cyber_recorder deactivate? My system is Ubuntu1804 on x86_64. Same info "read buffer overflow" is presented in log, but no influence for message frequency (about 10Hz).

xhzzhang commented 8 months ago

I tested serval times to cyber_record with camera image channel(1080p, 10fps), everytime got 'read buffer overflow', sometimes lost 2 seconds message, somtimes lost 10 seconds message, now I can't record camera image channel.

kaoya722 commented 8 months ago

Did you use "build_opt_gpu" option to accelerate when compile?

xhzzhang commented 8 months ago

In orin, we use build_opt_nvidia to compile, Is it use build_opt_gpu to compile for jetson orin?

IvDmNe commented 8 months ago

Maybe increasing log level to Debug will get you some useful info. Also, ensure that no other processes are writing something big to the disk at the same time. I still guess that it can be caused by disk.

And do you see stable fps on camera topic in cyber_monitor while recording? To be sure that driver outputs the desired frequency

xhzzhang commented 8 months ago

Thanks for you cliealy reply, later will check it, then will reply the result.

zhanglonggao commented 8 months ago

I have also encountered the same issue. Have you solved it?

xhzzhang commented 8 months ago

not yet, just some debug message, maybe need apollo team to fix it.

daohu527 commented 8 months ago

According to the error message, the message was not read in time, resulting in message loss. It can be judged that the message was not written in time.

https://github.com/ApolloAuto/apollo/blob/3859bee50e92d81e94f64ec4a3b3d4241978e8f5/cyber/tools/cyber_recorder/recorder.cc#L253

I checked the code writing is asynchronous. Therefore, we need to confirm where the bottleneck is. We also encountered the problem of insufficient write performance.

xhzzhang commented 6 months ago

Hello, @daohu527 Is anything update about this issue?

leizi111 commented 3 months ago

你好,我看你录制的内容里面有mid360的数据,想请教一下在apollo中接入mid-360中有没有开源的驱动包,我看了livox官方sdk里面只有livox_hub的,没有livox_lidar的