bit-bots / bitbots_lowlevel

MIT License
4 stars 3 forks source link

Problems after startup and error of 'malloc_consolidate(): invalid chunk size' #113

Closed Bozenton closed 1 year ago

Bozenton commented 2 years ago

Hi there, thank you very much for your sharing of the projects!

What I'm trying to do

I am using your package bitbots_ros_control to control my servo. I use three kinds of Dynamixel:

And I am using the ROS1 version, to be exactly, I use this version: https://github.com/bit-bots/bitbots_lowlevel/tree/125793cd2924db4069c65ee2b620e1f6c78ded0c I start with launching the file ros_control.launch, and then I use send_joint_command.py to send commands and hope that each servo can go to position 0 smoothly. However, sometimes some servos would move rapidly when receiving the first command after startup. It seems that those lines does not work: https://github.com/bit-bots/bitbots_lowlevel/blob/125793cd2924db4069c65ee2b620e1f6c78ded0c/bitbots_ros_control/src/servo_bus_interface.cpp#L242-L250

What's more, when I keep sending commands, the node of ros_control might run into some problems:

malloc_consolidate(): invalid chunk size
[ros_control-2] process has died [pid 7994, exit code -6, cmd /home/xiaobaige/brbots-2023/devel/lib/brbots_ros_control/node __name:=ros_control __log:=/home/xiaobaige/.ros/log/22dcd306-4d5e-11ed-9bf5-d0abd5e70009/ros_control-2.log].
log file: /home/xiaobaige/.ros/log/22dcd306-4d5e-11ed-9bf5-d0abd5e70009/ros_control-2*.log

or

corrupted size vs. prev_size
[ros_control-2] process has died [pid 3643, exit code -6, cmd /home/xiaobaige/brbots-2023/devel/lib/brbots_ros_control/node __name:=ros_control __log:=/home/xiaobaige/.ros/log/0353aa76-4d5b-11ed-9bf5-d0abd5e70009/ros_control-2.log].
log file: /home/xiaobaige/.ros/log/0353aa76-4d5b-11ed-9bf5-d0abd5e70009/ros_control-2*.log

There is not too much information about this error and I don't know how to address it.

What I've tried

I have already tried to set the latency_timer but it didn't solve the problems.

Additional context

I am using just one USB port and the baud of servos is 2M. And I notice that you didn't use MX-28-2 in your projects (I use Position mode so I think MX-28-2 may not cause too much trouble).

Hope for your help sincerely.

SammyRamone commented 2 years ago

Hi can you post the content of your config file (the equivalent to https://github.com/bit-bots/bitbots_lowlevel/blob/125793cd2924db4069c65ee2b620e1f6c78ded0c/bitbots_ros_control/config/wolfgang.yaml)?

We never used the MX-28 because we were not able to flash the new firmware version with protocol 2 on our old MX-28 servos. Therefore, I don't know if there is some issues with them, but I would not expect any issues. Still, you could try to only control the MX-64 and MX-106 servos to make sure that this is not the source of the problem.

We did not have issues with the robot moving rapidly during startup. The idea of the lines of code that you mentioned, is that the software first reads the current position of the servos and remembers this as current goal position. This is necessary as the software will write the goal positions of all servos at each cycle. During startup it will not know any valid goal position (as it did not yet received any ROS message with it), therefore it would send the goal 0 to all servos (which would lead to them moving to this position). The implementation should only make the servos move once the first ROS message with goal positions is received. If I understand your issue correctly, the abrupt moving of the servos is only happening after you use the send_joint_command.py. Therefore, this code is not the issue.

The send_joint_command.py uses a velocity of 5 to go to the goal position https://github.com/bit-bots/bitbots_lowlevel/blob/125793cd2924db4069c65ee2b620e1f6c78ded0c/bitbots_ros_control/scripts/send_joint_command.py#L36 You can change this to move slower. You can also adapt the acceleration value (currently set to maximum) to limit accelerations.

You also probably need to adapt the send_joint_command.py to the names of the joints of your robots. Did you do this? Otherwise this would maybe explain the "malloc_consolidate(): invalid chunk size" and "corrupted size vs. prev_size" errors that you get.

What you can also try is to only launch the ros_control without sending commands. Then verify that you correctly read all servos with correct positions (rostopic echo /joint_states). If this fails, there is probably something wrong in your config.

Bozenton commented 2 years ago

Thank you very much for your reply! I have tried your methods in your suggestions. And I did some experiment and got some conclusions, but the problem is still unsolved.

What I've tried

1. Substitute MX-64 for MX-28

All the servos in my robot now are MX-106 and MX-64, but the error of 'malloc_consolidate(): invalid chunk size' still exist. So I think you are right that my problems are not caused by MX-28.

2. Try to deal with the abrupt abrupt moving of the servos

Yes you're right, the abrupt moving of the servos is only happening after I use the send_joint_command.py (without any arg). After several trials, I found some strange phenomena:

  1. I tried a group of new servos (MX-106 and MX-64) and used the same code. I found that after startup if I give 0 goal position in joint command like what you did: https://github.com/bit-bots/bitbots_lowlevel/blob/125793cd2924db4069c65ee2b620e1f6c78ded0c/bitbots_ros_control/scripts/send_joint_command.py#L45 my servos (which were not in 0.0 position before receiving the command) would move suddenly for some angles (like 20 degrees) and then move very very slowly to 0.0 position. But if I gave a non-zero position after startup, or gave any goal position (including 0 position) if there had already been some commands sent, those servos would behave normally.
  2. But as for my old (not too old, just compared with the above new servos) servos, neither 0.0 goal position or non-zero goal position would make them behave well, there was still abrupt moving after receiving the first command.

I am wondering whether these phenomena have something to do with these lines: https://github.com/bit-bots/bitbots_lowlevel/blob/125793cd2924db4069c65ee2b620e1f6c78ded0c/bitbots_ros_control/src/servo_bus_interface.cpp#L308-L311

My understanding is: since goal position is sent only when it is not the same with last goal position, and I found that after startup the goal position will be set as 0.0 (I add a ROS_INFO after syncWritePosition();in line 309 and print the goal position in command line), so if I use 0.0 goal position right after startup, the goal position still equals last position and the condition of if can not be satisfied, thus only commands of velocity, acceleration and effort are sent to the servos, which, I think, may cause the phenomenon 1. But this explanation can not explain what happened in phenomenon 2 because even if I gave non-zero goal position there was still abrupt moving.

3. Change the number of USB ports.

At first I used only one USB port to control 18 servos. Later I used two ports but the error of 'malloc_consolidate(): invalid chunk size' still exist.

4. Change the number of servos.

I tried to use less servos. I found that

5. Change the frequency of sending commands.

I tried 18 servos controlled by 2 USB ports, and used send_joint_command.py with arg --once to send commands manually: https://github.com/bit-bots/bitbots_lowlevel/blob/125793cd2924db4069c65ee2b620e1f6c78ded0c/bitbots_ros_control/scripts/send_joint_command.py#L78

I found that if the interval between two sending are long enough (for example, 10 sec), the error of 'malloc_consolidate(): invalid chunk size' and 'corrupted size vs. prev_size' would not appear (at least during my experiments). But if I sent it more frequently (like once per sec), one of those error would appear after several sending (by the way, there was still abrupt moving).

What's more

I found that now your repo does not use dynamixel_controller any more. I wonder why you give it up and if there would be some relationship between it and my bugs.

Overview of our hardware arrangement:

graph LR
A(USB port of PC) --> B(U2D2 bought from Robotis) --> C(splitter) --> D1(servo 1)
C--> D2(servo 2)
C --> D3(...)
C--> D4(servo 18)
E(6s LiPo battery) --> F(24V-12V converter) --> C

Thank you very much for your attention and hope for your help sincerely:pray:. (By the way I am not native English speaker and I hope that my expression would not cause too much misunderstanding:joy:)

Additional files

Here are my forked code and hope it might help. brbots_ros_control.zip

SammyRamone commented 2 years ago

Okay nice set of experiments. Maybe we can find the problem easier with this information.

1. Substitute MX-64 for MX-28

All the servos in my robot now are MX-106 and MX-64, but the error of 'malloc_consolidate(): invalid chunk size' still exist. So I think you are right that my problems are not caused by MX-28.

Okay, so it is as expected, but still good to know.

2. Try to deal with the abrupt abrupt moving of the servos

My understanding is: since goal position is sent only when it is not the same with last goal position, and I found that after startup the goal position will be set as 0.0 (I add a ROS_INFO after syncWritePosition();in line 309 and print the goal position in command line), so if I use 0.0 goal position right after startup, the goal position still equals last position and the condition of if can not be satisfied, thus only commands of velocity, acceleration and effort are sent to the servos, which, I think, may cause the phenomenon 1. But this explanation can not explain what happened in phenomenon 2 because even if I gave non-zero goal position there was still abrupt moving.

Ah yes good catch! That could lead to an issue during start up. Since we no longer maintain the ROS1 version of the code, you would need to fix it on your own fork, but it is easy to do. Just add some boolean variable to see if you have initialized the servos yet. We will probably also need to fix it for our ROS2 version.

Do all of your servos have the newest firmware version? Maybe that makes the difference between them (use the Dynamixel Wizard from Robotis to see which version it is and to upgrade it). We have all of our servos on the newest firmware version, so this could be the reason why we did not encounter this issue. Also we usually start the robot with a initial pose that has non 0 joint positions.

3. Change the number of USB ports.

At first I used only one USB port to control 18 servos. Later I used two ports but the error of 'malloc_consolidate(): invalid chunk size' still exist.

I also would expect that this does not change anything, but good to know.

4. Change the number of servos.

I tried to use less servos. I found that

* if I used 16 servos, there would seldom be the error of 'malloc_consolidate';

* if I used 17 servos, sometimes the error would appear;

* if I used all the servos (18), then it depended on the frequency of sending commands, which is described in the next subsection.

Okay, that is a bit weird. It sounds a bit like the source for the malloc error is maybe related to a race condition. Since the time it takes to write to the servos depends on the number, it may change the behavior. On the other hand, as far as I understand, you send always the same goal position, so it would only really write this to the bus once.

5. Change the frequency of sending commands.

I tried 18 servos controlled by 2 USB ports, and used send_joint_command.py with arg --once to send commands manually:

https://github.com/bit-bots/bitbots_lowlevel/blob/125793cd2924db4069c65ee2b620e1f6c78ded0c/bitbots_ros_control/scripts/send_joint_command.py#L78

I found that if the interval between two sending are long enough (for example, 10 sec), the error of 'malloc_consolidate(): invalid chunk size' and 'corrupted size vs. prev_size' would not appear (at least during my experiments). But if I sent it more frequently (like once per sec), one of those error would appear after several sending (by the way, there was still abrupt moving).

That is also super weird. We usually send 20 joint goals with >500Hz without any issues. It sounds like some thread is running really slow, but this does not make sense.

What's more

I found that now your repo does not use dynamixel_controller any more. I wonder why you give it up and if there would be some relationship between it and my bugs.

When we switched from ROS 1 to ROS 2 the ros_control package (on which this is based) was not fully supported in ROS 2 yet. Therefore, our ROS2 version does currently not use the ros_control framework and therefore there was no need for the dynamixel_controller. But we did not have any issues with it before. It is only due to the switch to ROS 2.

Overview of our hardware arrangement:

That looks fine and any communication issue with the servos should not lead to a malloc issue. We did not use the U2D2 but it has a very similar FTDI usb-to-serial chip like the one that we use in our custom board. Therfore, I don't think that this is the issue.

Thank you very much for your attention and hope for your help sincerelypray. (By the way I am not native English speaker and I hope that my expression would not cause too much misunderstandingjoy)

Your welcome. Your English is very understandable, don't worry.

Additional files

Here are my forked code and hope it might help. brbots_ros_control.zip

The only real difference that I spot is, that we define in the config file all mounting_offsets and joint_offset explicitly as 0. But if it is not defined it will be automatically set to 0 by the code. So this should be no problem. You could try to deactivate read_velocity, read_effort and read_volt_temp to try out if this changes the issue but I don't think that it will help. The problem seems to be related to getting commands and not reading the servos. What you could try is to just directly return in this method (so don't do any actual writing to the bus). https://github.com/bit-bots/bitbots_lowlevel/blob/e1c46ff2f1651a02b03facb598ea844220f0fcc6/bitbots_ros_control/src/servo_bus_interface.cpp#L273 Then you could see if the malloc issue is comming from the ROS side (due to the callbacks) or from the dynamixel side (due to writing on the bus).

This is also probably not the issue, but did you make sure that you used the other packages described in https://github.com/bit-bots/bitbots_lowlevel/blob/ros1/bitbots_ros_control/README.md also with the version tagged as ROS1?

You can also try to include backwards_ros to get a better error message. This often provides better debug information https://github.com/pal-robotics/backward_ros Otherwise, the only further option would be to use typical C++ tools like Valgrind and GDB. http://wiki.ros.org/roslaunch/Tutorials/Roslaunch%20Nodes%20in%20Valgrind%20or%20GDB

Bozenton commented 2 years ago

Thanks for your suggestions! In the past few days I have tried what you've said and learned a lot. Recently I found that maybe there is something wrong with our hardware -- the board of 24V-12V converter. The input voltage would fluctuate when the servo was loaded, which, I supposed, might result in those weird errors. We modified it and now the error does not appear (at least temporarily). I will keep testing and if everything goes well in the future (hope so :pray: ), I would close this issue. Thanks a lot for your sharing of the repository and patient help. I learned quite a lot about ROS skills here.

SammyRamone commented 2 years ago

Okay, this would make sense as the servos would "reboot" and loose the information that was send to them. By the way, we build this board that provides a constant voltage to our servos. Might be of interest to you. You're welcome :slightly_smiling_face:

Bozenton commented 2 years ago

This problem still exists :cry:

Can I have a look at your configuration of servos (MX-64 and MX-106) in Dynamixel Wizard? Maybe there is something wrong with my configuration:

2022-11-03 14-50-00 的屏幕截图 2022-11-03 14-50-10 的屏幕截图 2022-11-03 14-50-14 的屏幕截图

SammyRamone commented 2 years ago

You have set a homing offset for the servo. The manual says that present_position = actual_position + homing offset (see link). Maybe that leads to this movements because the software will use the present position on startup for the first goal :shrug:

Bozenton commented 2 years ago

Sorry I did not express myself clearly :cry: I mean the error of 'malloc_consolidate(): invalid chunk size' and 'corrupted size vs. prev_size' still exists.

SammyRamone commented 2 years ago

I am a bit clueless about this and have no further ideas where this problem could come from. I think you need to debug the malloc issue with the standard C++ tools to see where it comes from.

Bozenton commented 1 year ago

Sorry for late reply.

We changed the lib of dynamixel-workbench to version 2.0 and modified some codes in lowlevel. Then the error disappears and after several days of testing, I believe my problem is solved. Maybe those errors are caused by my configuration of the servos, my usage of U2D2, or something weird. But, at least now, they are not important. Thanks for your help and now I think you can close this issue. :smiley: