RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Python and CPP Split Versioning #15760

Closed gbalke closed 3 years ago

gbalke commented 3 years ago

I ran into a bug where my cpp code was loading one version of drake's libs (from a binary install) while python was loading another (from a local build). This caused some very strange behavior: world_frame_id was different between the python and cpp code. This may be a rare issue, but it might be worth putting a mechanism in place that checks that it isn't occurring. Just thought I should put it out there in case anyone else runs into the same issue!
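One rough way such a check could look on Linux is to scan the process's memory map for the same library mapped from more than one path. This is only a sketch under assumptions: it relies on /proc/self/maps being available, it assumes the library file name starts with "libdrake", and the function names are hypothetical, not part of drake.

```python
import os
from collections import defaultdict


def find_duplicate_libs(maps_text, name_prefix="libdrake"):
    """Return {basename: sorted paths} for libraries mapped from more than one path.

    `maps_text` is the content of a /proc/<pid>/maps file. `name_prefix` is an
    assumption about how the shared library is named.
    """
    paths_by_name = defaultdict(set)
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (may be absent).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        base = os.path.basename(path)
        if base.startswith(name_prefix) and ".so" in base:
            paths_by_name[base].add(path)
    # Keep only libraries that were loaded from two or more distinct locations.
    return {b: sorted(p) for b, p in paths_by_name.items() if len(p) > 1}


def check_current_process(name_prefix="libdrake"):
    """Run the check against the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return find_duplicate_libs(f.read(), name_prefix)
```

Calling check_current_process() from the python script after the cpp extension has been imported would return a non-empty dict if two copies were mapped into the same process. It would not detect a mismatch between two separate processes.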

A minimal demo wouldn't be very small or portable, so I'll do my best to describe how to reproduce the issue and give an example output.

Environment

To reproduce, install drake from a binary release, build drake separately, and set PYTHONPATH to reference the python bindings built there.

In my case, my .bashrc had the following before:

export PYTHONPATH=/home/ubuntu/drake/drake-build/install/lib/python3.8/site-packages:${PYTHONPATH}
export drake_DIR=/home/ubuntu/drake/drake-build/install/lib/cmake/drake

and things started working when I switched to:

export PYTHONPATH=/home/ubuntu/drake-binary/drake/lib/python3.8/site-packages:${PYTHONPATH}
export drake_DIR=/home/ubuntu/drake-binary/drake/lib/cmake/drake

I believe the former directs python to load the built libraries and cmake to use the built libraries, but at runtime something in my system configuration was directing the CPP code to load the drake-binary libs instead.

In the latter, python loads the same libs as the CPP executable so there's no independent behavior.
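As a small illustration of checking that these variables agree, here is a sketch that extracts the install prefix from drake_DIR and from any PYTHONPATH entry that mentions drake. The helper names and the "everything before the first lib directory" heuristic are assumptions made for this example. Note that in the failing configuration above both variables already pointed at the same build, so a check like this only catches the simpler case where the two variables disagree with each other.

```python
import os


def install_prefix(path, marker="lib"):
    """Return the part of `path` before the first `marker` directory, or None."""
    parts = os.path.normpath(path).split(os.sep)
    if marker not in parts:
        return None
    return os.sep.join(parts[: parts.index(marker)])


def drake_prefixes(environ):
    """Collect install prefixes named by drake_DIR and drake-like PYTHONPATH entries."""
    prefixes = set()
    drake_dir = environ.get("drake_DIR")
    if drake_dir:
        prefixes.add(install_prefix(drake_dir))
    for entry in environ.get("PYTHONPATH", "").split(os.pathsep):
        # Heuristic: only look at entries whose path mentions drake.
        if entry and "drake" in entry:
            prefixes.add(install_prefix(entry))
    prefixes.discard(None)
    return prefixes


if __name__ == "__main__":
    found = drake_prefixes(os.environ)
    if len(found) > 1:
        print("Warning: environment references multiple drake installs:", sorted(found))
```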

An alternative theory is that my application uses pybind11 to make its own CPP bindings and that somehow linked to another version of drake.

Example Error

The exact issue was that the cpp code would return the max frame id + 1 as the world frame id. When loading an iiwa, world_frame_id would return 1 in python (as it should) and 13 in the cpp code, even though there was no registered frame 13 in the scene.

My scene is declared within the python script, which then calls a publisher LeafSystem written in CPP. That system retrieves the inspector for the scene from the port with the given context.

From CPP inspector:

World frame id 13
Getting child frame name for id 11
Publish frame kuka_iiwa0_0/iiwa_link_ee_kuka
Getting child frame name for id 10
Publish frame kuka_iiwa0_0/iiwa_link_7
Getting child frame name for id 8
Publish frame kuka_iiwa0_0/iiwa_link_5
Getting child frame name for id 7
Publish frame kuka_iiwa0_0/iiwa_link_4
Getting child frame name for id 6
Publish frame kuka_iiwa0_0/iiwa_link_3
Getting child frame name for id 5
Publish frame kuka_iiwa0_0/iiwa_link_2
Getting child frame name for id 4
Publish frame kuka_iiwa0_0/iiwa_link_1
Getting child frame name for id 3
Publish frame kuka_iiwa0_0/iiwa_link_0
Getting child frame name for id 9
Publish frame kuka_iiwa0_0/iiwa_link_6
Getting child frame name for id 2
Publish frame kuka_iiwa0_0/base
Getting child frame name for id 12
Publish frame kuka_iiwa0_0/iiwa_link_ee
Getting child frame name for id 1
Publish frame world

From Python:

Scene graph world frame id <FrameId value=1>

jwnimmer-tri commented 3 years ago

While the root cause in this issue is most likely to be a configuration error in the environment variables, we have also recently changed how we define the global storage for the Identifier counters, which will probably resolve the symptom here in any case. I'll say let's close this as fixed by #15857.