UM-ARM-Lab / pytorch_kinematics

Robot kinematics implemented in pytorch
MIT License

Trying to speed up FK #26

Closed PeterMitrano closed 9 months ago

PeterMitrano commented 10 months ago

This is a pretty big PR but the main focus is on improving speed. Other changes:

powertj commented 10 months ago

When pip installing, this uses the most recent version of pytorch (2.1.0 at the time of writing). This has to be compatible with the pytorch version the user has installed. I have pytorch version 2.0.1 and get the following error when importing zpk_cpp.

import zpk_cpp
ImportError: /home/tpower/dev/research/constrained_cai/venv/lib/python3.8/site-packages/zpk_cpp.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c10ltERKNS_6SymIntEi

Solved when editing the pyproject.toml to require pytorch version 2.0.1.

powertj commented 10 months ago

Can no longer wrap chain.forward_kinematics with torch.func.vmap. The issue appears to be due to in-place operations in axis_and_angle_to_matrix in pk.cpp. Any idea how much time is saved by doing these in-place vs using torch::cat/torch.stack?
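To illustrate the trade-off being asked about, here is a minimal sketch of an out-of-place axis-angle conversion that builds the matrix with torch.stack instead of writing into a preallocated tensor. It is not the code from pk.cpp; the function name and signature below are assumptions made for illustration, and the axis is assumed to be a unit vector.

```python
import math

import torch


def axis_angle_to_matrix_stacked(axis, angle):
    """Rodrigues' formula without in-place writes (hypothetical helper).

    axis:  tensor of shape (3,), assumed to be a unit vector
    angle: scalar tensor (a batch of angles is handled by vmap)
    """
    x, y, z = axis.unbind(-1)
    c = torch.cos(angle)
    s = torch.sin(angle)
    t = 1 - c
    # Build all nine entries and stack them; nothing is assigned into an
    # existing tensor, so vmap has no in-place write to reject.
    entries = torch.stack([
        t * x * x + c,     t * x * y - s * z, t * x * z + s * y,
        t * x * y + s * z, t * y * y + c,     t * y * z - s * x,
        t * x * z - s * y, t * y * z + s * x, t * z * z + c,
    ], dim=-1)
    return entries.reshape(angle.shape + (3, 3))


# Batch over the angle only; the axis is shared across the batch.
z_axis = torch.tensor([0.0, 0.0, 1.0])
angles = torch.tensor([0.0, math.pi / 2, math.pi, -math.pi / 2])
batched = torch.func.vmap(axis_angle_to_matrix_stacked, in_dims=(None, 0))(z_axis, angles)
```

Whether this is slower than the in-place version would have to be measured; the sketch only shows that the stacked form composes with torch.func.vmap.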

PeterMitrano commented 9 months ago

The torch version issues Tom mentioned are going to be a huge pain, so I'm checking whether torch.compile can give us a speedup comparable to the C++ implementation. If it can, I will drop the C++ entirely.

PeterMitrano commented 9 months ago

@powertj can you try this again? I've removed the C++ code since even without that it's still faster, and I'm curious if that also fixes your vmap issues. Make sure you fully clean pytorch_kinematics and zpk out of your environment first.

powertj commented 9 months ago

@PeterMitrano when I run test_kinematics.py I get 0.003 seconds for N=1000 FK lookups for the old method and 0.048 seconds for this branch. I copied over the changes to the test script so they should be the same. Any idea why this is the case?

PeterMitrano commented 9 months ago

> @PeterMitrano when I run test_kinematics.py I get 0.003 seconds for N=1000 FK lookups for the old method and 0.048 seconds for this branch. I copied over the changes to the test script so they should be the same. Any idea why this is the case?

~On my machine with a 1080ti:~
~OLD: elapsed 0.1308739185333252s for N=1000 when parallel~
~NEW: elapsed 0.014335598703473807s for N=1000 when parallel~

The test was being spooky because, after moving a Chain to the GPU, the first 2-3 calls to FK are about 100x slower. I've updated the test to ignore these calls in the timing.
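A minimal sketch of that kind of timing, assuming a generic helper rather than the actual code in test_kinematics.py: run a few untimed warm-up calls, then average over the remaining ones. The helper name, its parameters, and the fake_fk stand-in are all hypothetical.

```python
import time


def time_after_warmup(fn, warmup=3, repeats=10, sync=None):
    """Return the mean seconds per call of fn(), excluding warm-up calls.

    sync is an optional callable run before each clock read. On a GPU,
    kernels launch asynchronously, so passing torch.cuda.synchronize here
    makes the clock wait for queued work to finish.
    """
    def _sync():
        if sync is not None:
            sync()

    # Warm-up calls absorb one-time costs and are not timed.
    for _ in range(warmup):
        fn()
    _sync()

    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    _sync()
    return (time.perf_counter() - start) / repeats


# Stand-in for an FK call whose first three invocations are slow.
calls = []

def fake_fk():
    if len(calls) < 3:
        time.sleep(0.02)
    calls.append(1)

mean_seconds = time_after_warmup(fake_fk, warmup=3, repeats=5)
```

With a real chain, fn would be something like a lambda wrapping chain.forward_kinematics on a fixed batch of joint values, with sync=torch.cuda.synchronize when running on the GPU.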