autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License

> Unable to find a valid cuDNN algorithm to run convolution #134

Closed haodong2000 closed 1 year ago

haodong2000 commented 1 year ago

RTX 3060 with 6 GB memory, Ubuntu 22.04

In environment.yml I replaced cudatoolkit=10.2 with cudatoolkit=11.3. In team_code_transfuser/requirements.txt I replaced

torch==1.11.0
torchaudio==0.11.0
torchvision==0.12.0

with

--extra-index-url https://download.pytorch.org/whl/cu113
torch==1.12.0+cu113
--extra-index-url https://download.pytorch.org/whl/cu113
torchaudio==0.12.0
--extra-index-url https://download.pytorch.org/whl/cu113
torchvision==0.13.0+cu113

Then I ran:

pip install torch-scatter -f https://data.pyg.org/whl/torch-1.12.0+cu113.html
pip install mmcv-full==1.6.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.0/index.html

However, after starting the CARLA server with ./CarlaUE4.sh --world-port=2000 -prefernvidia, I ran ./leaderboard/scripts/local_evaluation.sh and got this error: Stopping the route, the agent has crashed: > Unable to find a valid cuDNN algorithm to run convolution

Here is the log of one iteration:

(tfuse) haodong@rog:~/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser$ ./leaderboard/scripts/local_evaluation.sh /home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/carla /home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser
/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/leaderboard/leaderboard/leaderboard_evaluator_local.py:89: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(dist.version) < LooseVersion('0.9.10'):

========= Preparing RouteScenario_15 (repetition 0) =========
> Setting up the agent
/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/model_ckpt/transfuser/model_seed1_39.pth
> Loading the world
Base transform is blocking objects  Transform(Location(x=252.777176, y=-67.988762, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.777176, y=-67.988762, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.801483, y=-68.988464, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.801483, y=-68.988464, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.825806, y=-69.988167, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.825806, y=-69.988167, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.825806, y=-69.988167, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.850113, y=-70.987869, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.850113, y=-70.987869, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.874435, y=-71.987579, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.874435, y=-71.987579, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.874435, y=-71.987579, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.898743, y=-72.987282, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.898743, y=-72.987282, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.923065, y=-73.986984, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.923065, y=-73.986984, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.923065, y=-73.986984, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.947372, y=-74.986694, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.947372, y=-74.986694, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Base transform is blocking objects  Transform(Location(x=252.971680, y=-75.986389, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
Skipping scenario 'Scenario3' due to setup error: Error: Unable to spawn vehicle walker.pedestrian.0015 at Transform(Location(x=252.971680, y=-75.986389, z=0.600000), Rotation(pitch=0.000000, yaw=541.393188, roll=0.000000))
No more spawn points to use. Spawned 266 actors out of 500
> Running the route

Stopping the route, the agent has crashed:
> Unable to find a valid cuDNN algorithm to run convolution

Traceback (most recent call last):
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/leaderboard/leaderboard/scenarios/scenario_manager_local.py", line 152, in _tick_scenario
    ego_action = self._agent()
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/leaderboard/leaderboard/autoagents/agent_wrapper_local.py", line 84, in __call__
    return self._agent()
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/leaderboard/leaderboard/autoagents/autonomous_agent.py", line 115, in __call__
    control = self.run_step(input_data, timestamp)
  File "/home/haodong/Applications/Miniconda/Miniconda/envs/tfuse/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/team_code_transfuser/submission_agent.py", line 299, in run_step
    forced_move=is_stuck, debug=self.config.debug, rgb_back=self.rgb_back)
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/team_code_transfuser/model.py", line 695, in forward_ego
    features, image_features_grid, fused_features = self._model(rgb, lidar_bev, ego_vel)
  File "/home/haodong/Applications/Miniconda/Miniconda/envs/tfuse/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/team_code_transfuser/transfuser.py", line 134, in forward
    image_features = self.image_encoder.features.conv1(image_tensor)
  File "/home/haodong/Applications/Miniconda/Miniconda/envs/tfuse/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/haodong/Applications/Miniconda/Miniconda/envs/tfuse/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/haodong/Applications/Miniconda/Miniconda/envs/tfuse/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/leaderboard/leaderboard/leaderboard_evaluator_local.py", line 351, in _load_and_run_scenario
    self.manager.run_scenario()
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/leaderboard/leaderboard/scenarios/scenario_manager_local.py", line 136, in run_scenario
    self._tick_scenario(timestamp)
  File "/home/haodong/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/leaderboard/leaderboard/scenarios/scenario_manager_local.py", line 159, in _tick_scenario
    raise AgentError(e)
leaderboard.autoagents.agent_wrapper_local.AgentError: Unable to find a valid cuDNN algorithm to run convolution
> Stopping the route

========= Results of RouteScenario_15 (repetition 0) ------ FAILURE =========

╒═════════════════════════════════╤═════════════════════╕
│ Start Time                      │ 2022-12-22 23:46:53 │
├─────────────────────────────────┼─────────────────────┤
│ End Time                        │ 2022-12-22 23:46:54 │
├─────────────────────────────────┼─────────────────────┤
│ Duration (System Time)          │ 0.94s               │
├─────────────────────────────────┼─────────────────────┤
│ Duration (Game Time)            │ 0.05s               │
├─────────────────────────────────┼─────────────────────┤
│ Ratio (System Time / Game Time) │ 0.053               │
╘═════════════════════════════════╧═════════════════════╛

╒═══════════════════════╤═════════╤═════════╕
│ Criterion             │ Result  │ Value   │
├───────────────────────┼─────────┼─────────┤
│ RouteCompletionTest   │ FAILURE │ 0.0 %   │
├───────────────────────┼─────────┼─────────┤
│ OutsideRouteLanesTest │ SUCCESS │ 0 %     │
├───────────────────────┼─────────┼─────────┤
│ CollisionTest         │ SUCCESS │ 0 times │
├───────────────────────┼─────────┼─────────┤
│ RunningRedLightTest   │ SUCCESS │ 0 times │
├───────────────────────┼─────────┼─────────┤
│ RunningStopTest       │ SUCCESS │ 0 times │
├───────────────────────┼─────────┼─────────┤
│ InRouteTest           │ SUCCESS │         │
├───────────────────────┼─────────┼─────────┤
│ AgentBlockedTest      │ SUCCESS │         │
├───────────────────────┼─────────┼─────────┤
│ Timeout               │ SUCCESS │         │
╘═══════════════════════╧═════════╧═════════╛

> Registering the route statistics

haodong2000 commented 1 year ago

Any help will be appreciated!!!

haodong2000 commented 1 year ago

The reason I ran ./CarlaUE4.sh --world-port=2000 -prefernvidia instead of ./CarlaUE4.sh --world-port=2000 -opengl is that with -opengl I always got a crash as soon as local_evaluation.sh was executed. Here is the log:

haodong@rog:~/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser/carla$ ./CarlaUE4.sh --world-port=2000 -opengl
4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
../src/intel/isl/isl.c:2220: FINISHME: ../src/intel/isl/isl.c:isl_surf_supports_ccs: CCS for 3D textures is disabled, but a workaround is available.
LowLevelFatalError [File:Unknown] [Line: 3762] 
Failed to link program [Program V_3AA5F7CCAD2B351344BE53DDA22E8BAEBA78720A_P_ABADFE631118A2CF0719C49D3F936DABF0CE8865]. Current total programs: 268, precompile: 0
Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=11
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=147680 LargeMemoryPoolOffset=278816 
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Segmentation fault (core dumped)

Also, to avoid a CUDA memory error, I removed two of the three pretrained .pth models before executing local_evaluation.sh.

Kait0 commented 1 year ago

Hm, it is unclear to me what went wrong here. Your setup looks correct. Is there maybe also a different version of PyTorch installed that is being used and crashes? What does pip list return after you activate the tfuse environment?

haodong2000 commented 1 year ago

Hello, here is the output:

(tfuse) haodong@rog:~/Projects/Ali_DAMO_AVTest/carla_apollo/transfuser$ python -m pip list
Package              Version
-------------------- ----------------
absl-py              0.9.0
addict               2.4.0
attrs                19.3.0
backcall             0.1.0
bleach               3.1.4
cachetools           4.0.0
certifi              2018.8.24
chardet              3.0.4
click                7.1.2
colorama             0.4.6
configparser         4.0.2
cycler               0.10.0
decorator            4.4.2
defusedxml           0.6.0
dictor               0.1.5
diskcache            5.3.0
docker-pycreds       0.4.0
elementpath          1.3.3
entrypoints          0.3
ephem                3.7.7.1
filelock             3.3.2
filterpy             1.4.5
future               0.18.2
gitdb                4.0.2
GitPython            3.1.0
google-auth          1.11.3
google-auth-oauthlib 0.4.1
gql                  0.2.0
graphql-core         1.1
grpcio               1.27.2
idna                 2.9
imageio              2.8.0
imgaug               0.4.0
importlib-metadata   1.6.0
ipykernel            5.2.0
ipython              7.13.0
ipython-genutils     0.2.0
ipywidgets           7.5.1
jedi                 0.16.0
Jinja2               2.11.1
joblib               0.14.1
jsonschema           3.2.0
jupyter-client       6.1.2
jupyter-core         4.6.3
kiwisolver           1.1.0
Markdown             3.2.1
MarkupSafe           1.1.1
matplotlib           3.0.3
mistune              0.8.4
mmcls                0.23.2
mmcv-full            1.6.0
mmdet                2.25.0
mmsegmentation       0.25.0
model-index          0.1.11
munkres              1.1.4
nbconvert            5.6.1
nbformat             5.0.4
networkx             2.2
notebook             6.0.3
numpy                1.18.1
nvidia-ml-py3        7.352.0
oauthlib             3.1.0
open3d               0.9.0.0
opencv-python        4.2.0.32
openmim              0.1.5
ordered-set          4.1.0
packaging            22.0
pandas               0.25.3
pandocfilters        1.4.2
parso                0.6.2
pathtools            0.1.2
pexpect              4.8.0
pickleshare          0.7.5
Pillow               7.0.0
pip                  21.2.2
prettytable          3.5.0
prometheus-client    0.7.1
promise              2.3
prompt-toolkit       3.0.5
protobuf             3.11.3
psutil               5.7.0
ptyprocess           0.6.0
py-trees             0.8.3
pyasn1               0.4.8
pyasn1-modules       0.2.8
pycocotools          2.0.6
pydot                1.4.1
pygame               2.0.1
Pygments             2.6.1
pyparsing            2.4.6
pyrsistent           0.16.0
python-dateutil      2.8.1
pytictoc             1.5.2
pytorch-lightning    0.7.1
pytz                 2019.3
PyWavelets           1.1.1
PyYAML               5.3
pyzmq                19.0.0
requests             2.23.0
requests-oauthlib    1.3.0
rsa                  4.0
scikit-image         0.16.2
scikit-learn         0.22.2.post1
scipy                1.4.1
Send2Trash           1.5.0
sentry-sdk           0.14.2
setuptools           61.2.0
Shapely              1.7.0
shortuuid            1.0.1
six                  1.14.0
smmap                3.0.1
subprocess32         3.5.4
tabulate             0.8.7
tensorboard          2.1.1
terminado            0.8.3
terminaltables       3.1.10
testpath             0.4.4
timm                 0.5.4
torch                1.12.0+cu113
torch-scatter        2.1.0+pt112cu113
torchaudio           0.12.0+cu113
torchvision          0.13.0+cu113
tornado              6.0.4
tqdm                 4.43.0
traitlets            4.3.3
typing_extensions    4.4.0
ujson                5.3.0
urllib3              1.25.8
wandb                0.8.29
watchdog             0.10.2
wcwidth              0.1.9
webencodings         0.5.1
Werkzeug             1.0.0
wheel                0.37.1
widgetsnbextension   3.5.1
xmlschema            1.0.18
yapf                 0.32.0
zipp                 3.1.0
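
The torch-related rows in a listing like the one above can be checked for a consistent CUDA tag with a short script. This is only a sketch: the helper names `cuda_tags` and `is_consistent` and the `TORCH_FAMILY` tuple are made up for illustration, and the sample text copies four rows from the listing. It only compares the `cuXXX` tags in the version strings; it does not prove that the builds actually work together.

```python
import re

# Packages expected to share one CUDA tag (an assumption made for this sketch).
TORCH_FAMILY = ("torch", "torchvision", "torchaudio", "torch-scatter")


def cuda_tags(pip_list_text):
    """Map each torch-family package in `pip list` output to its CUDA tag.

    The tag is the first 'cu<digits>' substring of the version column,
    or None when the version carries no such tag.
    """
    tags = {}
    for line in pip_list_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].lower() not in TORCH_FAMILY:
            continue
        match = re.search(r"cu\d+", parts[1])
        tags[parts[0].lower()] = match.group(0) if match else None
    return tags


def is_consistent(tags):
    """True when every listed package reports the same tag (or none do)."""
    return len(set(tags.values())) <= 1


# Rows copied from the listing above.
SAMPLE = """\
torch                1.12.0+cu113
torch-scatter        2.1.0+pt112cu113
torchaudio           0.12.0+cu113
torchvision          0.13.0+cu113
"""

if __name__ == "__main__":
    tags = cuda_tags(SAMPLE)
    for name, tag in sorted(tags.items()):
        print(f"{name}: {tag}")
    print("consistent:", is_consistent(tags))
```

For the rows shown, every package carries the cu113 tag, so a mixed CPU/CUDA or mixed CUDA-version install does not appear to be the problem in this environment.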

I also noticed that the peak GPU memory usage is about 5 GB out of 6 GB. I hope 6 GB is enough ^_^
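
This cuDNN error is often reported when the GPU is close to running out of memory, since the CARLA server and the agent share the same card. That is an assumption here, not something the thread confirms. A minimal sketch for checking the headroom before starting the evaluation could look like the following. It calls `nvidia-smi` with its `--query-gpu` and `--format` options; the helper names `parse_memory` and `free_mib` are hypothetical, and the fallback numbers are made up purely for illustration.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Return a list of (used_mib, total_mib) tuples, one per GPU line.

    Lines that are not two plain integers (for example "[N/A]") are skipped.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2 or not all(field.isdigit() for field in fields):
            continue
        rows.append((int(fields[0]), int(fields[1])))
    return rows


def free_mib(csv_text):
    """Return the free memory in MiB for each GPU line."""
    return [total - used for used, total in parse_memory(csv_text)]


if __name__ == "__main__":
    try:
        output = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No usable nvidia-smi here; fall back to made-up sample numbers.
        output = "5120, 6144\n"
    for index, free in enumerate(free_mib(output)):
        print(f"GPU {index}: {free} MiB free")
```

Running this once with only the CARLA server started gives a rough idea of how much memory is left for the model.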

haodong2000 commented 1 year ago

Dear developer, I accidentally solved it by adding -quality-level=Low after ./CarlaUE4.sh --world-port=2000 -prefernvidia.

However, I ran into new problems. I will create a new issue about them if I cannot handle them myself.

By the way, I am very grateful for your reply!

Update on 2022-12-24:

Running ./CarlaUE4.sh --world-port=2000 -vulkan is better, because -quality-level=Low might lead to unusual behavior of TransFuser.