dotchen / LAV

(CVPR 2022) A minimalist, mapless, end-to-end self-driving stack for joint perception, prediction, planning and control.
https://dotchen.github.io/LAV/
Apache License 2.0

Unable to use the fast agent #23

Open CAS-LRJ opened 1 year ago

CAS-LRJ commented 1 year ago

Hello, thanks for open-sourcing this fantastic work!

I am able to use the default v2 agent. However, I encounter the following error when using the fast agent:

  File "LAV/team_code_v2/lav_agent_fast.py", line 115, in setup
    self.bra_model = torch.jit.load(self.bra_model_trace_dir)
  File "miniconda3/envs/LAV-env2/lib/python3.7/site-packages/torch/jit/_serialization.py", line 162, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Could not setup required agent due to PytorchStreamReader failed reading zip archive: failed finding central directory

Could you please help me to solve this problem? :octocat:
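
(For reference, a quick sanity check one could run here, sketched rather than part of the original report: a TorchScript .pt archive is a zip file, and "failed finding central directory" usually means the file is truncated or not a real archive, e.g. a Git LFS pointer or an interrupted download. The path below is taken from a log line later in this thread.)

import zipfile

# A TorchScript .pt archive is a zip file; check whether it downloaded completely.
path = 'weights/bra_v2_9.pt'  # path assumed from the log output later in this thread
if zipfile.is_zipfile(path):
    print('Looks like a valid zip archive:', zipfile.ZipFile(path).namelist()[:5])
else:
    # Peek at the first bytes to see what the file actually contains.
    with open(path, 'rb') as f:
        print('Not a zip archive; first bytes:', f.read(64))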

CAS-LRJ commented 1 year ago

I tried to convert the torch models into TorchScript myself. The models load successfully, but the agent's performance is not improved. I used the following code to convert the brake prediction model and the segmentation model into TorchScript.

import torch
from models.rgb import RGBSegmentationModel, RGBBrakePredictionModel

# Dummy inputs used only for tracing
input1 = torch.rand(1, 3, 288, 768).to('cuda')
input2 = torch.rand(1, 3, 192, 480).to('cuda')

# Brake prediction model
brake_model = RGBBrakePredictionModel([4, 10, 18]).to('cuda')
brake_model.load_state_dict(torch.load('../weights/bra_v2_9.th'))
brake_model.eval()
traced_bra_model = torch.jit.trace(brake_model, (input1, input2))
traced_bra_model.save('traced_bra_model_v2.pt')

# Segmentation model
seg_model = RGBSegmentationModel([4, 6, 7, 10]).to('cuda')
seg_model.load_state_dict(torch.load('../weights/seg_1.th'))
seg_model.eval()
input3 = torch.rand(3, 3, 288, 256).to('cuda')
traced_seg_model = torch.jit.trace(seg_model, input3)
traced_seg_model.save('traced_seg_model.pt')

It shows several warnings but no errors. I wonder whether I did this correctly or missed something important. By the way, the hardware I am using is a laptop with an i9-12900HX, 32 GB RAM, and an RTX 3080 Ti.
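
(One way to check whether the trace is faithful, a minimal sketch that reuses brake_model, input1, and input2 from the snippet above; it assumes the model returns a single tensor:)

import torch

# Reload the saved trace and compare its output against the eager model
# on the same random inputs used for tracing.
traced = torch.jit.load('traced_bra_model_v2.pt').to('cuda').eval()
with torch.no_grad():
    eager_out = brake_model(input1, input2)
    traced_out = traced(input1, input2)
# If the model returns a tuple, compare the elements one by one instead.
if torch.is_tensor(eager_out):
    print('max abs diff:', (eager_out - traced_out).abs().max().item())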

dotchen commented 1 year ago

Thanks for reporting this! Could you try the updated default .pt files again at your convenience? Did you see any speed difference with the fast agent on your setup? In my case (Titan Xp + E5-2630 v3) I consistently see a 1.5-2x speedup.

CAS-LRJ commented 1 year ago

Hello, the following error occurs with the updated .pt files.

Traceback (most recent call last):
  File "LAV/leaderboard/leaderboard/autoagents/autonomous_agent.py", line 115, in __call__
    control = self.run_step(input_data, timestamp)
  File "miniconda3/envs/LAV-env2/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "LAV/team_code_v2/lav_agent_fast.py", line 265, in run_step
    fused_lidar = self.infer_model.forward_paint(cur_lidar, pred_sem)
  File "LAV/team_code_v2/model_inference.py", line 46, in forward_paint
    painted_lidar = self.point_painting(cur_lidar, pred_sem)
  File "LAV/team_code_v2/model_inference.py", line 87, in point_painting
    lidar_cam = lidar_cam[valid_idx]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Do you use multiple cards in your setup?

In my setup, the default agent reaches a simulation ratio of nearly 0.6, but the fast agent's ratio stays below 0.5. The simulation ratio I mention is the ratio of simulation time to real (wall-clock) time.

Maybe the newer CPU is strong enough to handle the point-painting task. I am going to try TensorRT to boost the inference speed, and I will update the results once I finish the experiment.
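
(For reference, one way to attempt this, sketched only: torch_tensorrt is an extra dependency not used by this repo, and the input shapes below are copied from the tracing snippet above.)

import torch
import torch_tensorrt  # assumed extra dependency, not part of the LAV repo

# Compile the already-traced brake model into a TensorRT-backed TorchScript module.
traced = torch.jit.load('traced_bra_model_v2.pt').eval().cuda()
trt_model = torch_tensorrt.compile(
    traced,
    inputs=[
        torch_tensorrt.Input((1, 3, 288, 768)),  # same shapes used for tracing
        torch_tensorrt.Input((1, 3, 192, 480)),
    ],
    enabled_precisions={torch.half},  # allow FP16 kernels
)
torch.jit.save(trt_model, 'trt_bra_model_v2.pt')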

dotchen commented 1 year ago

Could you try this command and see if it works? I just tried it on my setup and it works:

ROUTES=assets/routes_lav_valid.xml TEAM_AGENT=$HOME/LAV/team_code_v2/lav_agent_fast TEAM_CONFIG=$HOME/LAV/team_code_v2/config.yaml ./leaderboard/scripts/run_evaluation.sh

========= Preparing RouteScenario_0 (repetition 0) =========
> Setting up the agent
> Loading the world
 Base transform is blocking objects  Transform(Location(x=185.695465, y=257.345886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.695114, y=257.845886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.694778, y=258.345886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.694443, y=258.845886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.694092, y=259.345886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.693756, y=259.845886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
Skipping scenario 'Scenario4' due to setup error: Error: Unable to spawn vehicle vehicle.diamondback.century at Transform(Location(x=185.693756, y=259.845886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
> Running the route
======[Agent] Wallclock_time = 2022-09-14 12:08:22.509575 / 0.0 / Sim_time = 0.05000000074505806 / 50.00000074505806x
.....
======[Agent] Wallclock_time = 2022-09-14 12:18:09.936383 / 17.505114 / Sim_time = 3.8000000566244125 / 0.21706702336248995x
======[Agent] Wallclock_time = 2022-09-14 12:18:10.129409 / 17.69814 / Sim_time = 3.8500000573694706 / 0.21752469653155299x
======[Agent] Wallclock_time = 2022-09-14 12:18:10.326214 / 17.894945 / Sim_time = 3.9000000581145287 / 0.21792646647687666x
======[Agent] Wallclock_time = 2022-09-14 12:18:10.511506 / 18.080237 / Sim_time = 3.9500000588595867 / 0.21845850805780526x
======[Agent] Wallclock_time = 2022-09-14 12:18:10.706365 / 18.275096 / Sim_time = 4.000000059604645 / 0.21886512631607125x
======[Agent] Wallclock_time = 2022-09-14 12:18:10.901973 / 18.470704 / Sim_time = 4.050000060349703 / 0.21925427455689536x
======[Agent] Wallclock_time = 2022-09-14 12:18:11.094072 / 18.662803 / Sim_time = 4.100000061094761 / 0.2196765611539492x
======[Agent] Wallclock_time = 2022-09-14 12:18:11.276407 / 18.845138 / Sim_time = 4.150000061839819 / 0.22020427006529503x
======[Agent] Wallclock_time = 2022-09-14 12:18:11.477250 / 19.045981 / Sim_time = 4.200000062584877 / 0.22050738973199357x
======[Agent] Wallclock_time = 2022-09-14 12:18:11.669667 / 19.238398 / Sim_time = 4.250000063329935 / 0.22090088594923474x
======[Agent] Wallclock_time = 2022-09-14 12:18:11.864959 / 19.43369 / Sim_time = 4.300000064074993 / 0.2212538540143935x
======[Agent] Wallclock_time = 2022-09-14 12:18:12.057615 / 19.626346 / Sim_time = 4.350000064820051 / 0.22162956035013856x

Compared to the default agent (sim/real ~0.15x), the fast agent on my setup usually runs above 0.22x (up to 0.25x depending on the route). But if you find the default agent faster on your setup, then yes, it probably means point painting on your CPU is faster than on the GPU plus its overhead, since TorchScripting the brake and segmentation models should give a speedup regardless of hardware platform.

I am not using multiple cards here for inference. Curious to see if wrapping model_inference in TensorRT speeds it up!

CAS-LRJ commented 1 year ago

An error occurs:

========= Preparing RouteScenario_0 (repetition 0) =========
> Setting up the agent

Could not set up the required agent:
> No CUDA GPUs are available

You chose the second GPU on your machine with CUDA_VISIBLE_DEVICES="1". According to the documentation of torch.jit.load, TorchScript modules are moved to the devices they were saved from. My laptop only has one GPU, which may cause the error. Could you please save the TorchScript on the first GPU ("cuda:0")?
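
(For reference, a possible workaround on the loading side, sketched under the assumption that the agent's setup code can be edited: torch.jit.load accepts a map_location argument, which remaps the trace onto the local GPU regardless of the device it was saved from. The path is taken from a log line later in this thread.)

import torch

# Sketch: remap a trace that was saved from e.g. "cuda:1" onto the only local GPU.
bra_model = torch.jit.load('weights/bra_v2_9.pt', map_location='cuda:0')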

dotchen commented 1 year ago

Oops, I forgot to remove the CUDA_VISIBLE_DEVICES= part from the command. Could you run it with CUDA_VISIBLE_DEVICES="0"? The TorchScript module is saved to device "cuda" in a script that sets CUDA_VISIBLE_DEVICES to a single GPU, so I am pretty sure it will work...

CAS-LRJ commented 1 year ago
========= Preparing RouteScenario_0 (repetition 0) =========
> Setting up the agent
weights/bra_v2_9.pt
> Loading the world
 Base transform is blocking objects  Transform(Location(x=185.695465, y=257.345886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.695114, y=257.845886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.694778, y=258.345886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.694443, y=258.845886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.694092, y=259.345886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
 Base transform is blocking objects  Transform(Location(x=185.693756, y=259.845886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
Skipping scenario 'Scenario4' due to setup error: Error: Unable to spawn vehicle vehicle.diamondback.century at Transform(Location(x=185.693756, y=259.845886, z=1.210000), Rotation(pitch=0.000000, yaw=360.039185, roll=0.000000))
> Running the route
======[Agent] Wallclock_time = 2022-09-15 14:07:05.915860 / 0.0 / Sim_time = 0.05000000074505806 / 50.00000074505806x
======[Agent] Wallclock_time = 2022-09-15 14:07:05.955299 / 0.039439 / Sim_time = 0.10000000149011612 / 2.4728603944240986x
/miniconda3/envs/LAV-env2/lib/python3.7/site-packages/torch/nn/modules/module.py:1110: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/native/BinaryOps.cpp:607.)
  return forward_call(*input, **kwargs)
======[Agent] Wallclock_time = 2022-09-15 14:07:07.288802 / 1.372942 / Sim_time = 0.15000000223517418 / 0.10917491585174205x

Stopping the route, the agent has crashed:
> CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Traceback (most recent call last):
  File "/Documents/LAV/leaderboard/leaderboard/scenarios/scenario_manager.py", line 152, in _tick_scenario
    ego_action = self._agent()
  File "/Documents/LAV/leaderboard/leaderboard/autoagents/agent_wrapper.py", line 75, in __call__
    return self._agent()
  File "/Documents/LAV/leaderboard/leaderboard/autoagents/autonomous_agent.py", line 115, in __call__
    control = self.run_step(input_data, timestamp)
  File "/miniconda3/envs/LAV-env2/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Documents/LAV/team_code_v2/lav_agent_fast.py", line 265, in run_step
    fused_lidar = self.infer_model.forward_paint(cur_lidar, pred_sem)
  File "/Documents/LAV/team_code_v2/model_inference.py", line 46, in forward_paint
    painted_lidar = self.point_painting(cur_lidar, pred_sem)
  File "/Documents/LAV/team_code_v2/model_inference.py", line 87, in point_painting
    lidar_cam = lidar_cam[valid_idx]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Documents/LAV/leaderboard/leaderboard/leaderboard_evaluator.py", line 342, in _load_and_run_scenario
    self.manager.run_scenario()
  File "/Documents/LAV/leaderboard/leaderboard/scenarios/scenario_manager.py", line 136, in run_scenario
    self._tick_scenario(timestamp)
  File "/Documents/LAV/leaderboard/leaderboard/scenarios/scenario_manager.py", line 159, in _tick_scenario
    raise AgentError(e)
leaderboard.autoagents.agent_wrapper.AgentError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
> Stopping the route

Still the same illegal memory access error.

I have also tried the TorchScript I converted myself, and it works. So I wonder if the error is caused by the difference in devices (3080 Ti vs. Titan Xp).

dotchen commented 1 year ago

What python and pytorch+cuda versions are you using?

CAS-LRJ commented 1 year ago
python                    3.7.10               h12debd9_4    anaconda
cudatoolkit               11.3.1               h2bc3f7f_2  
pytorch                   1.11.0          py3.7_cuda11.3_cudnn8.2.0_0    pytorch

dotchen commented 1 year ago

Thanks for the info! My guess then is that the PyTorch versions might have caused the discrepancy; the .pt trace file was created with PyTorch 1.7.1 and CUDA toolkit 10.2. But since you got it working by creating the trace yourself, that should be good! I will add a note to the README mentioning that there could be an issue with the versions and point to this thread. Thanks again!

Kin-Zhang commented 1 year ago

How do you produce such a .pt trace file?

Kin-Zhang commented 1 year ago

I didn't find any difference between the v2 and v1 agent training scripts. Does that mean models from the two versions can be used together, for example, is it fine to just use a v1 model with v2? (Forget this one. Since the model scripts differ, a v1 model should not be used with v2; according to the v2 config, only the segmentation model uses the same model file.)

dotchen commented 1 year ago

Hello!

The training scripts are different for the BEV and full-LiDAR agents (the segmentation model is identical; the brake model has a different architecture). I'm working on cleaning them up for release, thanks!

dotchen commented 1 year ago

Sorry for the delay, let me know if you run into any issues running the code!

Yana990 commented 1 year ago

Hello. I also encountered a problem with the agent. After running, it displays:

Could not set up the required agent:
> invalid load key, 'v'.

The command I ran is: ROUTES=/home/zhangting/LAV/assets/routes_lav_valid.xml ./leaderboard/scripts/run_evaluation.sh

How can I solve it? Thank you.

zyjwowowowo commented 6 months ago

Traceback (most recent call last):
  File "d:/Anaconda/photo2cartoon-master/3.py", line 4, in <module>
    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device='cpu', flip_input=False)
  File "D:\Anaconda\envs\tudui\lib\site-packages\face_alignment\api.py", line 88, in __init__
    load_file_from_url(models_urls.get(pytorch_version, default_model_urls)[network_name]))
  File "D:\Anaconda\envs\tudui\lib\site-packages\torch\jit\_serialization.py", line 161, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

What's my problem?