dotchen / LAV

(CVPR 2022) A minimalist, mapless, end-to-end self-driving stack for joint perception, prediction, planning and control.
https://dotchen.github.io/LAV/
Apache License 2.0

Evaluation consumes all memory and leads to failure #10

Open woolpeeker opened 2 years ago

woolpeeker commented 2 years ago

I started a CARLA docker image and copied the LAV code into the container. RouteScenario1 lasts around 2 hours and then exits with the message "RuntimeError: Timeout: Agent took too long to setup". I checked the program's memory usage, and it consumes all 64 GB of RAM.

The evaluation cmd:

ROUTES=leaderboard/data/routes_testing.xml ./leaderboard/scripts/run_evaluation.sh

CARLA runs in headless mode:

SDL_VIDEODRIVER=offscreen SDL_HINT_CUDA_DEVICE=0 ./CarlaUE4.sh -ResX=800 -ResY=600 -nosound -windowed -opengl

The -vulkan flag causes an immediate exit, but the LAV readme says CARLA should be run with -vulkan. Could that be the problem, or do you have any possible clues in mind?
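For reference, the launch line with -vulkan (assuming everything else stays the same as the -opengl command above) is:

SDL_VIDEODRIVER=offscreen SDL_HINT_CUDA_DEVICE=0 ./CarlaUE4.sh -ResX=800 -ResY=600 -nosound -windowed -vulkan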

dotchen commented 2 years ago

Have you pinned down that the issue is actually OOM? This feels like it has something to do with OpenGL, but I am not sure. To help more I need more info: what is your max RAM, and does the CARLA server throw any errors?

woolpeeker commented 2 years ago

There is no OOM error reported. The system is Ubuntu 18.04 and the evaluation is running in a docker container. My machine has 64 GB of RAM, and the GPU is a 2080 Ti with 12 GB.

When the evaluation for route0 had lasted around 2 hours (I don't remember the precise time), the last reported sim_time was around 500 seconds. The CARLA evaluation program reported the following:

> Could not set up the required agent:
> Timeout: Agent took too long to setup
> Watchdog exception - Timeout of 59.0 seconds occurred

Then the evaluation program tries to evaluate the remaining routes and immediately reports the same error for each one.

When the error is reported, the computer is very slow and memory usage is up to 64 GB. The CARLA process consumes around 20 GB, and the rest is used by the evaluation process. The system began to use swap, so I stopped the program.

The Carla server does not throw any errors.

The evaluation time for route0 exceeds 90 minutes. Do you know whether that is normal?
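For reference, this is roughly how the per-process memory can be watched while a route runs (a small sketch assuming psutil is installed; the process-name substrings are guesses and may need adjusting to your setup):

import time
import psutil

WATCH = ("CarlaUE4", "leaderboard_evaluator")  # substrings of the processes to track

while True:
    for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
        try:
            text = " ".join(proc.info["cmdline"] or [proc.info["name"] or ""])
            if any(key in text for key in WATCH):
                rss_gb = proc.info["memory_info"].rss / 1024 ** 3
                print(f"{proc.pid} {proc.info['name']}: {rss_gb:.1f} GiB")
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # process exited or is not readable; skip it
    time.sleep(30)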

woolpeeker commented 2 years ago

I changed the eval cmd to ROUTES=assets/routes_lav_valid.xml ./leaderboard/scripts/run_evaluation.sh and it successfully finished the route0 test, which took about 1 hour.

The memory usage is also huge, reaching a maximum of 37 GB. Counting CARLA's memory usage as well, it is 57 GB, which is close to the limit of my computer.

dotchen commented 2 years ago

Hmm, I don't really have anything in mind that might help you. Since you mentioned you use docker, maybe try our docker image recipe and see if it makes a difference. I have uploaded it to the repo: https://github.com/dotchen/LAV/blob/main/Dockerfile
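Roughly (assuming a standard docker CLI with the NVIDIA container toolkit; the image tag lav is arbitrary):

docker build -t lav .
docker run --gpus all -it lav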

viola521 commented 1 year ago

When I run train_full.py, there is no system OOM, but I get a CUDA out-of-memory error:

lat_features = features.expand(N,*features.size()).permute(1,0,2,3,4).contiguous()[typs]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.78 GiB (GPU 0; 7.79 GiB total capacity; 1.48 GiB already allocated; 2.78 GiB free; 1.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.

My GPU has 8 GB of memory. How can I fix this? I have already reduced the batch size and the number of workers, and tried clearing the GPU cache, but it still throws this error.
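Based on the error message itself, what I understand I could still try is roughly the following (a generic sketch, not LAV's actual training code):

import os
# allocator hint suggested by the error message; must be set before the first
# CUDA allocation, e.g. at the very top of train_full.py
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

# release cached (but unused) blocks between stages, e.g. after validation
torch.cuda.empty_cache()

# mixed precision roughly halves activation memory on an 8 GB GPU
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    pass  # forward pass / loss computation would go here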