autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License

RuntimeError: A sensor took too long to send their data - Is it normal to have this message? #216

Closed MCUBE-2023 closed 3 weeks ago

MCUBE-2023 commented 1 month ago

Dear author, thank you very much for your responsiveness. While evaluating an agent (autopilot) with the CARLA server running via ./leaderboard/scripts/local_evaluation.sh, I got the following message, even though the evaluation finished after 2 hours (please see the attached screenshot):

"Error during the simulation: A sensor took too long to send their data"

In addition, I got this message

"RuntimeError: A sensor took too long to send their data

Stopping the route"

Furthermore, upon checking the tables in my terminal for the results of RouteScenario_0 (repetition 0), I see the word FAILURE highlighted in red. I also see this word in the RouteCompletionTest row under the Result column, although the value of RouteCompletionTest is 70.22%.

To this end, I have the following questions, please: 1- Is it normal to get the message "RuntimeError: A sensor took too long to send their data. Stopping the route"? And is it normal to see the word FAILURE even though the value of RouteCompletionTest is 70.22%? PS: By "normal" I mean the expected result when reproducing an experiment from your work.

2- Running ./leaderboard/scripts/local_evaluation.sh (on RouteScenario_0) took me 2 hours and 7 minutes. My GPU is an NVIDIA Quadro RTX 6000 and I have 32 GB of RAM. How long did this experiment take on your end?

3- During those 2 hours and 7 minutes, I watched the ego vehicle in the bird's-eye view in real time, and it was moving very slowly. Is it normal for the ego vehicle to move this slowly, and how can I make it move faster?

4- Based on my understanding of your article, if I run this experiment again I might not get exactly the same result, because the initialization parameters may differ from one run to the next even for the same experiment. This run gave me a value of 70.22% for RouteCompletionTest; if I repeat it, what margin of deviation should I expect between runs?

Thanks!


Kait0 commented 1 month ago
  1. The "sensor took too long" error can happen, but it should be rare. There are various ways in which the CARLA simulation can crash; in these cases we simply rerun the route (we have scripts that do this automatically, and I released one in this repo). If this happens all the time, something is wrong.

  2. Longest6 took 2-6 hours per route on one 2080ti machine, if I remember correctly. Since there are 108 routes, this can take a lot of time. Fortunately, all routes can be evaluated in parallel; we typically evaluate with something like 32 2080ti GPUs in parallel. I wrote a bit about this here. With that you can typically run an eval overnight.

  3. TransFuser drives at up to 14.4 km/h, which is indeed slow. The reason is that the expert it imitates also drives that slowly, so to change this behavior you need a new dataset collected with a faster expert driver. The simplest option right now is to use the TransFuser++ expert and code, which drives at up to 28.8 km/h. Driving much faster than that is hard because the background traffic does not drive faster than 30 km/h. The CARLA leaderboard 2.0 fixes this issue (other cars drive at up to ~100 km/h or so), but as far as I know there are no public code bases for it yet (with experts or agents).

  4. Hard to say exactly, but the variance is pretty large in general, which is why we typically rerun experiments with 3 different seeds. This is also because the CARLA simulator is not deterministic. You can look at the standard deviations in the paper to get a feel for it (these are all 1 std; assuming a Gaussian distribution, you would multiply by roughly 2 to get a 95% confidence interval; see the sketch below).

The failure message in the second image is normal; anything below 100% is considered a failure there.
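
As a rough illustration of the aggregation described in point 4, here is a minimal sketch with hypothetical scores (not the paper's actual evaluation code):

```python
import statistics

# Hypothetical route-completion scores from 3 evaluation seeds.
scores = [70.22, 74.8, 68.1]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # this is the "1 std" reported in the paper

# Assuming a Gaussian distribution, roughly 2 std on either side of the
# mean covers the 95% confidence interval.
low, high = mean - 2 * std, mean + 2 * std
print(f"{mean:.2f} +/- {std:.2f} (95% CI ~ [{low:.2f}, {high:.2f}])")
```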

MCUBE-2023 commented 1 month ago

Dear Author, Thank you for your quick reply.

I still have the following questions, please: 1- Based on my understanding of the first part of your answer, I need to run local_evaluation.sh multiple times. For example, if it crashes in run 1 and run 2 but not in run 3, can I assume that nothing is abnormal and rely on the result from run 3? Is that correct? And regarding the second part of your answer (carla_garage/evaluate_routes_slurm.py): do I need to replace local_evaluation.sh with evaluate_routes_slurm.py in the transfuser folder and simply run evaluate_routes_slurm.py to avoid these crashes?

2- I need to run an experiment on a route after modifying the TransFuser code, to see how deterministic TransFuser is. Imagine I change the value of some variable x, y, or z and want to observe how the output changes. I would have to wait at least 2 hours per run, which is impractical. Can I instead evaluate TransFuser on a small sample of the routes (say, a 2-minute experiment instead of a 2-hour one) just to see whether my change worked? If so, would you please elaborate on how to do that? This question is crucially important to me, thank you for considering it!

3- Does TransFuser++ give me the same kind of visual results as TransFuser? In other words, can I see the same qualitative examples of the expected driving behavior on the Longest6 routes for both TransFuser and TransFuser++, or is the visual output different? For example, can I get qualitative results like those in this video for both: https://www.youtube.com/watch?v=DZS-U3-iV0s&list=PL6LvknlY2HlQG3YQ2nMIx7WcnyzgK9meO&ab_channel=kashyap7x

4- I see; I will take another look at the examples provided in your article.

Thanks!

Kait0 commented 1 month ago
  1. Rerunning local_evaluation.sh would rerun all the routes, which is inefficient, since usually only individual routes fail. But yes, if you can run it with no crash, you have a valid result.

The evaluate_routes_slurm.py script is from another repository, so it will likely not work as a drop-in replacement, but it should be easy to adapt. Also, you need to run it on a SLURM compute cluster.

  2. If you just want to look at examples (instead of the large-scale quantitative evaluation that Longest6 is designed for), you can just run a short route. I usually use this debug route file for that purpose.

  3. Both TF++ and TF have very similar inputs and outputs (there is a variant called TF++ WP that has exactly the same inputs and outputs). TransFuser++ also has a visualization function, albeit a different one than TransFuser's (we kept it more minimalistic); see e.g. https://www.youtube.com/watch?v=ChrPW8RdqQU for some examples. You can adapt the visualization to your needs by changing the code here.

MCUBE-2023 commented 1 month ago

Dear Author, thank you for your quick reply :)

Pardon me, and let me emphasize that I am really grateful for your responsiveness, but a couple of points still look ambiguous to me:

1- Referring to the 2 screenshots attached here, and to your reply "But yes if you can run it with no crash you have a valid result": as you can see in the screenshots, after running local_evaluation.sh I got A- "Error during the simulation: A sensor took too long to send their data", B- "RuntimeError: A sensor took too long to send their data. Stopping the route", and C- a route-completion value of 70.22%. Taking these elements into consideration, do you consider the result I obtained from local_evaluation.sh valid or not? If it is not valid, what exactly is wrong, and what message should I have gotten for the experiment to be valid?

(two screenshots attached)

2- Regarding the debug route file located in carla_garage (TF++): can I use it by simply dropping it into the TransFuser working directory (for example, directly into work_dir_of_Transfuser/leaderboard/data)? If yes, what changes should I make in TransFuser, and which file exactly should I run so that I can visualize a short route? (Please note that I am talking about running TF, not TF++.)

3- Clear. Thanks!

Kait0 commented 1 month ago

1: No, it is not valid. Error message A should not occur; B and C are fine. You can manually check the transfuser_longest6.json file to get a feel for this. There are 4 possible status messages in there that indicate an error (e.g. "Failed - Simulation crashed", see here). If they occur, you need to rerun the route (see the sketch below).

"Errors" that are the fault of the model (e.g. "Failed - Agent got blocked") are fine.

  2. Yes, you can drop in that file. Just change this line (the route-file path used by local_evaluation.sh) to the debug.xml path.

MCUBE-2023 commented 1 month ago

Dear Author, thank you so much for your responsiveness and your quick replies.

1- I reran the route via local_evaluation.sh. As a matter of fact, I obtained a result (second run) that differs from the first run illustrated earlier in this issue. Unlike the first run (where the experiment stopped at RouteScenario_0), the second run evaluated the scenarios in order from RouteScenario_0 through RouteScenario_11. However, the experiment stopped at RouteScenario_11, and as you can see in screenshot 18, I again got error A- "Error during the simulation: A sensor took too long to send their data". When I checked the content of transfuser_longest6.json for this second run, I saw the message "Failed - Simulation crashed" for RouteScenario_11. Please see the attached screenshots, numbered 1 to 19, for this second run. To this end, I have the following questions, please:

1-a. In the first run, upon getting message A in RouteScenario_0, I followed your advice and reran the experiment (second run). However, the second run stopped at RouteScenario_11 with the same message A. So why didn't rerunning the experiment solve the issue?

1-b. How many RouteScenarios need to be evaluated when running local_evaluation.sh for the experiment to count as valid? In my second run, RouteScenario_0 through RouteScenario_11 were evaluated. Would you please list the order of the RouteScenarios in a valid experiment on your end? And what was the total time for running a valid experiment (please remind me of the hardware you used)?

1-c. How many times should I rerun the route to obtain a valid experiment? 3 times, 10 times, 1000 times...? I am asking because the second run took around 12 hours and stopped at RouteScenario_11 (on an NVIDIA Quadro RTX 6000), and at some point it becomes impractical to rerun such a long experiment N times.


Now I will move on to the second part, which is related to using debug.xml. To this end, I have the following questions, please: 2-a. Have you tried using the debug.xml file (originally provided with TransFuser++) by dropping it into the TransFuser folder? If yes, how long did running local_evaluation.sh take with debug.xml instead of longest6.xml in TransFuser? 2-b. When running local_evaluation.sh in TransFuser++ with debug.xml, how long did the experiment take (please mention the hardware used for TransFuser++)?

Kait0 commented 1 month ago
  1. Your first 10 routes seem valid, and the ~10% failure rate is expected; CARLA is not that stable. What you do is remove the failed route from transfuser_longest6.json and run the script again (make a copy if you haven't done this before). With RESUME=1 set, the evaluation will then start at the last route, e.g. route 11 in this case. A sketch of the removal step follows below.
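
A minimal sketch of that removal step, under the same assumptions about the checkpoint layout as above (the resume bookkeeping may involve more than the records list, so check against the leaderboard code):

```python
import json
import shutil

CHECKPOINT = "transfuser_longest6.json"
shutil.copy(CHECKPOINT, CHECKPOINT + ".bak")  # keep a copy, as suggested above

with open(CHECKPOINT) as f:
    checkpoint = json.load(f)

# Drop crashed routes so the rerun with RESUME=1 evaluates them again.
checkpoint["_checkpoint"]["records"] = [
    r for r in checkpoint["_checkpoint"]["records"]
    if r["status"] != "Failed - Simulation crashed"
]

with open(CHECKPOINT, "w") as f:
    json.dump(checkpoint, f, indent=2)
```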

There are 36 routes in total, and we usually do 3 repetitions. We used around 32 2080ti GPUs, with which it takes roughly 12 hours. Again, let me refer to the text here. I do not think it is a good idea to do research with Longest6 if you only have a single GPU. Usually you evaluate the individual routes [https://github.com/autonomousvision/transfuser/tree/2022/leaderboard/data/longest6/longest6_split] all in parallel, instead of running all routes sequentially as you are doing (see the fan-out sketch below). Older benchmarks like NEAT's or Town05 Short use less compute but are basically solved.
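
A very rough local sketch of that fan-out idea (hypothetical: it assumes local_evaluation.sh has been adapted to read ROUTES, PORT, and CHECKPOINT_ENDPOINT from the environment rather than hard-coding them, that a results/ directory exists, and that one CARLA server is already running per port; the actual evaluate_routes_slurm.py handles this properly on a cluster):

```python
import os
import pathlib
import subprocess

SPLIT_DIR = pathlib.Path("leaderboard/data/longest6/longest6_split")

procs = []
for i, route_file in enumerate(sorted(SPLIT_DIR.glob("longest_weathers_*.xml"))):
    env = dict(os.environ,
               ROUTES=str(route_file),                       # one route file per job
               PORT=str(2000 + 2 * i),                       # one CARLA server per job
               CHECKPOINT_ENDPOINT=f"results/route_{i}.json")  # separate result files
    procs.append(subprocess.Popen(
        ["bash", "leaderboard/scripts/local_evaluation.sh"], env=env))

# Wait for all evaluation clients to finish.
for p in procs:
    p.wait()
```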

I have used debugging route files in this repo as well. I don't remember how long they take, since I change the content of the file depending on what I want to test. Single short routes can be evaluated in a couple of minutes on a local computer, but they are not statistically meaningful evaluations of model performance (hence the name debug).

MCUBE-2023 commented 1 month ago

Thank you very much for your responsiveness and your quick replies. Sorry for asking so many questions, but that is a sign of how great the impact of your work is on the open-source community.

I have the following questions, please:

1- Referring to your quote "Single short routes you can evaluate in a couple of minutes on a local computer": if I understand correctly, do you mean by "short route" one of the routes from this link [https://github.com/autonomousvision/transfuser/tree/2022/leaderboard/data/longest6/longest6_split], for example longest_weathers_0.xml? Is that correct?

2- Assuming your answer to the previous question is yes: I ran an experiment using a single short route, longest_weathers_0.xml, and executing local_evaluation.sh took 1 hour and 15 minutes (my hardware is an NVIDIA Quadro RTX 6000). However, you mentioned that a short route can be evaluated in a "couple of minutes". So here are my questions: 2-a. When you say "couple of minutes", do you mean at most 10 minutes, for example? 2-b. Given my hardware (NVIDIA Quadro RTX 6000), is it normal that evaluating a single short route takes 1 hour and 15 minutes (which presumably contradicts "couple of minutes")?

Thanks 😊

Kait0 commented 1 month ago

1- No, that is still a long route (usually 1-2 km), albeit a single one. 1 hour and 15 minutes is normal for that.

I meant something like the training routes (~300 m) or the Town05 Short routes: https://github.com/autonomousvision/transfuser/blob/cvpr2021/leaderboard/data/validation_routes/routes_town05_short.xml. These files contain multiple routes; if you want only a single one, you can just edit the XML and remove the other routes (a sketch of that edit follows below).
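
A minimal sketch of that edit, done programmatically with Python's standard library (the route id to keep is a hypothetical choice; the leaderboard route files consist of a <routes> root with one <route> element per route):

```python
import xml.etree.ElementTree as ET

SRC = "leaderboard/data/validation_routes/routes_town05_short.xml"
DST = "leaderboard/data/validation_routes/routes_town05_single.xml"
KEEP_ID = "0"  # hypothetical: keep only the route with id="0"

tree = ET.parse(SRC)
root = tree.getroot()

# Remove every <route> element except the one we want to keep.
for route in list(root.findall("route")):
    if route.get("id") != KEEP_ID:
        root.remove(route)

tree.write(DST, encoding="UTF-8", xml_declaration=True)
```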

MCUBE-2023 commented 1 month ago

Thank you so much for these insights! They were really helpful for my research, and thanks to your help I succeeded in running this experiment in a couple of minutes :)

What I did is refer to the file you pinpointed (transfuser/leaderboard/data/validation_routes/routes_town05_short.xml), edit routes_town05_short.xml, and remove the other routes. Here is the edited version of routes_town05_short.xml:

<?xml version='1.0' encoding='UTF-8'?>

I ran the evaluation, and it took 15 minutes (I am very happy with that time; thank you so much for helping me obtain the desired result).

Now, I have an issue with parsing the results for the edited version of routes_town05_short.xml. When I run result_parser.py, I get this error:

```
Traceback (most recent call last):
  File "result_parser.py", line 384, in <module>
    main()
  File "result_parser.py", line 251, in main
    for route in route_evaluation]
  File "result_parser.py", line 251, in <listcomp>
    for route in route_evaluation]
KeyError: '1'
```

So would you please tell me what changes I should make so that result_parser.py runs correctly (without any errors) on this edited version of routes_town05_short.xml?

Kait0 commented 4 weeks ago

Hm, I don't know this error. You need to set the --xml option to your new route file, e.g. python result_parser.py --xml <path to your edited routes_town05_short.xml> (plus whatever other options you already pass). If that is not the issue, you will need to debug to see what is going on; the error message doesn't give many hints.

MCUBE-2023 commented 3 weeks ago

Thank you so much! The issue is resolved.

I am really grateful for your help :)