dotchen / LearningByCheating

(CoRL 2019) Driving in CARLA using waypoint prediction and two-stage imitation learning
MIT License

Cannot reproduce the Privileged Agent reported results #2

Closed fnozarian closed 4 years ago

fnozarian commented 4 years ago

I'm trying to reproduce the privileged agent by training and benchmarking it on my own dataset. I generated the dataset by running Carla 0.9.6 and the data_collector.py file with the default parameters. However, despite what you've mentioned in your paper, I had to generate 200 episodes for the training set to get 179103 frames! So my first question is how did you generate 174k frames with only 100 episodes? For the validation set, I've generated 20 episodes in the same town (Town01) as the training set, with 18188 frames.

Here is the train/val loss that I got so far: [image: train/val loss curves]

As you can see, I can't get the validation loss smaller than or even close to 3e-5, as you mentioned on the README page. I've benchmarked the agent with both the 128th and 256th checkpoints in Town02, but the results are far worse than what you've reported.

dotchen commented 4 years ago

So my first question is how did you generate 174k frames with only 100 episodes?

Can you make sure you followed the setup instructions exactly and that the code is unchanged? I would not worry about the exact number of episodes needed to generate 180k frames as long as the setup is correct, since each episode is random (the pedestrians, the start/target positions).

As you can see, I can't get the validation loss smaller than or even close to 3e-5, as you mentioned on the README page.

I am not sure where 3e-5 comes from; the README page clearly says 5e-3. Regardless, I recommend double-checking the dataset: check the ratios of the left/right/straight/follow commands and visualize the model predictions (provided in the training script) to make sure things are clean. Here is the loss plot of a well-trained privileged agent, for your reference:

[image: birdview_loss plot]
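If it helps, counting the command distribution over the dataset can be as simple as the following sketch (the file pattern and the `command` field name are assumptions about the on-disk format, not the repo's actual layout, so adjust them to how your data is stored):

```python
import glob
import json
from collections import Counter

# Count the high-level command of every frame (assumed encoding:
# 1=left, 2=right, 3=straight, 4=follow). Path pattern and field
# name are guesses about the dataset layout.
counts = Counter()
for path in glob.glob('path/to/dataset/episode_*/measurements_*.json'):
    with open(path) as f:
        counts[json.load(f)['command']] += 1

total = sum(counts.values())
for cmd, n in sorted(counts.items()):
    print(f'command {cmd}: {n} frames ({100 * n / total:.1f}%)')
```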

I've benchmarked the agent with both the 128th and 256th checkpoints in Town02, but the results are far worse than what you've reported.

As the README page says, if you retrain the model you need to retune the PID parameters for the best performance. But again, you need to do this after you get a well-trained model. Were you able to reproduce the results using our provided model checkpoints?

fnozarian commented 4 years ago

Thank you for the reply and suggestions.

Can you make sure you followed the setup instructions exactly and that the code is unchanged?

I didn't touch the code except for fixing an import error for `from train_util import one_hot` in train_birdview.py.

I would not worry about the exact number of episodes needed to generate 180k frames

Since I think the problem comes from the generated dataset, I was wondering why the simulator (or the data collector script) generates episodes that are, on average, nearly half the length of the episodes used in the paper.
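A quick way to sanity-check the frames-per-episode distribution could look roughly like this (a sketch only; the directory layout and file naming below are assumptions about how the collector saves data):

```python
import glob
import os

# Count saved frames per episode directory (layout and naming are assumed).
dataset_root = 'path/to/dataset'
lengths = []
for episode_dir in sorted(glob.glob(os.path.join(dataset_root, 'episode_*'))):
    lengths.append(len(glob.glob(os.path.join(episode_dir, '*.png'))))

if lengths:
    print(f'{len(lengths)} episodes, {sum(lengths)} frames total, '
          f'{sum(lengths) / len(lengths):.0f} frames/episode on average')
```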

I am not sure where 3e-5 comes from; the README page clearly says 5e-3.

Sorry, I meant 5e-3.

Regardless, I recommend double-checking the dataset: check the ratios of the left/right/straight/follow commands and visualize the model predictions (provided in the training script) to make sure things are clean.

I checked the command ratios by setting the --cmd-biased parameter. It turned out that the samples are heavily biased, with the following counts for the four commands: 1: 5322, 2: 2185, 3: 6706, 4: 164890. But even training with balanced samples didn't solve the problem. However, I found some weird examples in the visualizations of the train/val predictions, in which the ground-truth waypoints violate (pass through) vehicles and red lights, as you can see in the following images:

Training: Row 2, Column 1; Row 3, Column 4 [image]

Validation:

Row 1, Column 4; Row 4, Column 3 [image]

Were you able to reproduce the results using our provided model checkpoints?

Yes, I benchmarked the provided privileged checkpoint and I could reproduce the same results.

dotchen commented 4 years ago

Yes, I benchmarked the provided privileged checkpoint and I could reproduce the same results.

Okay, this is a good start.

However, I found some weird examples in the visualizations of the train/val predictions, in which the ground-truth waypoints violate (pass through) vehicles and red lights, as you can see in the following images:

If you look at the code, you will find that the ground-truth waypoints are simply the car's locations in the future. There is nothing weird about them crossing vehicles or red lights. Does that make sense? However, some of the visualizations do look troubling to me (validation: row 2, col 2; row 3, col 1; row 4, col 2). These should not appear in clean autopilot data. Are you launching the Carla server with the -benchmark flag?
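On the waypoint labels: conceptually they are just the ego car's position a few steps into the future, re-expressed in the current ego frame. A rough sketch of that idea (not the repo's exact code; `positions` and `yaws` are assumed per-frame ego states):

```python
import numpy as np

def waypoint_labels(positions, yaws, frame, gap=5, n_waypoints=5):
    """Ground-truth waypoints are the ego car's own future locations,
    expressed in the current ego frame; they naturally cross vehicles or
    red lights whenever that is where the car drove a moment later.

    positions: (N, 2) array of world-frame x/y per frame (assumed)
    yaws:      (N,) array of headings in radians per frame (assumed)
    """
    x0, y0 = positions[frame]
    yaw = yaws[frame]
    # Rotate world-frame displacements by -yaw to express them in the ego frame.
    rot = np.array([[np.cos(-yaw), -np.sin(-yaw)],
                    [np.sin(-yaw),  np.cos(-yaw)]])
    future = positions[frame + gap : frame + gap * (n_waypoints + 1) : gap]
    return (future - np.array([x0, y0])) @ rot.T
```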

I checked the command ratios by setting the --cmd-biased parameter. It turned out that the samples are heavily biased, with the following counts for the four commands: 1: 5322, 2: 2185, 3: 6706, 4: 164890

This looks good to me. The follow command dominates because the other commands only appear at intersections. From my experience, you should still be able to train a good privileged model without balancing.
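That said, if one does want to balance by command, the usual PyTorch pattern is a WeightedRandomSampler. A self-contained sketch with fabricated labels (in practice `commands` would come from the real dataset):

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-ins: `commands` would be the per-frame high-level command labels
# of the real dataset; here they are fabricated just to show the mechanics.
commands = [4] * 900 + [1] * 40 + [2] * 30 + [3] * 30
dataset = TensorDataset(torch.randn(len(commands), 3), torch.tensor(commands))

# Weight each frame inversely to its command frequency.
counts = Counter(commands)
weights = torch.tensor([1.0 / counts[c] for c in commands], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
# Each batch now has roughly uniform command frequencies instead of ~90% "follow".
```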

fnozarian commented 4 years ago

Does that make sense?

Hmm... I agree about the future locations, but don't you think that in these cases (e.g., consider the red-light samples at validation row 1, col 4 and row 2, col 1) we might have the same state/input but ambiguous waypoint labels (i.e., both lights are red, but one sample has crossing waypoints and the other doesn't)?

Are you launching the Carla server with the -benchmark flag?

Yes. To make sure nothing went wrong during my last data generation, I generated different sets of data on different machines based on the instructions and trained different models, but I got the same results for all. I'm also getting this warning but I don't think it's related to the problem:

WARNING: Version mismatch detected: You are trying to connect to a simulator that might be incompatible with this API 
WARNING: Client API version     = 0.9.6-15-g2ce2ee3b 
WARNING: Simulator API version  = 0.9.6

dotchen commented 4 years ago

we might have the same state/input but ambiguous waypoint labels (i.e., both lights are red, but one sample has crossing waypoints and the other doesn't)

Since the waypoints are future locations, there are scenarios where the red light turns green shortly after, resulting in the crossing waypoints. I would not worry about the ambiguity here if the network is able to pick up the knowledge.

but I got the same results for all

What is your final validation loss for this run? Do you see a similar pattern in the losses to the screenshot shown above? Have you tuned the PID controller parameters? What are your success rates? Do you still see states like the ones described in your last post?

I'm also getting this warning but I don't think it's related to the problem

This is expected; it happens because you are using our egg file, which is compiled separately from CARLA.
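For context, this is roughly how a precompiled egg usually ends up on the Python path in CARLA projects (the stock CARLA example pattern; the dist path is an assumption, not necessarily this repo's layout):

```python
import glob
import os
import sys

# Put the bundled CARLA egg on the path before importing the API.
# The version string baked into the egg is what the client/simulator
# version-mismatch warning compares against.
try:
    sys.path.append(glob.glob('PythonAPI/carla/dist/carla-*%d.%d-%s.egg' % (
        sys.version_info.major,
        sys.version_info.minor,
        'win-amd64' if os.name == 'nt' else 'linux-x86_64'))[0])
except IndexError:
    pass

import carla
```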

fnozarian commented 4 years ago

What is your final validation loss for this run? Do you see a similar pattern in the losses to the screenshot shown above?

Here are the losses: [images: loss curves] The interesting thing is that the validation loss of your privileged checkpoint on my validation dataset was about 0.04, far from 5e-3, which suggests there is something wrong with the dataset.

Have you tuned the PID controller parameters?

No, as you mentioned, "you need to do this after you get a well-trained model."

What are your success rates?

Since the loss was not good enough, I haven't benchmarked the checkpoints.

Do you still see states like the ones described in your last post?

Not exactly the same examples, but there are still some weird examples that you can find here: https://drive.google.com/drive/folders/1U1rLvSvWEOmT9z14ee-3cYOoXT2935rq?usp=sharing, and the Tensorboard logs of all three models with all train/val images here: https://drive.google.com/file/d/1Se2SdLxNWz6eMyt7WqRy97QbE0Gt5fMG/view?usp=sharing

dotchen commented 4 years ago

Thanks for providing the details. I double-checked the code and found that the PID values in data_collector.py are incorrect. It was a mistake I made while refactoring the codebase for release. Please check the updated data_collector.py for the correct PID values (two line changes). I think that explains the dataset problems in your visualizations and loss.

I apologize for the inconvenience. The rest of the code should still be intact. Please let us know if the problem persists after the change.

dotchen commented 4 years ago

@fnozarian Were you able to fix your problem after the change?

fnozarian commented 4 years ago

@dianchen96 thanks a lot for checking the code and finding the bug! Yes, it completely solved the problem, and I could easily get the validation loss below 5e-3 in one of my training runs; I'm currently benchmarking the agent. So far I've gotten 100/100 on FullTown02-v1 and 50/50 on FullTown02-v2. I was waiting for the rest of the benchmark results before replying, but I think it's going to give the same results as reported, so you can close the issue :) I'll go through the rest of the code/phases soon to benchmark the image agent.

dotchen commented 4 years ago

Great, I'm glad it worked out. If you have any questions, feel free to shoot me an email at dchen@cs.utexas.edu

peiyunh commented 4 years ago

Have you tuned the PID controller parameters?

Hi @dianchen96, could you help me understand why we might want to tune the PID parameters to reproduce the results, assuming these parameters have already been tuned? Thanks!

dotchen commented 4 years ago

@peiyunh This is mainly because the training has some stochasticity (we did not set random seeds), and we find that the performance is a bit sensitive to the PID values. But I'd recommend trying the default ones and seeing how they work first.
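For anyone else retraining: the PID parameters in question are just the proportional/integral/derivative gains of the low-level controller that turns the predicted waypoints into steering/throttle. A generic sketch of what gets retuned (the repo's actual controller and gain names may differ):

```python
class PIDController:
    """Minimal PID step. K_P / K_I / K_D (and dt) are the knobs one retunes
    after retraining, since a differently trained model produces slightly
    different waypoints and hence slightly different tracking errors."""

    def __init__(self, K_P=1.0, K_I=0.0, K_D=0.0, dt=0.05):
        self.K_P, self.K_I, self.K_D, self.dt = K_P, K_I, K_D, dt
        self._integral = 0.0
        self._prev_error = 0.0

    def step(self, error):
        # error is e.g. the angle to the next predicted waypoint (for steering)
        # or the speed difference from the target speed (for throttle).
        self._integral += error * self.dt
        derivative = (error - self._prev_error) / self.dt
        self._prev_error = error
        return self.K_P * error + self.K_I * self._integral + self.K_D * derivative
```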

peiyunh commented 4 years ago

Hi @dianchen96, when you suggest tuning PID values, which agent are you referring to? There are several in the code: the autopilot, the birdview (privileged) agent, and the image agent.

Their PID values are set independently to different values by default. I wonder if it is recommended to tune all of them. I might be wrong, but my intuition is that it might be best to use the same PID values for all of them. The code says otherwise, but would that be reasonable?

Also, in terms of the autopilot, the PID values are set very differently (NoisyAgent vs. RoamingAgentMine). In fact, RoamingAgentMine produces quite poor results on NoCrash (running benchmark_agent.py with --autopilot turned on), especially in Town02. I can get much better performance by setting RoamingAgentMine's PID values to the same ones as NoisyAgent's (after the fix mentioned above). Could this be a bug?

Lastly, would you say your implementation of the autopilot is a better version compared to the one used in the original NoCrash paper? If so, I wonder how much better it is and how much that contributes to the performance improvement of the imitation learners.

Thanks a lot!

dotchen commented 4 years ago

when you suggest tuning PID values, which agent are you referring to

To get the most performance out of the birdview agent, one should tune the birdview PID; for the image agent, the image agent's PID.

Also, in terms of the autopilot, the PID values are set very differently

Ah yes, thanks for the catch! This is related to the PID bugfix mentioned earlier in this thread. Please refer to the NoisyAgent for the correct PID values.

Lastly, would you say your implementation of the autopilot is a better version compared to the one used in the original NoCrash paper?

I'd say it is pretty much the same as the default autopilot in the original CARLA repo, as it can navigate through the towns with no problem. The NoCrash paper uses some complicated noise injection during data collection, so I don't think it is comparable.

peiyunh commented 4 years ago

Thanks for the reply!

When benchmarking the autopilot, I also found two weird types of collisions that are likely unrelated to the PID control. The first is when the autopilot-controlled car gets rear-ended by other cars. The other is that the autopilot-controlled vehicle sometimes blatantly runs a red light.

For the first case, my guess is that the other cars are controlled by a different policy, for example the default automatic roaming control from CARLA, which is likely to collide more often. Is it fair to say that the focal vehicle runs a different autopilot policy compared to the other vehicles in the scene?

For the second case, I am not sure why it would run red lights though. Do you have any intuition? I can attach videos if needed.

dotchen commented 4 years ago

Is it fair to say that the focal vehicle runs a different autopilot policy compared to the other vehicles in the scene?

That's exactly what happens. In CARLA 0.9.6 other vehicles' controllers are implemented on the server/C++ side.
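For completeness, a minimal sketch of what that looks like from the client side in 0.9.6: background traffic is spawned and simply handed to the built-in server-side autopilot (the host/port and the number of vehicles here are placeholder defaults, not the benchmark's exact settings):

```python
import random

import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()

blueprints = list(world.get_blueprint_library().filter('vehicle.*'))
spawn_points = world.get_map().get_spawn_points()

# Background vehicles use the built-in autopilot, which runs on the server
# (C++) side, so their behavior is independent of the Python policy that
# drives the ego car.
for transform in random.sample(spawn_points, k=min(20, len(spawn_points))):
    vehicle = world.try_spawn_actor(random.choice(blueprints), transform)
    if vehicle is not None:
        vehicle.set_autopilot(True)
```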

I am not sure why it would run red lights though.

How often does this happen? Last time I checked, this does not happen, except for one traffic light that is constantly ignored because it is mislabeled on the CARLA map... Let me know if that is not the case, though.

dotchen commented 4 years ago

Also, if you don't mind, you can open a new issue to discuss this, or take the conversation to email.