facebookresearch / mvfst-rl

An asynchronous RL platform for congestion control in QUIC transport protocol. https://arxiv.org/abs/1910.04054.

questions about training #25

Closed: chengcheng8632 closed this issue 3 years ago

chengcheng8632 commented 4 years ago

Dear author: I'm sorry to bother you again. I increased kDefaultMaxCwndInMss in the QuicConstants.h file of mvfst/quic to 10000, and I trained only on a 12 Mbps bandwidth with different delays. My training command is as follows:

python3 -m train.train --mode=train --base_logdir=/tmp/logs --total_steps=100000 --learning_rate=0.00001 --num_actors=2 --cc_env_history_size=20

I observed the following problem during training. First, the cwnd value can grow very large, in which case the delay and throughput also become very large, as does the reward; the training does not seem to converge. Then the cwnd value becomes small, but the delay remains large. Finally, after training, the tests show that the selected cwnd value is almost always 10000. This phenomenon is a bit strange. Some of the training process and test results are attached: test_12.log train_12.log

I would like to know whether this phenomenon is caused by an incorrect setting of kDefaultMaxCwndInMss or by a problem with the algorithm itself. Looking forward to your answer. Thank you very much.
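For illustration, here is a minimal, self-contained sketch (not the mvfst-rl implementation) of how a discrete congestion-window update that is clamped to a compiled-in maximum ends up pinned at that maximum once the policy keeps choosing an "increase" action. The action set mirrors the one described in the paper (*2, /2, +10, -10, no-op); the clamping bounds and the Python names are assumptions made for the example, with MAX_CWND_IN_MSS standing in for mvfst's kDefaultMaxCwndInMss.

```python
# Illustrative sketch only: a discrete cwnd update clamped to a configured
# maximum saturates at that maximum when the policy keeps picking "increase".
MIN_CWND_IN_MSS = 2          # assumed lower bound, for illustration only
MAX_CWND_IN_MSS = 10000      # mirrors the value set in QuicConstants.h here

def apply_action(cwnd: int, action: str) -> int:
    """Apply one discrete congestion-window update and clamp to the bounds."""
    if action == "*2":
        cwnd *= 2
    elif action == "/2":
        cwnd //= 2
    elif action == "+10":
        cwnd += 10
    elif action == "-10":
        cwnd -= 10
    # "noop" leaves cwnd unchanged.
    return max(MIN_CWND_IN_MSS, min(MAX_CWND_IN_MSS, cwnd))

cwnd = 10
for _ in range(20):          # a policy stuck on "*2" saturates quickly
    cwnd = apply_action(cwnd, "*2")
print(cwnd)                  # 10000 -- pinned at the configured maximum
```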

odelalleau commented 4 years ago

Hi @chengcheng8632,

Looking at the logs, it looks like the policy gets stuck in one of two behaviors: (a) increasing cwnd forever until it reaches its max value, or (b) decreasing cwnd forever until it reaches its min value.

I suspect that the choice between (a) and (b) depends on the env's delay, since you are using different values, but I'm finding it hard to be sure from the logs. You could confirm it by running an independent test for each delay value.

I'm not sure how sub-optimal these behaviors are in these situations: I'll need to run a few tests on my end to get a better idea of whether or not there is a problem with the learning process. I'll get back to you on this.

Just to be clear, is this a problem that you only observe after recompiling with kDefaultMaxCwndInMss = 10000 (i.e., does the default value of 2000 give you a different behavior, everything else being the same)?

chengcheng8632 commented 4 years ago

Hi @odelalleau, this time I used mvfst-rl for training only in a 12 Mbps bandwidth environment with a delay of 10. I also set kDefaultMaxCwndInMss back to 2000 and recompiled. The training command is as follows:

python3 -m train.train --mode=train --base_logdir=/tmp/logs --total_steps=20000 --learning_rate=0.00001 --num_actors=1 --cc_env_history_size=20

The training and test results are as follows: test_12_cwnd_2000.log train_12_cwnd_2000.log

The training process shows that the algorithm still has the problems described above. We also ran the same environment in Pantheon; the test results are the same as before, and cwnd ends up at kDefaultMaxCwndInMss. These results suggest that the algorithm does not seem to be working. If you have run tests, do you see the same result? I also tested the model you trained in our actual environment and found the same choice of cwnd: cwnd equals kDefaultMaxCwndInMss, while the amount of data in flight is very small. This phenomenon seems strange to me. Looking forward to your reply, thank you very much!

odelalleau commented 4 years ago

Thanks for the follow-up! I didn't have time to check it out today but hopefully I'll be able to provide an update tomorrow :)

chengcheng8632 commented 4 years ago

OK, I hope you can find the time to test it. Thank you very much!

odelalleau commented 4 years ago

A first thing I noticed is that in your logs you are using --cc_env_reward_packet_loss_factor=10.0. Is that intended? (It doesn't appear as overridden on your command line, and the default value is supposed to be 0.)
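As a quick illustration of why that flag matters, here is a simplified sketch of a reward of the general shape discussed in this thread: throughput minus delay and packet-loss penalties. Only --cc_env_reward_packet_loss_factor comes from the logs above; the other coefficients and the numbers are illustrative placeholders, not mvfst-rl's actual formula or defaults.

```python
# Simplified sketch (not the exact mvfst-rl code) of a reward of the form
#   reward = a * throughput - b * delay - c * packet_loss
# showing why an unintended packet-loss factor of 10 can dominate the signal.
def reward(throughput, delay, packet_loss,
           throughput_factor=1.0, delay_factor=0.2, packet_loss_factor=0.0):
    return (throughput_factor * throughput
            - delay_factor * delay
            - packet_loss_factor * packet_loss)

# Same (made-up) network conditions, two settings of the loss factor:
print(reward(12.0, 40.0, 0.5, packet_loss_factor=0.0))   #  4.0: loss ignored
print(reward(12.0, 40.0, 0.5, packet_loss_factor=10.0))  # -1.0: loss term dominates
```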

chengcheng8632 commented 4 years ago

Hi @odelalleau, I am very sorry, this was my mistake. I modified the code and retrained. The training command is as follows:

python3 -m train.train --mode=train --base_logdir=/tmp/logs --total_steps=20000 --learning_rate=0.00001 --num_actors=1 --cc_env_history_size=20

The training and test results are as follows:

test_12_2000.log train_12_2000.log

From the training log, I did not see any sign that cwnd is converging. In the test results, cwnd always stays at 10, and the RL algorithm does not seem to be working. If you have time, could you test whether you get the same result as I do? Looking forward to your reply, thank you very much!

odelalleau commented 4 years ago

Thanks for providing these new logs; indeed, this deserves being looked into. I ran a few experiments on my side and observed similar results (though not exactly the same as yours, which may be due to variance in the results or to some local changes I have on my side). One thing to keep in mind is that the policy is deterministic at test time (while it is stochastic during training), which explains why you may see a different behavior in the train vs. test logs. I'll be digging more into the agent's training and test behaviors in the next few days, at which point I should be able to provide a more in-depth update regarding this issue.
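To make the train vs. test distinction above concrete, here is a minimal PyTorch sketch of the usual IMPALA-style setup: sampling from the policy distribution during training and taking the argmax at test time. It illustrates the general pattern, not the literal mvfst-rl code; the logits are made up.

```python
# Training is stochastic (sample from the policy), testing is deterministic
# (argmax), so the same observation history can produce different train-time
# actions but always the same test-time action.
import torch

logits = torch.tensor([[0.2, 1.5, 0.1, 0.3, 0.4]])  # one step, 5 actions

# Training: sample an action from the softmax distribution.
train_action = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)

# Test: always pick the highest-probability action.
test_action = torch.argmax(logits, dim=-1)

print(train_action.item(), test_action.item())
```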

chengcheng8632 commented 4 years ago

Thanks, @odelalleau. Looking forward to your good news!

odelalleau commented 4 years ago

Hi @chengcheng8632, I just wanted to let you know that my investigation is still ongoing. So far at least I can confirm that the agent is sometimes behaving in unexpected (and possibly sub-optimal) ways. I am currently looking into the reasons and how to improve on this.

chengcheng8632 commented 4 years ago

Hi @odelalleau, thank you for your testing and follow-up. I will keep following this issue. Looking forward to better results! Thanks again!

chengcheng8632 commented 4 years ago

Hi @odelalleau, I also found some issues during testing. First, in the 108 Mbps bandwidth environment, even when the model converged, the ramp-up during the startup phase was slow. Is this caused by the environment? Does our Pantheon environment need to change? Second, I analyzed the input features and found strong correlations between them. Do we need to decouple these features, and should they also be normalized? If you have time, I look forward to your reply. Thank you very much!
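On the normalization question, here is a hedged sketch of one common option: z-scoring the environment-summary features over the observation history before feeding them to the network. Whether mvfst-rl should do this (and where) is exactly what is being asked here; the snippet below is a generic illustration with made-up feature values and scales, not the project's current preprocessing.

```python
# Generic sketch: standardize each feature column of a history window so that
# features with very different scales (bits/s vs. ms vs. loss fraction) end up
# comparable. Values and the 20-step window are illustrative.
import numpy as np

def normalize_history(history: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Z-score each feature column over the history window."""
    mean = history.mean(axis=0)
    std = history.std(axis=0)
    return (history - mean) / (std + eps)

rng = np.random.default_rng(0)
history = np.stack([
    rng.normal(12e6, 1e6, size=20),    # throughput in bits/s
    rng.normal(40.0, 5.0, size=20),    # delay in ms
    rng.normal(0.01, 0.005, size=20),  # loss fraction
], axis=1)

normalized = normalize_history(history)
print(normalized.mean(axis=0).round(3))  # ~0 for each feature
print(normalized.std(axis=0).round(3))   # ~1 for each feature
```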

weiyuxingchen commented 4 years ago

Hello @odelalleau, first of all, thank you very much for sharing this project, but I would like to describe some questions we ran into while trying to reproduce the results:

  1. When we train on a single trace, we find that it is difficult to converge and we cannot reproduce the results of the paper. After training, the model always uses a single action, such as *2 or /2, instead of +10 or -10, or vice versa;
  2. When reading the code, we found that the input to the LSTM is truncated to [-1, 1]. In a 108 Mbps bandwidth environment the reward basically exceeds this range, so will this hard truncation cause problems? In your paper, Figure 3 shows rewards going beyond -25000, which we do not understand (a short illustration of this concern follows after this list);
  3. We have tried many modifications: changes to the inputs, changes to the network structure, and various reward modifications, but most of the time training does not converge, let alone training on multiple traces together;
  4. Congestion control should not be a particularly complex task. We have made many attempts with your method, but none of them converge well (even in a single environment). Is reinforcement learning really this hard to make work (this is also our first time using reinforcement learning), or are there limitations of your method that we are not aware of?
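Regarding point 2, here is a small generic illustration (not mvfst-rl's code) of the difference between hard-clipping a large-magnitude input to [-1, 1], which collapses all values beyond the threshold, and rescaling followed by a smooth squashing such as tanh, which keeps large values distinguishable. The reward values and the rescaling constant are made up.

```python
# Hard clipping discards all magnitude information above the threshold;
# rescaling + tanh preserves the ordering of large values.
import numpy as np

rewards = np.array([-50.0, -500.0, -5000.0, -25000.0])

clipped = np.clip(rewards, -1.0, 1.0)   # all four values collapse to -1.0
rescaled = np.tanh(rewards / 10000.0)   # large values remain distinguishable

print(clipped)    # [-1. -1. -1. -1.]
print(rescaled)   # approximately [-0.005 -0.05 -0.462 -0.987]
```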

Looking forward to your reply.

odelalleau commented 4 years ago

@weiyuxingchen I'll follow up on your questions in #27

@chengcheng8632 just to let you know that I'm still investigating the algorithm's behavior, and I'll make sure to answer your questions once I have good answers to them :)

odelalleau commented 4 years ago

Hi @chengcheng8632 , I just wanted to let you know that I am still looking into it. I actually found some potential issues in the current implementation, and I'm working on a fix. I'll share more once I am confident that things are working as intended.

chengcheng8632 commented 4 years ago

Hi @odelalleau, thank you very much!

odelalleau commented 3 years ago

Hi @chengcheng8632, I apologize that it took so long to get back to you on this (!) It took me a while to identify the problems, fix them, and get the code in good shape for release. FYI, the two main issues were related to the bandwidth and delay computations (and since the reward is based on these, this was affecting training). If you're still curious about giving it a try, I suggest that you re-install everything from scratch. I'll close the issue for now, but feel free to open a new one if you run into new problems. I should now be able to address them more swiftly :)