facebookresearch / mvfst-rl

An asynchronous RL platform for congestion control in QUIC transport protocol. https://arxiv.org/abs/1910.04054.

questions about training and using the trained model #29

Closed: Zidan241 closed this issue 3 years ago

Zidan241 commented 3 years ago

1- While training I have been facing this error: onObservation: Still waiting for an update from ActorPoolServer, skipping observation. It seems that sometimes the agent takes more time than usual and does not return the action for the previous observation; it occurs a lot in each episode. It was said that this rarely occurs, so is there a reason for this?


2- Is it possible to use the exported model in a real environment? Can I, for example, run an mvfst server that runs the trained model?

odelalleau commented 3 years ago

Regarding onObservation: Still waiting for an update from ActorPoolServer, skipping observation: it is expected to see it at the beginning of training -- I noticed there is some kind of "warm up" period where things get slowed down (as it's launching all the actor processes, building the models, etc.) but it should eventually vanish. Since you're running on CPU only, though, it's possible that you may run into performance issues. My suggestion would be to try with fewer actors (start with num_actors=1 total_steps=1000 and increase num_actors progressively to see when it breaks; also keep inference_batch_size=1 for these tests, as a higher inference batch size may cause delays with few actors).
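For instance, a minimal debugging run could look something like this (same train.train entry point as the full command I give further below; add whatever other options you normally pass):

python3 -m train.train \
mode=train \
num_actors=1 \
total_steps=1000 \
inference_batch_size=1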

I'm seeing in your log that the traffic_gen executable failed though, this seems unexpected and it's worth looking into it more closely. Is there any previous error message in the log that could explain this failure?

2- Is it possible to use the exported model in a real environment? Can I, for example, run an mvfst server that runs the trained model?

This is still WIP but it should be possible. The version of mvfst that is synched in the third_party folder has a copy of the mvfst-rl files, and you can see an example where it's used in tperf. Hopefully that should be enough to get you started (don't hesitate if you have questions on this code).

Zidan241 commented 3 years ago

Sorry for the late response. I have tried running only 1 actor for 1000 time steps for error debugging as suggested; here is the session log: session.log


These are some of the errors I observed:

0521 23:14:47.030474 164032 CongestionControlLocalEnv.cpp:37] onObservation: Still waiting for an update from model, skipping observation
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<string>", line 3, in forward

      def addmm(self: Tensor, mat1: Tensor, mat2: Tensor, beta: number = 1.0, alpha: number = 1.0):
          return self + mat1.mm(mat2)
                        ~~~~~~~ <--- HERE

      def batch_norm(input : Tensor, running_mean : Optional[Tensor], running_var : Optional[Tensor], training : bool, momentum : float, eps : float) -> Tensor:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1025 and 513x2052)

*** Aborted at 1621638887 (Unix time, try 'date -d @1621638887') ***
*** Signal 6 (SIGABRT) (0x3e90002809c) received by PID 163996 (pthread TID 0x7f3327fff700) (linux TID 164086) (maybe from PID 163996, UID 1001) (code: -6), stack trace: ***
(error retrieving stack trace)
I0521 23:14:47.074617 163492 CongestionControlEnv.cpp:218] Num states = 2 avg throughput = 0.0609455 MB/sec, avg delay = 74.8455 ms, max delay = 103.764 ms, total Mb lost = 0, reward = -3.72604
I0521 23:14:47.074885 163492 CongestionControlEnv.cpp:50] Action updated (cwndAction=0, cwnd=10), policy elapsed time = 0.003182 ms
I0521 23:14:47.146627 164058 CongestionControlEnv.cpp:218] Num states = 2 avg throughput = 0.029304 MB/sec, avg delay = 8.994 ms, max delay = 17.988 ms, total Mb lost = 0, reward = -4.10763
I0521 23:14:47.146849 164058 CongestionControlEnv.cpp:50] Action updated (cwndAction=0, cwnd=10), policy elapsed time = 0.002698 ms
Traceback (most recent call last):
  File "/home/zidan/bachelor/mvfst-rl/_build/deps/pantheon/src/wrappers/mvfst_rl.py", line 121, in <module>
    main()
  File "/home/zidan/bachelor/mvfst-rl/_build/deps/pantheon/src/wrappers/mvfst_rl.py", line 101, in main
    check_call(cmd)
  File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/zidan/bachelor/mvfst-rl/_build/deps/pantheon/third_party/mvfst-rl/_build/build/traffic_gen/traffic_gen', '--mode=server', '--host=0.0.0.0', '--port=41213', '--cc_algo=rl', '--cc_env_mode=local', '--cc_env_rpc_address=unix:/tmp/rl_server_path_0', '--cc_env_actor_id=17', '--cc_env_job_id=-1', '--cc_env_model_file=/mnt/disks/disk1/logs2/traced_model.pt', '--cc_env_agg=time', '--cc_env_time_window_ms=100', '--cc_env_fixed_window_size=10', '--cc_env_use_state_summary=True', '--cc_env_history_size=20', '--cc_env_norm_ms=100.0', '--cc_env_norm_bytes=1000.0', '--cc_env_actions=0,/2,-10,+10,*2', '--cc_env_reward_log_ratio=True', '--cc_env_reward_throughput_factor=1.0', '--cc_env_reward_throughput_log_offset=1e-05', '--cc_env_reward_delay_factor=0.2', '--cc_env_reward_delay_log_offset=1e-05', '--cc_env_reward_packet_loss_factor=0.0', '--cc_env_reward_packet_loss_log_offset=1e-05', '--cc_env_reward_max_delay=True', '--cc_env_fixed_cwnd=10', '--cc_env_min_rtt_window_length_us=10000000000', '-v=1']' returned non-zero exit status -6
I0521 23:14:47.187276 163492 CongestionControlEnv.cpp:218] Num states = 3 avg throughput = 0.0641177 MB/sec, avg delay = 74.3493 ms, max delay = 95.678 ms, total Mb lost = 0, reward = -3.65908
I0521 23:14:47.187541 163492 CongestionControlEnv.cpp:50] Action updated (cwndAction=0, cwnd=10), policy elapsed time = 0.003052 ms
I0521 23:14:47.301468 163492 CongestionControlEnv.cpp:218] Num states = 2 avg throughput = 0.065665 MB/sec, avg delay = 127.832 ms, max delay = 138.54 ms, total Mb lost = 0, reward = -3.70927
I0521 23:14:47.301692 163492 CongestionControlEnv.cpp:50] Action updated (cwndAction=0, cwnd=10), policy elapsed time = 0.003256 ms
Traceback (most recent call last):
  File "/home/zidan/bachelor/mvfst-rl/_build/deps/pantheon/src/wrappers/mvfst_rl.py", line 121, in <module>
    main()
  File "/home/zidan/bachelor/mvfst-rl/_build/deps/pantheon/src/wrappers/mvfst_rl.py", line 101, in main
    check_call(cmd)
  File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/zidan/bachelor/mvfst-rl/_build/deps/pantheon/third_party/mvfst-rl/_build/build/traffic_gen/traffic_gen', '--mode=server', '--host=0.0.0.0', '--port=46403', '--cc_algo=rl', '--cc_env_mode=local', '--cc_env_rpc_address=unix:/tmp/rl_server_path_0', '--cc_env_actor_id=15', '--cc_env_job_id=-1', '--cc_env_model_file=/mnt/disks/disk1/logs2/traced_model.pt', '--cc_env_agg=time', '--cc_env_time_window_ms=100', '--cc_env_fixed_window_size=10', '--cc_env_use_state_summary=True', '--cc_env_history_size=20', '--cc_env_norm_ms=100.0', '--cc_env_norm_bytes=1000.0', '--cc_env_actions=0,/2,-10,+10,*2', '--cc_env_reward_log_ratio=True', '--cc_env_reward_throughput_factor=1.0', '--cc_env_reward_throughput_log_offset=1e-05', '--cc_env_reward_delay_factor=0.2', '--cc_env_reward_delay_log_offset=1e-05', '--cc_env_reward_packet_loss_factor=0.0', '--cc_env_reward_packet_loss_log_offset=1e-05', '--cc_env_reward_max_delay=True', '--cc_env_fixed_cwnd=10', '--cc_env_min_rtt_window_length_us=10000000000', '-v=1']' returned non-zero exit status -6

There were no time steps that experienced onObservation: Still waiting for an update from ActorPoolServer, skipping observation; however, onObservation: Still waiting for an update from model, skipping observation showed up a couple of times while training. I do not really understand the difference, but it was always followed by a runtime error.


I know that running it on CPU only can be challenging, but if these errors do not affect the model much, can I just train a model with a decreased number of actors and a higher number of time steps?


Also, these were the outputs on TensorBoard of the previous run I had with 40 actors; is this randomness normal?

odelalleau commented 3 years ago

The error you're seeing is an issue I've fixed internally, but I think the version you're using doesn't have the fix: the C++ code has the LSTM hidden size hardcoded to 1024, but the default value in Python is 512 => you need to train with hidden_size=1024 (or rerun ./build.sh after setting const int kLSTMHiddenSize = 512 + 1; in CongestionControlLocalEnv.cpp).
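For what it's worth, the shapes in the error message are consistent with this mismatch: 513 = 512 + 1 and 2052 = 4 x 513 (the four LSTM gates) on one side, versus 1025 = 1024 + 1 on the other, which is also why the suggested constant above is 512 + 1 rather than plain 512.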

I know that running it on CPU only can be challenging, but if these errors do not affect the model much, can I just train a model with a decreased number of actors and a higher number of time steps?

Yeah, actually no need to increase the number of timesteps: it will just take longer to reach this number with fewer actors.

Also, these were the outputs on TensorBoard of the previous run I had with 40 actors; is this randomness normal?

I think so -- you may want to smooth the curves more in TensorBoard. Also, FYI, TensorBoard is mostly for debugging purposes; if you want to analyze results I suggest you use scripts/plotting/plot_sweep.py (see the comments at the top on how to set up Jupyter Lab to use it).

You can try running with these settings (possibly also changing num_actors and inference_batch_size); it should be able to train reasonably well on the first 6 scenarios(*):

python3 -m train.train \
mode=train \
test_after_train=false \
train_job_ids="[0,1,2,3,4,5]" \
total_steps=1_000_000 \
hidden_size=1024 \
unroll_length=8 \
end_of_episode_bootstrap=true \
entropy_cost=1e-3 \
baseline_cost=0.5 \
discounting=0.99 \
reward_clipping=none \
reward_normalization_coeff=1e-5 \
learning_rate=0.0005 \
alpha=0.9 \
momentum=0.001 \
cc_env_history_size=16 \
cc_env_reward_delay_factor=0.75 \
cc_env_reward_packet_loss_factor=0

Then in plot_sweep.py set split_by_job_id=True when calling load_experiments()

(*) Just a heads-up that currently it doesn't generalize well to new settings. This part is still WIP.

Edit: training with test_after_train=false isn't necessarily required, I usually do that because my cluster machines can't run the test mode, and I run tests separately on the most promising runs.

Zidan241 commented 3 years ago

The last run with the new parameters looked promising. It took a long time, but there were no errors except for this error retrieving stack trace at the end of every episode, which I assumed is normal.


Then in plot_sweep.py set split_by_job_id=True when calling load_experiments()

However, I had a problem with plot_sweep: experiments were loaded successfully, but no graphs were plotted.

odelalleau commented 3 years ago

The last run with the new parameters looked promising. It took a long time, but there were no errors except for this error retrieving stack trace at the end of every episode, which I assumed is normal.

I'm not seeing these on my system but I wouldn't worry about it. Episodes are "killed" somewhat abruptly after 30s so it's normal to see these "Signal 15 (SIGTERM)" alerts. I assume the stack trace error you're seeing is related to this.

However, I had a problem with plot_sweep: experiments were loaded successfully, but no graphs were plotted.

Hmm, in Jupyter Lab, to open plot_sweep.py, right click on the file name and do Open With / Notebook. It should look like a Jupyter Notebook (except that the code is a regular Python file, which is more convenient to edit). Then execute all cells in order -- the "interesting" one, plotting stuff, should be the "Training curves" section. HiPlot is useful only when running multiple experiments, e.g., for hyper-parameter selection.

odelalleau commented 3 years ago

@Zidan241 just wanted to let you know that I pushed the latest version of the code to the master branch. It should be pretty close to what you had, so you don't necessarily have to re-install again, but if you ever need to re-install I suggest that you grab that new version instead.

Zidan241 commented 3 years ago

@odelalleau Thank you so much for your continuous support. I have questions about running tperf: 1- So running it as a server runs it as a learner where it maintains the actors, and running it as a client means it is a sender and communicates state updates to the server? Or am I misunderstanding the architecture?

Each actor corresponds to a separate Pantheon instance in a sender–receiver setup with a randomly chosen emulated network scenario.

2- However, in a real environment, what does each actor correspond to?

odelalleau commented 3 years ago

@odelalleau Thank you so much for your continuous support,

Happy to help :)

I have questions about running tperf: 1- So running it as a server runs it as a learner where it maintains the actors, and running it as a client means it is a sender and communicates state updates to the server? Or am I misunderstanding the architecture?

Each actor corresponds to a separate Pantheon instance in a sender–receiver setup with a randomly chosen emulated network scenario.

2- However, in a real environment, what does each actor correspond to?

I sense some confusion here... let me try to clarify (and ask me if this isn't clear enough):

When training the model, there is one Python process running learner.py, which does two things: it updates the model's weights from the data collected by the actors (the "learner" itself), and it runs the inference server that computes the actions sent back to the actors.

In addition to this process, the code also spawns multiple "actor" processes to collect data. Each episode is a run of _build/deps/pantheon/src/experiments/test.py, which is also going to spawn sub-processes to simulate a given network scenario for 30s (in the end the actual executable running the network code is traffic_gen). The traffic_gen executable connects to the inference server to apply congestion control decisions (congestion_control/CongestionControlRPCEnv.h).

The above applies to training (mode=train). The test mode is different: there is no inference server anymore (no learner process), and instead the traffic_gen executable directly loads the model's weights and executes it (using TorchScript). There are still multiple "actors" in parallel (one per network scenario -- you can select which ones to execute with the test_job_ids option).

Finally, you can run tperf directly (with the version of tperf that comes with the version of mvfst used by this repository). This is a completely independent setting (it's not using Pantheon / mahi-mahi, and if you want to simulate specific network conditions you'll have to do it yourself, e.g. with tc -- see the example after the client command below). There isn't any script to launch / analyze tperf results at this time. When running with tperf you'd typically first run a server, with something looking like:

tperf --mode=server --congestion=rl --pacing=true --gso=true --window=15000000 --max_cwnd_mss=4294967295 --num_streams=1 --port=11000 --num_server_worker=1 --host=127.0.0.1 -cc_env_agg='time' -cc_env_time_window_ms='100' -cc_env_fixed_window_size='10' -cc_env_use_state_summary='True' -cc_env_history_size='16' -cc_env_norm_ms='100.0' -cc_env_norm_bytes='1000.0' -cc_env_actions='0,/2,-10,+10,*2' -cc_env_reward_log_ratio='True' -cc_env_reward_throughput_factor='1.0' -cc_env_reward_throughput_log_offset='1e-05' -cc_env_reward_delay_factor='0.75' -cc_env_reward_delay_log_offset='1e-05' -cc_env_reward_packet_loss_factor='0.0' -cc_env_reward_packet_loss_log_offset='1e-05' -cc_env_reward_max_delay='True' -cc_env_fixed_cwnd=10 -cc_env_min_rtt_window_length_us='10000000000' -congestion='rl' -cc_env_job_id='-1' -cc_env_mode=local -cc_env_model_file="/path/to/traced_model.pt"

and then run a test with the client:

tperf --mode=client --window=15000000 --max_cwnd_mss=4294967295 --duration=30 --port=11000 --host=127.0.0.1
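For example, to roughly emulate extra delay and loss on a local tperf run as mentioned above, something along these lines can work (netem via tc; the values and the interface are purely illustrative, adjust to your setup):

sudo tc qdisc add dev lo root netem delay 20ms loss 1%
# run the tperf server and client, then clean up:
sudo tc qdisc del dev lo root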

You can have a look at the script scripts/get_tperf_args.py to figure out which arguments to pass to tperf on the server side.

Hope this helps!

Zidan241 commented 3 years ago

@odelalleau Thank you for your descriptive explanation; I am in the process of understanding the library better. I want to understand the experiments better, to be able to justify the differences in the plotted graphs between the jobs. 1- Each episode runs a random experiment; is there a detailed log of this? 2- While running a 1000-step training (session log: session1.log), I observed some missing files. As you stated above:

Each episode is a run of _build/deps/pantheon/src/experiments/test.py

So, taking this run for example:

Thread: 0, episode: 0, experiment: 0, cmd: /mnt/disks/mvfst-rl/mvfst-rl/_build/deps/pantheon/src/experiments/test.py local --data-dir /mnt/disks/mvfst-rl/checkpoint/run/2021-06-01_22-39-25/train/train_tid0_run0_expt0 --pkill-cleanup --uplink-trace /mnt/disks/mvfst-rl/mvfst-rl/train/traces/0.57mbps-poisson.trace --downlink-trace /mnt/disks/mvfst-rl/mvfst-rl/train/traces/0.57mbps-poisson.trace --prepend-mm-cmds mm-delay 28 mm-loss uplink 0.0477 --extra-mm-link-args --uplink-queue=droptail --uplink-queue-args=packets=14 --schemes=mvfst_rl --run-times=1 --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_actor_id=0 --cc_env_job_id=0 --cc_env_model_file=/mnt/disks/mvfst-rl/checkpoint/run/2021-06-01_22-39-25/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=16 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_log_ratio=True --cc_env_reward_throughput_factor=1.0 --cc_env_reward_throughput_log_offset=1e-05 --cc_env_reward_delay_factor=0.75 --cc_env_reward_delay_log_offset=1e-05 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_packet_loss_log_offset=1e-05 --cc_env_reward_max_delay=True --cc_env_fixed_cwnd=10 --cc_env_min_rtt_window_length_us=10000000000 -v=1"

the data for this episode was supposed to be saved to /mnt/disks/mvfst-rl/checkpoint/run/2021-06-01_22-39-25/train/train_tid0_run0_expt0; however, it appears not to have been saved.

odelalleau commented 3 years ago

Folders like train_tid0_run0_expt0 are temporary folders that get deleted after the end of the episode, at this line: https://github.com/facebookresearch/mvfst-rl/blob/5843153460a3b86868cd176a60dff8bf4342e668/train/pantheon_env.py#L134

The detailed log is essentially available only in the central stderr output; currently there's no mechanism to see only the output of a single episode. I think you could redirect the stderr for each actor to a different file by passing a stderr argument on this line: https://github.com/facebookresearch/mvfst-rl/blob/5843153460a3b86868cd176a60dff8bf4342e668/train/pantheon_env.py#L120

Each episode's statistics are also saved in the file train/logs.tsv under the run folder (but you don't have the details of everything that happened).
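If you want to inspect those statistics yourself, a quick way (assuming you have pandas available; the exact columns depend on the run) is something like:

import pandas as pd

# logs.tsv is tab-separated; point this at your run's train/ subfolder.
df = pd.read_csv("/path/to/run/train/logs.tsv", sep="\t")
print(df.columns.tolist())
print(df.tail())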

Hope this helps, and don't hesitate if you have more questions!

Zidan241 commented 3 years ago

@odelalleau As always, thank you for your fast response. I set stderr=subprocess.STDOUT, however there was no change. I was able to get more information about the packets through RLCongestionController.cpp by lowering the verbosity level of the log message I wanted: https://github.com/facebookresearch/mvfst-rl/blob/5843153460a3b86868cd176a60dff8bf4342e668/congestion_control/RLCongestionController.cpp#L47-L50


I have a couple of questions concerning the training process: 1- I just want to make sure I understand this correctly: "The RL congestion controller accumulates network statistics from ACKs over a fixed time-window", and on each acknowledged packet the environment enters a new state calculated from information from the previous state and from the acknowledged packet. Each state experienced in this fixed time window is added to the states array and, at the end of the time window, sent to the RL congestion controller. For example, the values in this log line are calculated from the states in the state array for this time-window slot: Num states = 3 avg throughput = 0.0223207 MB/sec, avg delay = 41.421 ms, max delay = 82.428 ms, total Mb lost = 0, reward = -7.11074

2- In the policy gradient update, what do 'SPS' and 'Loss' signify? [INFO:14588 learner:971 2021-06-01 22:39:47,206] Step 128 @ 8.5 SPS. Loss: -2.053349.

3- I am having problems understanding Pantheon; for example, this experiment: https://github.com/facebookresearch/mvfst-rl/blob/5843153460a3b86868cd176a60dff8bf4342e668/train/experiments.yml#L26-L36 I understand that it uses mahi-mahi for real-time network emulation, but how is the congestion emulated? How is it calibrated for these locations? Is it related to the tunnel server & tunnel client?

odelalleau commented 3 years ago

I set stderr=subprocess.STDOUT, however there was no change

I guess this would just redirect all actors to stdout, I had in mind some kind of manual redirection to a different file per actor (see e.g. https://stackoverflow.com/questions/4856583/how-do-i-pipe-a-subprocess-call-to-a-text-file).
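A minimal sketch of that kind of per-actor redirection (the helper name is hypothetical; the actual spawning code lives in train/pantheon_env.py):

import subprocess

def launch_actor(cmd, actor_id, log_dir):
    # Send each actor's stderr to its own file so that a single episode's
    # output can be inspected in isolation.
    log_path = "{}/actor_{}.err".format(log_dir, actor_id)
    err_file = open(log_path, "a")
    return subprocess.Popen(cmd, stderr=err_file)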

1- I just want to make sure I understand this correctly (...)

Yes, this sounds correct. The most important places to look at to understand this process are RLCongestionController.cpp (which receives the ACK/loss events) and CongestionControlEnv.cpp (which aggregates the states and computes the summary and reward at the end of each window).

2- In the policy gradient update, what do 'SPS' and 'Loss' signify? [INFO:14588 learner:971 2021-06-01 22:39:47,206] Step 128 @ 8.5 SPS. Loss: -2.053349.

SPS = learner steps per second (should increase as you add more actors since the learner is crunching the data coming from the actors => if it doesn't increase it means your learner isn't able to keep up with the actors and you should reduce the # of actors)

Loss = the total loss being optimized by the learner. It's not very useful -- better look at individual loss components in Tensorboard if you want to check what is going on.
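To give a rough idea of what that scalar contains (illustrative names only; the exact combination is in the learner code, with baseline_cost and entropy_cost being the same flags as in the training command above):

def total_loss(pg_loss, baseline_loss, entropy, baseline_cost=0.5, entropy_cost=1e-3):
    # Sketch: policy-gradient term, value (baseline) term weighted by
    # baseline_cost, and an entropy bonus weighted by entropy_cost.
    return pg_loss + baseline_cost * baseline_loss - entropy_cost * entropy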

3- I am having problems understanding Pantheon

I can't pretend I fully understand it :P

I understand that it uses mahi-mahi for real-time network emulation, but how is the congestion emulated?

How is it calibrated for these locations? Is it related to the tunnel server & tunnel client?

The calibration refers to the fact that these specific settings (the trace file, the delay, the packet loss, the size of the queue) have been fine-tuned so that the overall behavior mimics what happened on a "real world" link from Nepal to AWS India.

Hope this helps!

Zidan241 commented 3 years ago

@odelalleau, I am facing this error when I try to connect the client. I have tried both the model in the repo and one I trained and tested on my machine:

(mvfst-rl) zidan@vm1:/mnt/disks/mvfst-rl/mvfst-rl/third-party/mvfst/_build/build/quic/tools/tperf$ ./tperf --mode=server --congestion=rl --pacing=true --gso=true --window=15000000 --max_cwnd_mss=4294967295 --num_streams=1 --port=11000 --num_server_worker=1 --host=127.0.0.1 -cc_env_agg='time' -cc_env_time_window_ms='100' -cc_env_fixed_window_size='10' -cc_env_use_state_summary='True' -cc_env_history_size='16' -cc_env_norm_ms='100.0' -cc_env_norm_bytes='1000.0' -cc_env_actions='0,/2,-10,+10,*2' -cc_env_reward_log_ratio='True' -cc_env_reward_throughput_factor='1.0' -cc_env_reward_throughput_log_offset='1e-05' -cc_env_reward_delay_factor='0.75' -cc_env_reward_delay_log_offset='1e-05' -cc_env_reward_packet_loss_factor='0.0' -cc_env_reward_packet_loss_log_offset='1e-05' -cc_env_reward_max_delay='True' -cc_env_fixed_cwnd=10 -cc_env_min_rtt_window_length_us='10000000000' -congestion='rl' -cc_env_job_id='-1' -cc_env_mode=local -cc_env_model_file="/mnt/disks/mvfst-rl/mvfst-rl/models/traced_model.pt"
I0616 14:04:31.012548  5993 tperf.cpp:361] TPerfAcceptObserver attached
I0616 14:04:31.012651  5992 tperf.cpp:651] tperf server started at: 127.0.0.1:11000
I0616 14:06:08.547523  5993 RLCongestionControllerFactory.h:36] Creating RLCongestionController
I0616 14:06:08.552242  5993 CongestionControlLocalEnv.cpp:21] Loading traced model from /mnt/disks/mvfst-rl/mvfst-rl/models/traced_model.pt
I0616 14:06:08.631366  5993 tperf.cpp:420] Starting sends to client.
terminate called after throwing an instance of 'c10::Error'
  what():  Expected at most 3 argument(s) for operator 'forward', but received 4 argument(s). Declaration: forward(__torch__.traced_model self, Dict(str, Tensor) argument_1, (Tensor, Tensor) argument_2) -> (((Tensor, Tensor, Tensor), (Tensor, Tensor)))
Exception raised from checkAndNormalizeInputs at ../aten/src/ATen/core/function_schema_inl.h:245 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f2201c72b89 in /mnt/disks/mvfst-rl/mvfst-rl/_build/deps/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0xbc5727 (0x7f2202a65727 in /mnt/disks/mvfst-rl/mvfst-rl/_build/deps/libtorch/lib/libtorch_cpu.so)
frame #2: torch::jit::GraphFunction::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x2d (0x7f2204c3f93d in /mnt/disks/mvfst-rl/mvfst-rl/_build/deps/libtorch/lib/libtorch_cpu.so)
frame #3: torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x16f (0x7f2204c5072f in /mnt/disks/mvfst-rl/mvfst-rl/_build/deps/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x2a0faf (0x55d9af964faf in ./tperf)
frame #5: <unknown function> + 0xc819d (0x7f22014ab19d in /home/zidan/anaconda3/envs/mvfst-rl/lib/libstdc++.so.6)
frame #6: <unknown function> + 0x9609 (0x7f22015b1609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7f22011b0293 in /lib/x86_64-linux-gnu/libc.so.6)

*** Aborted at 1623852368 (Unix time, try 'date -d @1623852368') ***
*** Signal 6 (SIGABRT) (0x3ea00001768) received by PID 5992 (pthread TID 0x7f21fb7f5700) (linux TID 6070) (maybe from PID 5992, UID 1002) (code: -6), stack trace: ***
(error retrieving stack trace)
(mvfst-rl) zidan@vm1:/mnt/disks/mvfst-rl/mvfst-rl/third-party/mvfst/_build/build/quic/tools/tperf$ ./tperf --mode=server --congestion=rl --pacing=true --gso=true --window=15000000 --max_cwnd_mss=4294967295 --num_streams=1 --port=11000 --num_server_worker=1 --host=127.0.0.1 -cc_env_agg='time' -cc_env_time_window_ms='100' -cc_env_fixed_window_size='10' -cc_env_use_state_summary='True' -cc_env_history_size='16' -cc_env_norm_ms='100.0' -cc_env_norm_bytes='1000.0' -cc_env_actions='0,/2,-10,+10,*2' -cc_env_reward_log_ratio='True' -cc_env_reward_throughput_factor='1.0' -cc_env_reward_throughput_log_offset='1e-05' -cc_env_reward_delay_factor='0.75' -cc_env_reward_delay_log_offset='1e-05' -cc_env_reward_packet_loss_factor='0.0' -cc_env_reward_packet_loss_log_offset='1e-05' -cc_env_reward_max_delay='True' -cc_env_fixed_cwnd=10 -cc_env_min_rtt_window_length_us='10000000000' -congestion='rl' -cc_env_job_id='-1' -cc_env_mode=local -cc_env_model_file="/mnt/disks/mvfst-rl/mvfst-rl/checkpoint/run/2021-06-16_14-10-20/traced_model.pt"
I0616 14:23:27.749056 10414 tperf.cpp:361] TPerfAcceptObserver attached
I0616 14:23:27.749156 10413 tperf.cpp:651] tperf server started at: 127.0.0.1:11000
I0616 14:23:36.103754 10414 RLCongestionControllerFactory.h:36] Creating RLCongestionController
I0616 14:23:36.103912 10414 CongestionControlLocalEnv.cpp:21] Loading traced model from /mnt/disks/mvfst-rl/mvfst-rl/checkpoint/run/2021-06-16_14-10-20/traced_model.pt
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed, file path: /mnt/disks/mvfst-rl/mvfst-rl/checkpoint/run/2021-06-16_14-10-20/traced_model.pt
Exception raised from FileAdapter at ../caffe2/serialize/file_adapter.cc:11 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7fb0c0061b89 in /mnt/disks/mvfst-rl/mvfst-rl/_build/deps/libtorch/lib/libc10.so)
frame #1: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2f5 (0x7fb0c204d855 in /mnt/disks/mvfst-rl/mvfst-rl/_build/deps/libtorch/lib/libtorch_cpu.so)
frame #2: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x40 (0x7fb0c3371800 in /mnt/disks/mvfst-rl/mvfst-rl/_build/deps/libtorch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x29f943 (0x55b3bcb5b943 in ./tperf)
frame #4: <unknown function> + 0x292217 (0x55b3bcb4e217 in ./tperf)
frame #5: <unknown function> + 0x8d803 (0x55b3bc949803 in ./tperf)
frame #6: <unknown function> + 0x245ed7 (0x55b3bcb01ed7 in ./tperf)
frame #7: <unknown function> + 0x247057 (0x55b3bcb03057 in ./tperf)
frame #8: <unknown function> + 0x20cd41 (0x55b3bcac8d41 in ./tperf)
frame #9: <unknown function> + 0x1f5ca1 (0x55b3bcab1ca1 in ./tperf)
frame #10: <unknown function> + 0x20820c (0x55b3bcac420c in ./tperf)
frame #11: <unknown function> + 0x208d7b (0x55b3bcac4d7b in ./tperf)
frame #12: <unknown function> + 0x209432 (0x55b3bcac5432 in ./tperf)
frame #13: <unknown function> + 0xbb583 (0x55b3bc977583 in ./tperf)
frame #14: <unknown function> + 0x2113f (0x7fb0bfdc013f in /lib/x86_64-linux-gnu/libevent-2.1.so.7)
frame #15: event_base_loop + 0x52f (0x7fb0bfdc087f in /lib/x86_64-linux-gnu/libevent-2.1.so.7)
frame #16: <unknown function> + 0xc1caf (0x55b3bc97dcaf in ./tperf)
frame #17: <unknown function> + 0xc2155 (0x55b3bc97e155 in ./tperf)
frame #18: <unknown function> + 0xc3a38 (0x55b3bc97fa38 in ./tperf)
frame #19: <unknown function> + 0x3aa933 (0x55b3bcc66933 in ./tperf)
frame #20: <unknown function> + 0xc819d (0x7fb0bf89a19d in /home/zidan/anaconda3/envs/mvfst-rl/lib/libstdc++.so.6)
frame #21: <unknown function> + 0x9609 (0x7fb0bf9a0609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #22: clone + 0x43 (0x7fb0bf59f293 in /lib/x86_64-linux-gnu/libc.so.6)

*** Aborted at 1623853416 (Unix time, try 'date -d @1623853416') ***
*** Signal 6 (SIGABRT) (0x3ea000028ad) received by PID 10413 (pthread TID 0x7fb0bf0d3700) (linux TID 10414) (maybe from PID 10413, UID 1002) (code: -6), stack trace: ***
(error retrieving stack trace)
odelalleau commented 3 years ago

  1. Sorry about that: don't try using the model in models/traced_model.pt, it's a very old model which I had completely forgotten about, so it's definitely outdated. I'll either remove or update it in the future.
  2. In the second case, it seems like it can't actually find the model file; are you sure the path and permissions are correct?
Zidan241 commented 3 years ago

@odelalleau I have a couple of questions about the tperf code:

  1. I am trying to calculate the packet loss percentage. I have tried using the following two functions to calculate the number of packets generated and the number of packets lost, but the results seem off: https://github.com/odelalleau/mvfst/blob/479abf19f71138bd629ae8e2379434b0836443e6/quic/tools/tperf/tperf.cpp#L302 https://github.com/odelalleau/mvfst/blob/479abf19f71138bd629ae8e2379434b0836443e6/quic/tools/tperf/tperf.cpp#L310

Is this the correct way?

  2. Does the following line correspond to the actual amount of bytes sent? When I add packet loss, the difference between this value and the amount of bytes received does not change. https://github.com/odelalleau/mvfst/blob/479abf19f71138bd629ae8e2379434b0836443e6/quic/tools/tperf/tperf.cpp#L470
odelalleau commented 3 years ago

Unfortunately I'm not really familiar with tperf:

  1. Currently in the mvfst-rl project we obtain the loss from within the onPacketAckOrLoss() congestion controller method (https://github.com/facebookincubator/mvfst/blob/3890e656f5007ed64abe0605356c090f4aa57d6e/quic/state/StateData.h#L341). You can combine the data from the LossEvent with what's in the AckEvent to compute a loss percentage (see the rough sketch after this list).

  2. Not sure -- you could also look at this function: https://github.com/odelalleau/mvfst/blob/479abf19f71138bd629ae8e2379434b0836443e6/quic/api/QuicTransportFunctions.cpp#L97
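If it helps, the bookkeeping described in point 1 boils down to something like this (plain Python pseudocode, purely illustrative -- the real logic would live in the C++ congestion controller's onPacketAckOrLoss()):

class LossTracker:
    # Illustrative only: accumulate ack/loss counts to derive a loss percentage.
    def __init__(self):
        self.acked_packets = 0
        self.lost_packets = 0

    def on_packet_ack_or_loss(self, num_acked, num_lost):
        # num_acked / num_lost would come from the AckEvent / LossEvent.
        self.acked_packets += num_acked
        self.lost_packets += num_lost

    def loss_percentage(self):
        total = self.acked_packets + self.lost_packets
        return 100.0 * self.lost_packets / total if total else 0.0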

odelalleau commented 3 years ago
  1. I am trying to calculate the packet loss percentage,

I had a closer look: if you call sock_->getTransportInfo() on this line: https://github.com/odelalleau/mvfst/blob/479abf19f71138bd629ae8e2379434b0836443e6/quic/tools/tperf/tperf.cpp#L409 then you get access to a bunch of useful data on the connection, including packets marked as lost (see https://github.com/facebookincubator/mvfst/blob/db4121c5f665cb849ff6177120858d6fe3e02ebe/quic/api/QuicSocket.h#L141).

Zidan241 commented 3 years ago

@odelalleau, After some time testing and analyzing the results, I had some questions: 1- It is stated in the paper that the model has degraded performance in high-speed networks when it is trained on low-speed experiments; can you please elaborate on that? From my testing I observed very low utilization in the high-speed experiments, close to the fixed 'mvfst-rl' variant, and sometimes even results similar to an untrained model (a 100-step model), and I cannot work out why this happened. Are there any advancements concerning this problem?

2- I have also observed bad performance in experiments with added packet loss: the utilization is lower than normal. How is that, even though the reward function ignores packet loss?

3- Testing out the packet loss factor in the reward function: why is the default value zero? Why not take packet loss into consideration?

odelalleau commented 3 years ago
  1. Which part of the paper are you specifically referring to? If you train only on low bandwidth environments you can’t expect the model to work well on high bandwidth ones.

  2. I'm not sure. I've recently trained models successfully with packet loss, though I'm now using a different reward function. Generally speaking, I found the reward defined as log(throughput) - b*log(delay) (roughly sketched after this list) to be difficult to work with in order to get a high throughput across a variety of network conditions. If you want to update to the latest version of the code you can try the mtenv branch, but (1) I'd suggest a full reinstall and (2) let me know if you do so and I can share some training command lines that worked well for me. Be aware that this branch contains significant changes to the reward and model inputs.

  3. In the past it didn’t help much when I experimented with it. What happens is that penalizing delay already prevents the queue from getting full and thus is usually enough to avoid packet loss.
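For reference, the log-ratio reward mentioned in point 2 corresponds roughly to the sketch below (the factors and offsets map to the cc_env_reward_* flags used in the commands above; the actual implementation is in the C++ environment and may differ in its details):

import math

def log_ratio_reward(throughput, delay, packet_loss,
                     throughput_factor=1.0, throughput_log_offset=1e-5,
                     delay_factor=0.75, delay_log_offset=1e-5,
                     packet_loss_factor=0.0, packet_loss_log_offset=1e-5):
    # Roughly a*log(offset + throughput) - b*log(offset + delay) - c*log(offset + loss),
    # with the packet loss term disabled by default (c = 0).
    return (throughput_factor * math.log(throughput_log_offset + throughput)
            - delay_factor * math.log(delay_log_offset + delay)
            - packet_loss_factor * math.log(packet_loss_log_offset + packet_loss))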

Zidan241 commented 3 years ago
  1. After observing the actions taken while testing with high-speed experiments, the +10 action was always taken, even though the best action would be *2. The model is trained for both high and low speeds, so *2 should have been chosen at least once, but that did not happen.
  2. It would be really interesting to test the mtenv branch; could you please share the training command lines?
odelalleau commented 3 years ago
  1. It's possible that when training in both low- and high-speed envs, the model is reluctant to use the *2 action because it could be bad in low-speed envs. It may take a while for it to realize that *2 is more optimal in high-speed ones (also depending on the reward and gamma, it may not make a big difference in terms of discounted return).

  2. You can try the command line below. Note that it trains only on (randomized versions of) high-speed envs (I haven't tried it yet on low-speed ones). You may try with fewer steps to get results faster (it will train slower than before, because of the new option empirical_base_rtt=True, that executes a "dummy" run before each episode to obtain the average link RTT):

python3 -m train.train \
mode=train \
test_after_train=false \
jobs@train_jobs=random_traces_3_5 \
jobs@eval_jobs=fixed_0_5 \
num_actors=20 \
num_actors_eval=2 \
train_job_ids="[0]" \
eval_job_ids="[3,4,5]" \
total_steps=13_000_000 \
hidden_size=256 \
unroll_length=8 \
inference_batch_size=1 \
seed=2 \
use_lstm=true \
use_reward=false \
shared_actor_critic=true \
activations='[tanh,tanh]' \
use_task_obs_in_actor=false \
use_task_obs_in_critic=true \
end_of_episode_bootstrap=true \
entropy_cost=5e-3 \
baseline_cost=0.5 \
discounting=0.95 \
reward_clipping=none \
reward_normalization=false \
reward_normalization_coeff=1e-4 \
reward_normalization_stats_per_job=false \
learning_rate=5e-4 \
alpha=0.99 \
momentum=1e-4 \
empirical_base_rtt=true \
cc_env_norm_bytes=1000000 \
cc_env_norm_ms=100 \
cc_env_history_size=4 \
cc_env_reward_delay_factor=1 \
cc_env_reward_formula=cwnd_tradeoff \
cc_env_reward_uplink_queue_max_fill_ratio=0.5 \
cc_env_bandwidth_min_window_duration_ms=100 \
cc_env_ack_delay_avg_coeff=0.1