arplaboratory / learning-to-fly

Training transferable end-to-end quadrotor control policies on a laptop in 18 seconds.
MIT License

Discrepancies in Training Time: Longer Than Initially Reported #2

Closed: jc-bao closed this issue 8 months ago

jc-bao commented 9 months ago

I've successfully installed the repository as per the provided instructions with Intel MKL. However, upon initiating training with ./build/src/training_headless, the execution time is noticeably longer than what was reported in the original paper. I am training the task on Ubuntu with an Intel i9-13900KF, which should be faster than the laptop used in the paper.

In my most recent run, the reported training time was approximately 60 seconds, which doesn't align with the benchmark results.

To give you an idea, here's a snippet of the resulting logs:

Step: 300000 (mean return: -91.2982, mean episode length: 459.899)
Training took: 59.8677s

Could anyone kindly help me understand whether I am missing something here or not using the correct configuration? I am interested in accurately replicating the paper's results at the reported speed.

Thanks in advance for any guidance!
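
For reference, a minimal build-and-run sequence along these lines (a sketch assuming the repository's standard CMake layout; the configure flags for enabling MKL are whatever the project README specifies and are not reproduced here):

# configure and build (MKL-related flags per the repository README)
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .
cd ..
# run headless training; this is the binary referenced above
./build/src/training_headless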

jonas-eschmann commented 9 months ago

Hi @jc-bao thanks a lot for reaching out! The stated training speed is attained using Apple Silicon laptops. On the M1 it takes 18s (I also recently tested it on an M3, and there it is around 15s). On Intel hardware (even using MKL) it is slower, probably due to the lack of a dedicated matrix coprocessor (cf. the enumeration in the README of https://github.com/corsix/amx). Intel "only" has AVX, which is the SIMD equivalent of ARM's NEON.

On an i9-10885H, when running the training_benchmark target (which is the one used for measuring the time), I'm getting 56.561s.
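
To reproduce that number, something along these lines should work (a sketch assuming the build tree from above; the training_benchmark target name and binary path are taken from this thread):

# build and run the benchmark target used for timing
cmake --build build --target training_benchmark
./build/src/training_benchmark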

Fun fact: When running docker run -it --rm -p 8100:8000 --entrypoint /bin/bash arpllab/learning_to_fly -c "/build/src/training_benchmark" on an M3 I get a competitive 80.9878s even though it is using emulation to run the x86 binaries inside the docker container.

I have experimented around a fair amount but that is the maximum performance I could get out of the Intel i9 for now.

jc-bao commented 9 months ago

Thank you for your prompt response! Your clarification makes sense. Given that the codebase already incorporates CUDA, would you be open to adding CUDA support for training as well? This would greatly improve the efficiency of the codebase.

jc-bao commented 9 months ago

I'm also wondering whether it is possible to flash the firmware and use cfclient directly on a Mac, or whether a separate computer is required for these tasks. I'd love some guidance on this. Thanks!

jonas-eschmann commented 9 months ago

In general, RLtools supports CUDA well (cf. e.g. this example). We did benchmarks with our simulator implemented in CUDA and found it to be extremely fast. But off-policy RL (we're using TD3) is not necessarily simulation-speed bound, so overall we didn't find the training process to benefit much from using CUDA. Also, the CUDA kernels for training are probably not as well tuned as the ones for our simulator. So, long story short: tuning the CUDA kernels for training and squeezing out more performance is on the todo list, but since it is already so fast on MacBooks (which are arguably more widespread than PCs with CUDA GPUs), it is not the #1 priority.

Yes, you can definitely do everything on macOS as well (we did approximately 50/50 of the development on Ubuntu and macOS). cfclient works out of the box IIRC, and for compiling and flashing the firmware they have instructions here.
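
For completeness, a sketch of the usual macOS client setup (assuming a Python 3 environment; the exact entry point may vary by version, and flashing follows the Bitcraze instructions referenced above):

# install and launch the Crazyflie client
pip3 install cfclient
cfclient
# flashing is typically done from the crazyflie-firmware repository, e.g. via its make cload target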

jc-bao commented 9 months ago

Great to hear your feedback! I completely understand the challenges of fine-tuning CUDA kernels. However, training a neural network on a CPU does feel less intuitive, especially considering the potential computational constraints. Is it because the model and data used in Crazyflie training are relatively small? I also noticed low CPU utilization on my Intel machine, and even the single-core frequency doesn't boost. It might be a driver issue, though.

Regarding Crazyflie development on a Mac, thanks for confirming the platform! I am also wondering whether the Crazyradio works correctly on an Apple Silicon Mac? Thank you!

jc-bao commented 9 months ago

I have also tried training on my M1 Pro MacBook, and it does reach the claimed speed.

Training took: 19.8022s

jonas-eschmann commented 8 months ago

Great to hear! The M1 really is a beautiful machine 🙂 Not that it makes a big difference, but if you close all other processes and use sudo nice -n -20 to elevate the priority, you should be able to get it down to 18s.
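
Concretely, that would look something like this (a sketch; the binary path assumes the training_benchmark target mentioned earlier in the thread):

# run with elevated scheduling priority, all other applications closed
sudo nice -n -20 ./build/src/training_benchmark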

jc-bao commented 8 months ago

Yeah, I do run it with the nice command. Since everything is working now, I believe this issue has been addressed. Thank you!