❓ Questions and Help
I am training the movie_mcan model on the VQA2 dataset with two GPUs. GPU memory is limited, so I set `batch_size` to 16. Every time I start training, the whole server becomes abnormally laggy, and while it lags I cannot do anything else on the machine. I checked disk activity with `sar -d 3 5` and found that the disk read rate is very high. How can I fix this?

This is my training command:

`CUDA_VISIBLE_DEVICES=2,3 mmf_run config=projects/movie_mcan/configsqa2/defaults.yaml model=movie_mcan dataset=vqa2 run_type=train env.cache_dir=/data/students/zzj/ env.data_dir=/data/students/zzj/ training.batch_size=16`
Here are the disk read and write rates reported by `sar`:
![51AFL@1@S9U~RZH48G5P7AD](https://user-images.githubusercontent.com/56525319/196317109-a66d65ca-eecc-4256-be5f-0f70e711b8ea.png)
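In case a text log is easier to share than a screenshot, the same kind of measurement could be sampled from a script. The following is only a minimal sketch, assuming a Linux server where `/proc/diskstats` is available. The function names (`parse_sectors_read`, `read_rate_mib_s`, `sample_read_rates`) are hypothetical helpers made up for illustration and are not part of MMF or `sar`.

```python
import time

# /proc/diskstats reports sectors in fixed 512-byte units.
SECTOR_BYTES = 512


def parse_sectors_read(diskstats_text):
    """Map each device name to its cumulative count of sectors read."""
    sectors = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # fields: major, minor, name, reads completed, reads merged, sectors read, ...
        sectors[fields[2]] = int(fields[5])
    return sectors


def read_rate_mib_s(before, after, interval_s):
    """Convert two cumulative snapshots into a read rate in MiB/s per device."""
    rates = {}
    for dev, count in after.items():
        if dev in before:
            delta_bytes = (count - before[dev]) * SECTOR_BYTES
            rates[dev] = delta_bytes / interval_s / 2**20
    return rates


def sample_read_rates(interval_s=3.0, path="/proc/diskstats"):
    """Take two snapshots interval_s seconds apart and return the read rates."""
    with open(path) as f:
        before = parse_sectors_read(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_sectors_read(f.read())
    return read_rate_mib_s(before, after, interval_s)
```

Calling `sample_read_rates()` in a second terminal while training is running returns a dictionary of per-device read rates, which can be pasted into the issue as text.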