Questions about training time

Henry1iu / TNT-Trajectory-Prediction

A Pytorch Implementation of TNT: Target-driveN Trajectory Prediction

Questions about training time #37

Closed. merrye closed this issue 1 year ago

merrye commented 1 year ago

Hi, thanks for sharing your great work. I split the Argoverse training set into ten smaller parts. One epoch takes about 2 hours, and I estimate that training on the complete dataset will take about 83 days. How long does training take for you? Could you also tell me what hardware you use? My command and hardware resources are attached below. (Screenshot: 1666350188215)

python -m torch.distributed.launch --nproc_per_node=2 train_net.py -d dataset/interm_data -o run/net/ -a -b 8 -c -m --lr 0.0012 -luf 10 -ldr 0.3 -e 100 -w 40
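The 83-day figure is not derived in the post. One reading that reproduces it, assuming each of the ten parts takes about 2 hours per epoch and the run lasts the 100 epochs set by `-e 100`, is sketched below. The variable names are illustrative only.

```python
# Assumptions (not confirmed in the thread): each of the ten parts takes
# about 2 hours per epoch, and training runs for the 100 epochs set by `-e 100`.
num_parts = 10
hours_per_part_epoch = 2.0
num_epochs = 100

# Total wall-clock time if every part is trained for every epoch.
total_hours = num_parts * hours_per_part_epoch * num_epochs
total_days = total_hours / 24

print(f"{total_hours:.0f} hours ~ {total_days:.1f} days")
```

Under these assumptions the total is 2000 hours, which rounds to the 83 days quoted above.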

Henry1iu commented 1 year ago

Hi,

Thank you for your appreciation. Will the different splits be loaded during the training?

If you are using the "ArgoverseInDisk" data loader, training will take much longer than with the "ArgoverseInMem" data loader. Actually, I never finished a training run with the "ArgoverseInDisk" data loader. T_T
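To illustrate the general trade-off between the two loading styles, here is a minimal sketch using only the standard library. `LazyDiskDataset`, `InMemoryDataset`, and `write_toy_samples` are hypothetical names invented for this example. They are not the repository's "ArgoverseInDisk" or "ArgoverseInMem" classes, whose internals are not shown in this thread. Both classes expose `__len__` and `__getitem__`, the same interface a map-style PyTorch dataset uses.

```python
import os
import pickle
import tempfile


class LazyDiskDataset:
    """Hypothetical: reads one sample file from disk on every access."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Disk I/O and unpickling happen on each lookup, every epoch.
        with open(self.paths[idx], "rb") as f:
            return pickle.load(f)


class InMemoryDataset:
    """Hypothetical: reads every sample once up front and keeps it in RAM."""

    def __init__(self, paths):
        self.samples = []
        for path in paths:
            with open(path, "rb") as f:
                self.samples.append(pickle.load(f))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Plain list lookup; no disk access after construction.
        return self.samples[idx]


def write_toy_samples(directory, count):
    """Write `count` small pickled samples and return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"sample_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump({"id": i, "traj": [float(i), float(i) + 1.0]}, f)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    toy_paths = write_toy_samples(tmp, 4)
    lazy = LazyDiskDataset(toy_paths)
    eager = InMemoryDataset(toy_paths)
    lazy_first = lazy[0]  # the file is read here
    same = all(lazy[i] == eager[i] for i in range(len(lazy)))

eager_last = eager[3]  # still available after the files are deleted
print(same, lazy_first["id"], eager_last["id"])
```

The in-memory variant pays the read cost once but needs enough RAM (or swap) to hold the whole dataset, which is the constraint discussed in the following comments.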

In my case, my desktop has an Intel 10700K CPU and two Nvidia RTX 2080 GPUs. Each training epoch takes about 20 minutes. Also, all the data is stored on an M.2 SSD.

I'm afraid this is already the fastest training speed I have been able to achieve.

Henry1iu commented 1 year ago

Increasing the swap size and loading all the data at once with the "ArgoverseInMem" data loader will accelerate your training, I assure you.

merrye commented 1 year ago

I tried to load all the data at once using the "ArgoverseInMem" data loader, but it failed (I think it ran out of memory), so I had to split the dataset. I'm now using the "ArgoverseInMem" data loader on the splits, but an epoch still takes about 2 hours, and I'm confused.

Henry1iu commented 1 year ago

Could you please explain how you split the dataset? Also, will the different splits all be loaded during training?
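For context on why this question matters: if only one split is ever loaded, the model sees a tenth of the data. A minimal sketch of one possible way to rotate through splits while keeping a single split in memory at a time is shown below. `load_split` and `iterate_epoch` are hypothetical helpers with toy integer samples. This is not how `train_net.py` handles data, and the thread does not say how the issue was eventually resolved.

```python
import random


def load_split(split_id, samples_per_split=3):
    """Hypothetical stand-in for reading one split into memory."""
    start = split_id * samples_per_split
    return list(range(start, start + samples_per_split))


def iterate_epoch(num_splits, rng):
    """Yield every sample once, holding only one split in memory at a time."""
    order = list(range(num_splits))
    rng.shuffle(order)  # visit the splits in a new order each epoch
    for split_id in order:
        samples = load_split(split_id)
        rng.shuffle(samples)  # shuffle within the resident split
        for sample in samples:
            yield sample
        del samples  # drop the split before loading the next one


rng = random.Random(0)
seen_per_epoch = [sorted(iterate_epoch(num_splits=10, rng=rng)) for _ in range(2)]
print(len(seen_per_epoch[0]))
```

Shuffling only within the resident split is weaker than a global shuffle, but it bounds peak memory to one split, which is the trade-off when the full dataset does not fit in RAM.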

merrye commented 1 year ago

Thanks for your support. I have solved it.