NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

Training time taking too long #296

Open ArghyaChatterjee opened 1 year ago

ArghyaChatterjee commented 1 year ago

Hi,

Training is taking too long. I have a 40k-image annotated dataset of an iron rod created using NViSII. With 64 GB of RAM and a single Nvidia RTX 3060 6GB GPU, it took around 6 hours to complete 2 epochs of training. At that rate, 60 epochs for this single object will take quite a long time. Can that time be reduced?

I am using this script for training, inside the train2 folder:

python3 -m torch.distributed.launch --nproc_per_node=1 train.py --network dope --epochs 2 --batchsize 10 --outf tmp/ --data ../nvisii_data_gen/output/output_example/
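For a rough sense of the total wall-clock cost, the numbers in the issue (6 hours for 2 epochs, 60 epochs planned) extrapolate linearly. A quick back-of-the-envelope sketch, assuming per-epoch time stays constant:

```python
# Back-of-the-envelope training-time estimate from the figures in the issue:
# 2 epochs took ~6 hours, and the full run is 60 epochs.
measured_epochs = 2
measured_hours = 6.0
target_epochs = 60

hours_per_epoch = measured_hours / measured_epochs   # 3.0 h/epoch
total_hours = hours_per_epoch * target_epochs        # 180 h
print(f"~{total_hours:.0f} hours (~{total_hours / 24:.1f} days) for {target_epochs} epochs")
# → ~180 hours (~7.5 days) for 60 epochs
```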
TontonTremblay commented 1 year ago

Yeah, I am sorry, this network is pretty heavy to train since it has 6 sets of conv layers. I had plans to make a lighter version but never got around to doing it. The 3060 is a pretty small GPU in the scheme of all the GPUs available, but the bigger ones are pricey. You could try Google Colab?

ArghyaChatterjee commented 1 year ago
  1. Is that related to the learning rate for the Adam optimizer? The default is 0.0001. Should I increase it to 0.001 for faster convergence?
  2. I can see a total of 5 different networks present. Are those there for testing purposes against dope? I mean, how efficient is dope at detection compared to the other architectures (e.g. resnet, full, etc.)?
  3. Also, will increasing the batch size (say to 20) or decreasing it (say to 5) affect the training time?
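On question 3: batch size changes the number of optimizer steps per epoch, but every image is still processed once per epoch, so total epoch time mostly depends on GPU throughput rather than on batch size alone. A minimal sketch (the 40k dataset size comes from the issue; the rest is generic arithmetic, not DOPE-specific code):

```python
import math

def steps_per_epoch(dataset_size, batch_size):
    """Number of optimizer steps needed to see every sample once."""
    return math.ceil(dataset_size / batch_size)

dataset_size = 40_000  # the NViSII iron-rod dataset from the issue
for bs in (5, 10, 20):
    print(bs, steps_per_epoch(dataset_size, bs))
# Larger batches mean fewer steps, but each step processes more images,
# so the total images per epoch (and roughly the total time) is unchanged
# unless the GPU was underutilized at the smaller batch size.
```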
if opt.network == 'resnetsimple':
    net = ResnetSimple()
    output_size = 208

elif opt.network == 'dope':
    net = DopeNetwork()
    output_size = 50
    opt.sigma = 0.5

elif opt.network == 'full':
    output_size = 400
    opt.sigma = 2
    net = DreamHourglassMultiStage(
        9,
        n_stages = 2,
        internalize_spatial_softmax = False,
        deconv_decoder = False,
        full_output = True)

elif opt.network == 'mobile':
    output_size = 50
    opt.sigma = 0.5
    net = DopeMobileNet()

elif opt.network == 'boundary':

    # if not opt.net_dope is None:
    #     net_dope = DopeNetwork()
    #     net_dope = torch.nn.DataParallel(net_dope).cuda()
    #     tmp = torch.load(opt.net_dope)
    #     net_dope.load_state_dict(tmp)

    net = BoundaryAwareNet(opt.net_dope)
    output_size = 50
    opt.sigma = 1

else:
    print(f'network {opt.network} does not exist')
    quit()
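The chain of elif branches above can equivalently be written as a table-driven lookup, which keeps each network's settings in one place. A sketch with placeholder factories standing in for the real constructors (DopeNetwork, DopeMobileNet, etc. live in the DOPE repo and are not importable here):

```python
# Table-driven version of the network dispatch above. The lambdas return
# strings as stand-ins for the real network constructors from the repo.
NETWORKS = {
    'resnetsimple': {'factory': lambda: 'ResnetSimple()',              'output_size': 208, 'sigma': None},
    'dope':         {'factory': lambda: 'DopeNetwork()',               'output_size': 50,  'sigma': 0.5},
    'full':         {'factory': lambda: 'DreamHourglassMultiStage(...)', 'output_size': 400, 'sigma': 2},
    'mobile':       {'factory': lambda: 'DopeMobileNet()',             'output_size': 50,  'sigma': 0.5},
    'boundary':     {'factory': lambda: 'BoundaryAwareNet(...)',       'output_size': 50,  'sigma': 1},
}

def build(name):
    try:
        cfg = NETWORKS[name]
    except KeyError:
        raise SystemExit(f'network {name} does not exist')
    return cfg['factory'](), cfg['output_size'], cfg['sigma']

net, output_size, sigma = build('mobile')
print(net, output_size, sigma)
# → DopeMobileNet() 50 0.5
```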
TontonTremblay commented 1 year ago

Good eye, you should try mobile; I don't remember how it performed.

You should let me know how it performs :D if you try it.

You are right regarding the lr. You should try a higher value. Each epoch will still take the same amount of time to go through the data, but you might reach usable results in fewer epochs.