THU-MIG / RepViT

RepViT: Revisiting Mobile CNN From ViT Perspective [CVPR 2024] and RepViT-SAM: Towards Real-Time Segmenting Anything
https://arxiv.org/abs/2307.09283
Apache License 2.0

Questions about training speed and accuracy #43

Closed: wforange closed this issue 4 months ago

wforange commented 4 months ago

Hi, thank you for your great work! I have two questions about RepViT-M0.9.

  1. What is your training speed? In my experience with a single GPU (NVIDIA A10), one epoch takes about 1 hour 40 minutes. How about yours?
  2. Using the default settings (except for changing the lr to 6e-3), my accuracy is about 6% lower. Any idea why? The relevant log entries:

{"train_lr": 0.002968938101879723, "train_loss": 3.9505264261167206, "test_loss": 2.1110704195408423, "test_acc1": 52.82000147460938, "test_acc5": 77.69200212890625, "epoch": 14, "n_parameters": 5489328}
{"train_lr": 0.0029639955590981108, "train_loss": 3.918420416702755, "test_loss": 2.0485606557540312, "test_acc1": 53.86400150390625, "test_acc5": 78.40400258789063, "epoch": 15, "n_parameters": 5489328}
{"train_lr": 0.0029586930309943257, "train_loss": 3.9049760365276502, "test_loss": 2.0295982697537838, "test_acc1": 54.110001552734374, "test_acc5": 78.68800255859375, "epoch": 16, "n_parameters": 5489328}
jameslahm commented 4 months ago

Thanks for your interest!

  1. We use 8 NVIDIA 3090 GPUs, and one epoch takes about 6.5 minutes (a rough step-count sketch follows below).
  2. Did you also change the batch size?
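
For context, the speed gap here is roughly what the step counts predict. A back-of-the-envelope sketch in Python, assuming ImageNet-1k (~1.28M training images) and a per-GPU batch size of 256 (both are assumptions, not stated in this thread):

```python
# Rough optimizer-steps-per-epoch estimate under the two setups.
# Assumes ImageNet-1k (1,281,167 training images) and per-GPU batch 256.
IMAGES = 1_281_167
PER_GPU_BATCH = 256

single_gpu_steps = IMAGES // PER_GPU_BATCH        # ~5004 steps/epoch on 1 A10
eight_gpu_steps = IMAGES // (PER_GPU_BATCH * 8)   # ~625 steps/epoch on 8x3090

print(single_gpu_steps, eight_gpu_steps)  # 5004 625
```

With 8x fewer steps per epoch, and a 3090 being faster than an A10, ~6.5 minutes versus ~1 hour 40 minutes per epoch is plausible.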
wforange commented 4 months ago

Thank you for your answer. I didn't modify the batch size; the default setting is 256. Could you also comment on the accuracy?

jameslahm commented 4 months ago

Why did you change the lr to 6e-3? What is the accuracy if you keep the lr at the default value?

wforange commented 4 months ago

With the default settings I get the log below; acc1 increases too slowly. Comparing with the log you provided, I guessed the lr should be 6e-3, but I still cannot reproduce your result.

  1. train_lr at the default value:

{"train_lr": 1.000000000000068e-06, "train_loss": 6.975921413690733, "test_loss": 6.928075437327378, "test_acc1": 0.14400001098632811, "test_acc5": 0.5980000273132324, "epoch": 0, "n_parameters": 5489328}
{"train_lr": 1.000000000000068e-06, "train_loss": 6.962617924649843, "test_loss": 6.91879555287252, "test_acc1": 0.1540000099182129, "test_acc5": 0.7040000312805176, "epoch": 1, "n_parameters": 5489328}
{"train_lr": 0.00010080000000000871, "train_loss": 6.845906076576117, "test_loss": 6.599413409487892, "test_acc1": 2.626000080413818, "test_acc5": 9.092000283966064, "epoch": 2, "n_parameters": 5489328}
{"train_lr": 0.00020059999999999599, "train_loss": 6.619759896676317, "test_loss": 5.937542263788122, "test_acc1": 7.484000213470459, "test_acc5": 21.026000695800782, "epoch": 3, "n_parameters": 5489328}
{"train_lr": 0.0003003999999999828, "train_loss": 6.369503548677019, "test_loss": 5.4622787956063075, "test_acc1": 12.124000384216309, "test_acc5": 30.72400082397461, "epoch": 4, "n_parameters": 5489328}
{"train_lr": 0.00040020000000002843, "train_loss": 6.231803081971374, "test_loss": 5.275543744327458, "test_acc1": 13.990000392456055, "test_acc5": 34.42800095703125, "epoch": 5, "n_parameters": 5489328}
{"train_lr": 0.0004996642360148386, "train_loss": 6.2141018290218595, "test_loss": 5.2779140254013415, "test_acc1": 14.554000370788573, "test_acc5": 34.336000876464844, "epoch": 6, "n_parameters": 5489328}
{"train_lr": 0.0004995165484649184, "train_loss": 6.166435195006532, "test_loss": 5.2440667662001745, "test_acc1": 15.882000408935546, "test_acc5": 35.872000935058594, "epoch": 7, "n_parameters": 5489328}
{"train_lr": 0.0004993420469200044, "train_loss": 6.121721277324607, "test_loss": 5.166219835062973, "test_acc1": 17.77600058227539, "test_acc5": 38.030001118164066, "epoch": 8, "n_parameters": 5489328}
{"train_lr": 0.0004991407505161498, "train_loss": 6.071527244566346, "test_loss": 5.061290801026439, "test_acc1": 19.700000533447266, "test_acc5": 39.95200100585937, "epoch": 9, "n_parameters": 5489328}
{"train_lr": 0.0004989126813277368, "train_loss": 5.998429657076951, "test_loss": 4.927277659641877, "test_acc1": 21.396000593261718, "test_acc5": 41.930001088867186, "epoch": 10, "n_parameters": 5489328}
{"train_lr": 0.0004986578643652291, "train_loss": 5.921012676686501, "test_loss": 4.805169545967161, "test_acc1": 22.998000694580078, "test_acc5": 43.72200132080078, "epoch": 11, "n_parameters": 5489328}
{"train_lr": 0.0004983763275721029, "train_loss": 5.853875551006491, "test_loss": 4.682203596784868, "test_acc1": 24.362000673217775, "test_acc5": 45.34600124023437, "epoch": 12, "n_parameters": 5489328}
{"train_lr": 0.0004980681018220224, "train_loss": 5.783869908391524, "test_loss": 4.5477353980523025, "test_acc1": 26.17600067504883, "test_acc5": 47.71600116210937, "epoch": 13, "n_parameters": 5489328}
{"train_lr": 0.0004977332209154644, "train_loss": 5.721688020238869, "test_loss": 4.486863227290962, "test_acc1": 27.65400068847656, "test_acc5": 48.98200126953125, "epoch": 14, "n_parameters": 5489328}
{"train_lr": 0.0004973717215759342, "train_loss": 5.658488622672266, "test_loss": 4.34144630595928, "test_acc1": 28.79400069213867, "test_acc5": 51.46200160644531, "epoch": 15, "n_parameters": 5489328}
{"train_lr": 0.0004969836434458476, "train_loss": 5.614673731805419, "test_loss": 4.251058745930213, "test_acc1": 30.60800080810547, "test_acc5": 52.408001640625, "epoch": 16, "n_parameters": 5489328}
{"train_lr": 0.0004965690290822709, "train_loss": 5.553136842904522, "test_loss": 4.1680190090004725, "test_acc1": 31.874000888671876, "test_acc5": 54.086001625976564, "epoch": 17, "n_parameters": 5489328}
{"train_lr": 0.0004961279239524144, "train_loss": 5.5050668902248505, "test_loss": 4.131608025718282, "test_acc1": 32.57400111328125, "test_acc5": 54.0420015234375, "epoch": 18, "n_parameters": 5489328}
{"train_lr": 0.0004956603764285549, "train_loss": 5.464044562441935, "test_loss": 3.9814893926372963, "test_acc1": 34.338000909423826, "test_acc5": 56.23800159179687, "epoch": 19, "n_parameters": 5489328}
{"train_lr": 0.0004951664377823055, "train_loss": 5.417946175991489, "test_loss": 3.9336773275419046, "test_acc1": 34.86800102783203, "test_acc5": 57.3200017578125, "epoch": 20, "n_parameters": 5489328}
{"train_lr": 0.0004946461621798025, "train_loss": 5.378094382518582, "test_loss": 3.869185467712752, "test_acc1": 36.356000913085936, "test_acc5": 58.55800174804688, "epoch": 21, "n_parameters": 5489328}

  2. train_lr set to 6e-3 (in this log the accuracy is lower, but the "train_lr" values look close to yours):

{"train_lr": 1.000000000000068e-06, "train_loss": 6.975648766608356, "test_loss": 6.928090586917091, "test_acc1": 0.13200000961303712, "test_acc5": 0.5940000251770019, "epoch": 0, "n_parameters": 5489328}
{"train_lr": 1.000000000000068e-06, "train_loss": 6.962368514707429, "test_loss": 6.918480509110079, "test_acc1": 0.15000001007080077, "test_acc5": 0.6980000308227539, "epoch": 1, "n_parameters": 5489328}
{"train_lr": 0.0006007999999999656, "train_loss": 6.620017554834306, "test_loss": 5.765148104602144, "test_acc1": 6.708000193786621, "test_acc5": 19.28400057739258, "epoch": 2, "n_parameters": 5489328}
{"train_lr": 0.001200599999999996, "train_loss": 6.381126131371056, "test_loss": 5.265676980710212, "test_acc1": 10.510000318145751, "test_acc5": 26.136000775146485, "epoch": 3, "n_parameters": 5489328}
{"train_lr": 0.0018003999999998855, "train_loss": 6.080490042170365, "test_loss": 4.298012950038182, "test_acc1": 19.412000533447266, "test_acc5": 42.00200125, "epoch": 4, "n_parameters": 5489328}
{"train_lr": 0.002400200000000178, "train_loss": 5.456123369774944, "test_loss": 3.406271823489939, "test_acc1": 30.55600078125, "test_acc5": 56.50800171875, "epoch": 5, "n_parameters": 5489328}
{"train_lr": 0.002995391413930926, "train_loss": 4.966046765267992, "test_loss": 2.9429522397863956, "test_acc1": 37.300001052246095, "test_acc5": 63.85200161132813, "epoch": 6, "n_parameters": 5489328}
{"train_lr": 0.0029933651370815054, "train_loss": 4.572101024510287, "test_loss": 2.654454188492462, "test_acc1": 42.448001533203126, "test_acc5": 68.69600208984374, "epoch": 7, "n_parameters": 5489328}
{"train_lr": 0.0029909716284052725, "train_loss": 4.368374443740296, "test_loss": 2.472962500484845, "test_acc1": 45.830001333007814, "test_acc5": 71.76800223632813, "epoch": 8, "n_parameters": 5489328}
{"train_lr": 0.0029882114784651735, "train_loss": 4.25342661752213, "test_loss": 2.341590005477876, "test_acc1": 48.44200149902344, "test_acc5": 73.8480021875, "epoch": 9, "n_parameters": 5489328}
{"train_lr": 0.0029850853682860134, "train_loss": 4.151384915474603, "test_loss": 2.268401017170826, "test_acc1": 50.06200151855469, "test_acc5": 75.13000250976563, "epoch": 10, "n_parameters": 5489328}
{"train_lr": 0.002981594069189961, "train_loss": 4.077764600825062, "test_loss": 2.2367773802225828, "test_acc1": 50.4900013671875, "test_acc5": 75.66000243164062, "epoch": 11, "n_parameters": 5489328}
{"train_lr": 0.002977738442601217, "train_loss": 4.029604343916301, "test_loss": 2.1607110409336237, "test_acc1": 51.660001689453125, "test_acc5": 77.13800265625, "epoch": 12, "n_parameters": 5489328}
{"train_lr": 0.0029735194398394713, "train_loss": 3.9884388382486304, "test_loss": 2.132885098912334, "test_acc1": 52.4360014453125, "test_acc5": 77.34800232421875, "epoch": 13, "n_parameters": 5489328}
{"train_lr": 0.002968938101879723, "train_loss": 3.9505264261167206, "test_loss": 2.1110704195408423, "test_acc1": 52.82000147460938, "test_acc5": 77.69200212890625, "epoch": 14, "n_parameters": 5489328}
{"train_lr": 0.0029639955590981108, "train_loss": 3.918420416702755, "test_loss": 2.0485606557540312, "test_acc1": 53.86400150390625, "test_acc5": 78.40400258789063, "epoch": 15, "n_parameters": 5489328}
{"train_lr": 0.0029586930309943257, "train_loss": 3.9049760365276502, "test_loss": 2.0295982697537838, "test_acc1": 54.110001552734374, "test_acc5": 78.68800255859375, "epoch": 16, "n_parameters": 5489328}
{"train_lr": 0.00295303182588744, "train_loss": 3.8662152901637277, "test_loss": 2.0141297328563135, "test_acc1": 54.602001484375, "test_acc5": 78.94600254882812, "epoch": 17, "n_parameters": 5489328}
jameslahm commented 4 months ago

Did you only train the model for 22 and 18 epochs, respectively?

wforange commented 4 months ago

Yes, the training logs are generated epoch by epoch. Because training takes about 1.5 hours per epoch in my environment, I only trained for a few epochs.

jameslahm commented 4 months ago

So the lower accuracy is expected, given that fewer epochs were trained.

wforange commented 4 months ago

No, I mean that if I compare the accuracy at each epoch, I should get the same result, right? For example, comparing epoch 15, I see a clear gap, and I don't think later training can compensate for it.

mine:
{"train_lr": 0.0029639955590981108, "train_loss": 3.918420416702755, "test_loss": 2.0485606557540312, "test_acc1": 53.86400150390625, "test_acc5": 78.40400258789063, "epoch": 15, "n_parameters": 5489328}

yours:
{"train_lr": 0.002983962137779637, "train_loss": 3.6223566518556014, "test_loss": 1.7595897195014087, "test_acc1": 59.484000420837404, "test_acc5": 82.38600054382324, "epoch": 15, "n_parameters": 5489328}

jameslahm commented 4 months ago

Thanks. The reason may lie in the smaller total batch size. We suggest reproducing the results with the default configuration.
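
A note for readers hitting the same mismatch: DeiT-style training code, which RepViT's pipeline descends from, typically scales the configured lr linearly with the total batch size before training starts. A minimal sketch, assuming the common DeiT divisor of 512 (an assumption here; verify against the repo's main.py):

```python
# Hypothetical DeiT-style linear lr scaling (the divisor 512 is assumed,
# not confirmed from the RepViT code).
def effective_lr(base_lr: float, per_gpu_batch: int, world_size: int) -> float:
    total_batch = per_gpu_batch * world_size
    return base_lr * total_batch / 512.0

# wforange's run: 1 GPU, batch 256, base lr 6e-3 -> effective 3e-3,
# matching the logged "train_lr" of ~0.00296 after warmup.
print(effective_lr(6e-3, 256, 1))  # 0.003
```

Under such a rule, two runs can log similar "train_lr" values while optimizing with very different total batch sizes (256 versus 8*256 = 2048), which is the configuration difference being pointed to here.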

wforange commented 4 months ago

Thank you for your answer. So you think that 8 GPUs with a total batch size of 8*256 speeds up the convergence of accuracy?

jameslahm commented 4 months ago

Yes.
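
If only one GPU is available, gradient accumulation is a generic way to emulate the larger total batch. A minimal PyTorch sketch, not taken from the RepViT codebase, and not exactly equivalent (BatchNorm statistics are still computed over the small per-step batch):

```python
import torch
from torch import nn

def train_one_epoch_accumulated(model: nn.Module, loader, criterion,
                                optimizer: torch.optim.Optimizer,
                                accum_steps: int = 8) -> None:
    """Emulate a total batch of accum_steps * per-step batch (e.g. 8 * 256 = 2048)."""
    model.train()
    optimizer.zero_grad()
    for i, (images, targets) in enumerate(loader):
        loss = criterion(model(images), targets) / accum_steps  # average over the virtual batch
        loss.backward()  # gradients accumulate until optimizer.step()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```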

wforange commented 4 months ago

Got it, thank you.