Problem with reproduce - Githubissues

Zopek commented 1 month ago

Hi,

Great work!

I've been trying to reproduce the result of PQCL, but seems that there was something wrong. The linear evaluation dropped a lot, as well as video segmentation on DAVIS. I simply run the command you provided on the ReadMe page, except that I run it for 300 epochs (which is 100 in ReadMe) using 4 H800, with 256 per GPU batch size. Can you provide your training log? Or any suggestions?

Thank you!

Sherrylone commented 1 month ago

Thank you for your attention to our work. I have been leaving the company for a long time and cannot receive any training logs anymore. Could you please provide the linear probing logs?

Zopek commented 3 weeks ago

This is the script I used during training and there is no modification on main_pos_bot.py `work_path=$(dirname $0) cuda=${1} nproc=${2} port=${3}

export NCCL_P2P_DISABLE=1

CUDA_VISIBLE_DEVICES=${cuda} torchrun --nnodes=1 --nproc_per_node=${nproc} \ --rdzv_endpoint=localhost:${port} \ main_pos_bot.py \ --arch pos_small \ --output_dir ${work_path} \ --local_crops_number 1 \ --local_crops_scale 0.05 0.25 \ --global_crops_scale 0.25 1 \ --pred_ratio 0 0.3 \ --norm_last_layer false \ --shared_head true \ --pred_ratio_var 0 0.2 \ --lambda3 1.0 \ --batch_size_per_gpu 256 \ --lambda2 1.0 \ --epochs 300 \ --warmup_teacher_temp_epochs 30 \ --teacher_query_temp 0.07 \ --teacher_temp 0.07 \ --local_crops_size 96 `

And this is the training log of the last several epochs {"train_loss": 13.93879449110237, "train_cls": 4.12317182072442, "train_patch": 2.688462989888698, "train_query": 7.127159676010565, "train_lr": 6.293658838036681e-06, "train_wd": 0.3991091093675686, "train_acc": 0.2497845473621103, "epoch": 290} {"train_loss": 13.936377887245563, "train_cls": 4.122193886317032, "train_patch": 2.689692791846159, "train_query": 7.12449122532952, "train_lr": 5.2396220615144845e-06, "train_wd": 0.39928650559223694, "train_acc": 0.2507517423561151, "epoch": 291} {"train_loss": 13.931289474264705, "train_cls": 4.119701529054238, "train_patch": 2.689206654457547, "train_query": 7.122381287227146, "train_lr": 4.302383503102697e-06, "train_wd": 0.39944424103118964, "train_acc": 0.2507384717226219, "epoch": 292} {"train_loss": 13.917234142144903, "train_cls": 4.1177092250779, "train_patch": 2.6776100828779117, "train_query": 7.121914837095472, "train_lr": 3.482053151901624e-06, "train_wd": 0.39958229838696196, "train_acc": 0.2512037245203837, "epoch": 293} {"train_loss": 13.923193702118384, "train_cls": 4.118192544372247, "train_patch": 2.6845185687120776, "train_query": 7.120482596466772, "train_lr": 2.778727277315148e-06, "train_wd": 0.39970066252000747, "train_acc": 0.2514855303257394, "epoch": 294} {"train_loss": 13.921008727723937, "train_cls": 4.116335338349346, "train_patch": 2.685277534569863, "train_query": 7.1193958594263504, "train_lr": 2.192488417753049e-06, "train_wd": 0.3997993204503649, "train_acc": 0.25182197991606714, "epoch": 295} {"train_loss": 13.918440585323184, "train_cls": 4.114713040425433, "train_patch": 2.6841392959241004, "train_query": 7.1195882481636765, "train_lr": 1.7234053709447292e-06, "train_wd": 0.39987826135908094, "train_acc": 0.2522700589528377, "epoch": 296} {"train_loss": 13.917914131562487, "train_cls": 4.115429038433529, "train_patch": 2.6844180872400316, "train_query": 7.11806700974822, "train_lr": 1.3715331858655515e-06, "train_wd": 0.3999374765893954, "train_acc": 0.25211549510391684, "epoch": 297} {"train_loss": 13.915188806329509, "train_cls": 4.114010483717366, "train_patch": 2.68220154754073, "train_query": 7.118976771688575, "train_lr": 1.1369131562765518e-06, "train_wd": 0.39997695964769064, "train_acc": 0.25166585481614706, "epoch": 298} {"train_loss": 13.926348005839102, "train_cls": 4.114735400791076, "train_patch": 2.692723241927241, "train_query": 7.118889363644887, "train_lr": 1.0195728158784548e-06, "train_wd": 0.3999967062042038, "train_acc": 0.2517119117206235, "epoch": 299}

And this is the linear probling logs {"train_lr": 0.0019999999999998686, "train_loss": 3.45624516339384, "epoch": 0, "test_loss": 2.6753284364100307, "test_acc1": 41.664, "test_acc5": 68.332} {"train_lr": 0.0019995065603657376, "train_loss": 2.7223180120128467, "epoch": 1, "test_loss": 2.4534510669805814, "test_acc1": 45.474, "test_acc5": 71.508} {"train_lr": 0.0019980267284282105, "train_loss": 2.5833351559344644, "epoch": 2, "test_loss": 2.3569963937220364, "test_acc1": 47.08, "test_acc5": 72.83} {"train_lr": 0.001995561964603092, "train_loss": 2.511002474958974, "epoch": 3, "test_loss": 2.294022228559265, "test_acc1": 48.248, "test_acc5": 73.836} {"train_lr": 0.0019921147013145773, "train_loss": 2.463838992419359, "epoch": 4, "test_loss": 2.250250446979347, "test_acc1": 49.238, "test_acc5": 74.424} {"train_lr": 0.0019876883405950175, "train_loss": 2.42648845765765, "epoch": 5, "test_loss": 2.2187873759233128, "test_acc1": 49.832, "test_acc5": 74.94} {"train_lr": 0.0019822872507288198, "train_loss": 2.398190236536922, "epoch": 6, "test_loss": 2.193830579443051, "test_acc1": 50.242, "test_acc5": 75.34} {"train_lr": 0.0019759167619387524, "train_loss": 2.378181972934206, "epoch": 7, "test_loss": 2.1745708378989375, "test_acc1": 50.642, "test_acc5": 75.598} {"train_lr": 0.001968583161128624, "train_loss": 2.3575687844349393, "epoch": 8, "test_loss": 2.15258232315483, "test_acc1": 51.0, "test_acc5": 75.98} {"train_lr": 0.001960293685677003, "train_loss": 2.342759156650988, "epoch": 9, "test_loss": 2.140194590744155, "test_acc1": 51.336, "test_acc5": 75.962} {"train_lr": 0.0019510565162951365, "train_loss": 2.326756249081836, "epoch": 10, "test_loss": 2.130149286268922, "test_acc1": 51.442, "test_acc5": 76.208} {"train_lr": 0.0019408807689541316, "train_loss": 2.31476406082124, "epoch": 11, "test_loss": 2.115688769408809, "test_acc1": 51.762, "test_acc5": 76.496} {"train_lr": 0.0019297764858882515, "train_loss": 2.303858286742444, "epoch": 12, "test_loss": 2.1036707774147656, "test_acc1": 52.07, "test_acc5": 76.692} {"train_lr": 0.0019177546256839834, "train_loss": 2.29136329199903, "epoch": 13, "test_loss": 2.095672262751538, "test_acc1": 52.272, "test_acc5": 76.706} {"train_lr": 0.0018910065241883177, "train_loss": 2.2744961972361413, "epoch": 15, "test_loss": 2.0770511125664575, "test_acc1": 52.506, "test_acc5": 77.002} {"train_lr": 0.0018763066800438779, "train_loss": 2.26690293639553, "epoch": 16, "test_loss": 2.0735637121798134, "test_acc1": 52.506, "test_acc5": 77.092} {"train_lr": 0.0018607420270040137, "train_loss": 2.259162144950519, "epoch": 17, "test_loss": 2.066250185832343, "test_acc1": 52.794, "test_acc5": 77.128} {"train_lr": 0.001844327925502041, "train_loss": 2.249674050892728, "epoch": 18, "test_loss": 2.059746475856932, "test_acc1": 52.836, "test_acc5": 77.272} {"train_lr": 0.0018270805742745338, "train_loss": 2.2475890826212135, "epoch": 19, "test_loss": 2.0561059286527317, "test_acc1": 52.878, "test_acc5": 77.396} {"train_lr": 0.0018090169943749148, "train_loss": 2.2405754614366895, "epoch": 20, "test_loss": 2.0517443028252447, "test_acc1": 53.008, "test_acc5": 77.43} {"train_lr": 0.001790155012375684, "train_loss": 2.233257055306406, "epoch": 21, "test_loss": 2.0443843816552323, "test_acc1": 53.118, "test_acc5": 77.56} {"train_lr": 0.00177051324277586, "train_loss": 2.2291981870895663, "epoch": 22, "test_loss": 2.0396281171332844, "test_acc1": 53.326, "test_acc5": 77.624} {"train_lr": 0.001728968627421389, "train_loss": 2.2183095675633044, "epoch": 24, "test_loss": 2.02988460484673, "test_acc1": 53.458, "test_acc5": 77.74} {"train_lr": 0.0017071067811865767, "train_loss": 2.2135881979942704, "epoch": 25, "test_loss": 2.027728320540065, "test_acc1": 53.672, "test_acc5": 77.826} {"train_lr": 0.001637423989748733, "train_loss": 2.2022783923306277, "epoch": 28, "test_loss": 2.0179152854568208, "test_acc1": 53.774, "test_acc5": 77.848} {"train_lr": 0.001612907053652909, "train_loss": 2.1972024463149866, "epoch": 29, "test_loss": 2.0138325910738972, "test_acc1": 53.78, "test_acc5": 78.106} {"train_lr": 0.0015877852522924111, "train_loss": 2.190600351017092, "epoch": 30, "test_loss": 2.0107625161900238, "test_acc1": 53.848, "test_acc5": 77.956} {"train_lr": 0.0015358267949790963, "train_loss": 2.188844637434053, "epoch": 32, "test_loss": 2.005095142690117, "test_acc1": 53.932, "test_acc5": 78.07} {"train_lr": 0.0015090414157503675, "train_loss": 2.1826356470680124, "epoch": 33, "test_loss": 2.0016609069026643, "test_acc1": 54.094, "test_acc5": 78.126} {"train_lr": 0.0014257792915651636, "train_loss": 2.172691834363469, "epoch": 36, "test_loss": 1.9953247895631034, "test_acc1": 54.164, "test_acc5": 78.192} {"train_lr": 0.001397147890634744, "train_loss": 2.1696035133996965, "epoch": 37, "test_loss": 1.9914664077331952, "test_acc1": 54.308, "test_acc5": 78.236} {"train_lr": 0.0012789911060391633, "train_loss": 2.1593768834901055, "epoch": 41, "test_loss": 1.9835437376176, "test_acc1": 54.398, "test_acc5": 78.386} {"train_lr": 0.0012486898871647862, "train_loss": 2.1611585939973534, "epoch": 42, "test_loss": 1.9836572911733252, "test_acc1": 54.426, "test_acc5": 78.416} {"train_lr": 0.001218143241396569, "train_loss": 2.1568330939155174, "epoch": 43, "test_loss": 1.9815928628072714, "test_acc1": 54.526, "test_acc5": 78.458} {"train_lr": 0.0011873813145856604, "train_loss": 2.1544099425107444, "epoch": 44, "test_loss": 1.978697159558611, "test_acc1": 54.532, "test_acc5": 78.514} {"train_lr": 0.0011564344650402871, "train_loss": 2.153710473217204, "epoch": 45, "test_loss": 1.9785750241535704, "test_acc1": 54.634, "test_acc5": 78.506} {"train_lr": 0.0010627905195293463, "train_loss": 2.14949663525765, "epoch": 48, "test_loss": 1.972251180340262, "test_acc1": 54.734, "test_acc5": 78.62} {"train_lr": 0.0009058916866814987, "train_loss": 2.13642723932773, "epoch": 53, "test_loss": 1.966746737737485, "test_acc1": 54.798, "test_acc5": 78.664} {"train_lr": 0.0008746667664356905, "train_loss": 2.1355174361968867, "epoch": 54, "test_loss": 1.9659034911628879, "test_acc1": 54.814, "test_acc5": 78.674} {"train_lr": 0.0008126186854142688, "train_loss": 2.1335300625894624, "epoch": 56, "test_loss": 1.9628628617357415, "test_acc1": 54.83, "test_acc5": 78.742} {"train_lr": 0.0007818567586034925, "train_loss": 2.130477556551736, "epoch": 57, "test_loss": 1.9610874948599148, "test_acc1": 54.894, "test_acc5": 78.644} {"train_lr": 0.0007513101128351359, "train_loss": 2.1319769332397476, "epoch": 58, "test_loss": 1.9608589560174576, "test_acc1": 54.912, "test_acc5": 78.738} {"train_lr": 0.000721008893960812, "train_loss": 2.129030893671003, "epoch": 59, "test_loss": 1.9589605838289041, "test_acc1": 55.064, "test_acc5": 78.83} {"train_lr": 0.0006318754473153419, "train_loss": 2.1267056471697576, "epoch": 62, "test_loss": 1.95591099091503, "test_acc1": 55.114, "test_acc5": 78.82} {"train_lr": 0.00041221474770750024, "train_loss": 2.1183434658249616, "epoch": 70, "test_loss": 1.9505617863229474, "test_acc1": 55.126, "test_acc5": 78.886} {"train_lr": 0.00029289321881343744, "train_loss": 2.1142668228267527, "epoch": 75, "test_loss": 1.94945651597684, "test_acc1": 55.178, "test_acc5": 78.854} {"train_lr": 0.0002710313725785874, "train_loss": 2.115103325341827, "epoch": 76, "test_loss": 1.9493636531598122, "test_acc1": 55.192, "test_acc5": 78.9} {"train_lr": 0.0002498889303695469, "train_loss": 2.1134280195676274, "epoch": 77, "test_loss": 1.9487363000964875, "test_acc1": 55.208, "test_acc5": 78.926} {"train_lr": 0.0001556720744979767, "train_loss": 2.111991622545126, "epoch": 82, "test_loss": 1.9479377676763803, "test_acc1": 55.214, "test_acc5": 78.902} {"train_lr": 0.00013925797299606019, "train_loss": 2.1129869865480537, "epoch": 83, "test_loss": 1.9476479887200133, "test_acc1": 55.224, "test_acc5": 78.916} {"train_lr": 0.00012369331995613583, "train_loss": 2.1090548642744698, "epoch": 84, "test_loss": 1.9472213134436351, "test_acc1": 55.24, "test_acc5": 78.912} {"train_lr": 8.224537431602316e-05, "train_loss": 2.1112206636478175, "epoch": 87, "test_loss": 1.946871292103282, "test_acc1": 55.244, "test_acc5": 78.926} {"train_lr": 7.022351411175157e-05, "train_loss": 2.1080314371879227, "epoch": 88, "test_loss": 1.946782335067344, "test_acc1": 55.258, "test_acc5": 78.938} {"train_lr": 5.911923104577377e-05, "train_loss": 2.1095510911236657, "epoch": 89, "test_loss": 1.9466997360634377, "test_acc1": 55.264, "test_acc5": 78.94} {"train_lr": 4.8943483704848814e-05, "train_loss": 2.107927286882092, "epoch": 90, "test_loss": 1.946571450404194, "test_acc1": 55.286, "test_acc5": 78.914}

Sherrylone commented 3 weeks ago

There might be two problems. One is the number of GPUs used for pretraining, the overall batch size could influence the results. The other is the temperature for distillation loss (smaller could result in higher accuracies).

Zopek commented 3 weeks ago

Here I used batch_size_per_gpu = 256 and I used 4 H800 for pretraining, which is 256 * 4 = 1024 in total. Do you have any suggestion on what temperature should I use to reproduce the result?

Sherrylone commented 3 weeks ago

You can try 0.04.

Zopek commented 3 weeks ago

OK I'll try it. Thanks!

Sherrylone / PQCL

Problem with reproduce #3