Open animesh-007 opened 2 years ago
@animesh-007
Can you point out what version of GradMatch you are using?
Ideally, subset selection should be faster unless something is wrong with the experimental setup. Please attach the log files so that I can figure out the issue after analyzing them.
@krishnatejakk These are the initial logs. Should I paste the whole log? I cloned the repo on June 27, so I should be using the latest version.
[06/27 16:40:56] train_sl INFO: DotMap(setting='SL', is_reg=True, dataset=DotMap(name='cifar10', datadir='../storage', feature='dss', type='image'), dataloader=DotMap(shuffle=True, batch_size=256, pin_memory=True, num_workers=8), model=DotMap(architecture='ResNet50_224', type='pre-defined', numclasses=10), ckpt=DotMap(is_load=False, is_save=True, dir='results/', save_every=20), loss=DotMap(type='CrossEntropyLoss', use_sigmoid=False), optimizer=DotMap(type='sgd', momentum=0.9, lr=0.01, weight_decay=0.0005, nesterov=False), scheduler=DotMap(type='cosine_annealing', T_max=300), dss_args=DotMap(type='GradMatch', fraction=0.3, select_every=5, lam=0.5, selection_type='PerClassPerGradient', v1=True, valid=False, kappa=0, eps=1e-100, linear_layer=True), train_args=DotMap(num_epochs=300, device='cuda', print_every=1, results_dir='results/', print_args=['val_loss', 'val_acc', 'tst_loss', 'tst_acc', 'time'], return_args=[]))
Files already downloaded and verified
[06/27 16:41:12] train_sl INFO: Epoch: 1 , Validation Loss: 3.1551918701171875 , Validation Accuracy: 0.1914 , Test Loss: 3.5032728210449218 , Test Accuracy: 0.2142 , Timing: 7.0498366355896
[06/27 16:41:21] train_sl INFO: Epoch: 2 , Validation Loss: 2.387578009033203 , Validation Accuracy: 0.3002 , Test Loss: 2.735808560180664 , Test Accuracy: 0.3253 , Timing: 6.075047492980957
[06/27 16:41:31] train_sl INFO: Epoch: 3 , Validation Loss: 2.139058850097656 , Validation Accuracy: 0.3246 , Test Loss: 2.036042041015625 , Test Accuracy: 0.3344 , Timing: 6.058322191238403
[06/27 16:41:41] train_sl INFO: Epoch: 4 , Validation Loss: 3.5549482177734375 , Validation Accuracy: 0.3576 , Test Loss: 2.480993505859375 , Test Accuracy: 0.3838 , Timing: 5.7214953899383545
[06/27 16:41:50] train_sl INFO: Epoch: 5 , Validation Loss: 3.782627294921875 , Validation Accuracy: 0.3624 , Test Loss: 3.3407586791992188 , Test Accuracy: 0.39 , Timing: 5.925083160400391
[06/27 16:58:59] train_sl INFO: Epoch: 6, GradMatch subset selection finished, takes 1028.8181.
@krishnatejakk I got similar results with the CIFAR-10 dataset and a ResNet18 model. One epoch of training on the full dataset took about 50 seconds, while GradMatch and CRAIG each took more than 100 seconds, of which roughly 100 seconds went to subset selection itself.
Can we preprocess the whole dataset once to obtain the weighted training subset, and then train directly on that weighted subset? That should shorten the training time. Is there an example of this?
Is there a faster subset selection method? Thank you.
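A minimal sketch of the "select once, cache, then train on the weighted subset" idea. Everything here is illustrative, not the CORDS API: the random index/weight selection stands in for GradMatch's OMP-based selection, and the model is a tiny weighted logistic regression just to show how the cached per-sample weights (gammas) enter the loss gradient:

```python
import numpy as np

# Synthetic stand-in dataset: label is the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] > 0).astype(float)

# One-time "preprocessing": pick a 30% subset with per-sample weights.
# (In CORDS, the indices and gammas would come from a selection strategy.)
frac = 0.3
idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
gammas = rng.uniform(0.5, 1.5, size=len(idx))

# Cache the selection so later runs skip it entirely.
np.savez("subset.npz", idx=idx, gammas=gammas)
cached = np.load("subset.npz")
Xs, ys, w = X[cached["idx"]], y[cached["idx"]], cached["gammas"]

# Weighted training on the fixed subset: weights scale each sample's gradient.
theta = np.zeros(X.shape[1])
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-Xs @ theta))
    grad = Xs.T @ (w * (p - ys)) / w.sum()  # weighted logistic gradient
    theta -= 0.5 * grad

acc = ((1.0 / (1.0 + np.exp(-X @ theta)) > 0.5) == (y > 0.5)).mean()
print(f"accuracy on full set: {acc:.2f}")
```

Note that freezing the subset trades adaptivity for speed: GradMatch reselects every `select_every` epochs precisely because the informative subset changes as the model trains.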
[Full dataset]
INFO: The length of dataloader: 2250
INFO: Training Timing: 50.17572069168091
[GradMatch]
INFO: The length of dataloader: 225
INFO: GradMatch subset selection finished, takes 99.8966.
INFO: Training Timing: 104.97514295578003
[CRAIG]
INFO: The length of dataloader: 225
INFO: subset selection finished, takes 108.4812.
INFO: Training Timing: 114.62646007537842
@shiyf129 What is the resolution of the images you are using while training? I am using 224x224.
@animesh-007 @shiyf129 I am working on the issue. We recently updated the OMP version in the GradMatch code, which improves its performance further. However, the new OMP version is making it slower in this case. I will debug why it is so slow here.
For faster training, one option is to use GradMatchPB (i.e., the per-batch version) or revert to the previous OMP version in the GradMatch strategy code below: https://github.com/decile-team/cords/blob/844f897ea4ed7e2f9c1453888022c281bb2091be/cords/selectionstrategies/SL/gradmatchstrategy.py#L6 In the import statement, remove the `_V1` suffix to revert to the previous version of the OMP code.
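If I read the suggestion correctly, the revert is a one-line edit at the linked import; the solver names below are taken from the linked file, but the exact module path may differ, so the key change is simply dropping the `_V1` suffix:

```python
# Sketch of the edit in cords/selectionstrategies/SL/gradmatchstrategy.py (illustrative):
#
# Before -- new (here, slower) OMP solver:
#   from <omp helpers module> import OrthogonalMP_REG_Parallel_V1
#
# After -- previous OMP solver, i.e. remove the `_V1` suffix:
#   from <omp helpers module> import OrthogonalMP_REG_Parallel
#
# Any call sites that reference the `_V1` name must be updated to match.
```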
> @shiyf129 What is the resolution of the images you are using while training? I am using 224x224.
I used the original CIFAR-10 dataset, with 32x32 images.
@krishnatejakk I tested the GradMatchPB algorithm and set v1=False to use the previous OMP version. I compared the first 10 epochs of training between the GradMatchPB algorithm and full-dataset training; the results show that GradMatchPB takes longer overall, and its average accuracy is relatively low. Do you know the reason for this?
GradMatchPB
Full dataset training
dss_args=dict(type="GradMatchPB",
fraction=0.1,
select_every=20,
lam=0,
selection_type='PerBatch',
v1=False,
valid=False,
eps=1e-100,
linear_layer=True,
kappa=0),
GradMatchPB beginning 10 epoch training:
| Index | Subset selection time (s) | Training epoch time (s) | Test accuracy |
| -- | -- | -- | -- |
| 1 | 25.85 | 30.91 | 0.3588 |
| 2 | 25.61 | 30.72 | 0.3707 |
| 3 | 25.39 | 31.07 | 0.4201 |
| 4 | 28.71 | 34.43 | 0.4314 |
| 5 | 28.69 | 33.85 | 0.4748 |
| 6 | 25.81 | 31.17 | 0.485 |
| 7 | 29.03 | 34.72 | 0.4881 |
| 8 | 26.78 | 31.85 | 0.511 |
| 9 | 25.82 | 31.45 | 0.537 |
| 10 | 25.4 | 30.47 | 0.5535 |
| Mean | 26.7 | 32.06 | 0.463 |
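A quick arithmetic check that the reported means match the per-epoch rows (values copied from the table):

```python
# Per-epoch values from the GradMatchPB table above.
sel = [25.85, 25.61, 25.39, 28.71, 28.69, 25.81, 29.03, 26.78, 25.82, 25.4]
epoch = [30.91, 30.72, 31.07, 34.43, 33.85, 31.17, 34.72, 31.85, 31.45, 30.47]
acc = [0.3588, 0.3707, 0.4201, 0.4314, 0.4748, 0.485, 0.4881, 0.511, 0.537, 0.5535]

# Means, rounded to the precision used in the table's Mean row.
print(round(sum(sel) / 10, 1), round(sum(epoch) / 10, 2), round(sum(acc) / 10, 3))
# -> 26.7 32.06 0.463
```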
I tried to run some experiments as follows:
I am using scaled-resolution CIFAR-10 images, i.e. 224x224, and a correspondingly defined ResNet50 architecture. Can you let me know how to speed up experiments 2 and 3? In general, a subset selection method should speed up the whole training process, right?
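For intuition on when selection pays off despite its cost: the selection time is amortized over `select_every` epochs, while the per-epoch training cost shrinks with the subset fraction. Plugging in the ResNet18/CIFAR-10 timings reported earlier in the thread (full epoch ~50.18 s, GradMatch selection ~99.90 s, subset-epoch training ~5.08 s after subtracting selection), and assuming `select_every=20` over a 300-epoch run:

```python
# Back-of-envelope amortization with the timings reported above.
full_epoch = 50.18           # full-dataset epoch time (s)
selection = 99.8966          # one GradMatch subset-selection pass (s)
sub_train = 104.9751 - selection  # subset epoch time minus selection cost (s)
epochs, select_every = 300, 20    # select_every=20 is an assumption here

full_total = epochs * full_epoch
grad_total = epochs * sub_train + (epochs // select_every) * selection
print(f"full: {full_total:.0f}s, gradmatch: {grad_total:.0f}s, "
      f"speedup: {full_total / grad_total:.1f}x")
```

So even a selection pass that costs twice a full epoch can still yield a large end-to-end speedup, provided it runs only every `select_every` epochs; the pathology in this issue is that selection runs far slower than that (1028 s in the 224x224 log).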