decile-team / cords

Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.
https://cords.readthedocs.io/en/latest/
MIT License

GradMatch data subset selection method making training slow #78

Open animesh-007 opened 2 years ago

animesh-007 commented 2 years ago

I tried to run some experiments as follows:

I am using scaled-up CIFAR-10 images, i.e., 224x224 resolution, and defined the ResNet50 architecture accordingly. Can you let me know how to speed up experiments 2 and 3? In general, a subset selection method should make the whole training process faster, right?
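For context, a minimal sketch of the scaled-up data pipeline described above, using plain torchvision (the normalization constants are standard CIFAR-10 values and the loader settings are copied from the config log later in this thread; nothing here is cords-specific):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Upscale CIFAR-10 from 32x32 to 224x224 so it matches a standard ResNet50 input size.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = datasets.CIFAR10(root="../storage", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True,
                          num_workers=8, pin_memory=True)
```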

krishnatejakk commented 2 years ago

@animesh-007

Can you point out what version of GradMatch you are using?

Ideally, subset selection should be faster unless something is wrong with the experimental setup. Please attach the log files so that I can figure out the issue after analyzing them.

animesh-007 commented 2 years ago

> @animesh-007
>
> Can you point out what version of GradMatch you are using?
>
> Ideally, subset selection should be faster unless something is wrong with the experimental setup. Please attach the log files so that I can figure out the issue after analyzing them.

@krishnatejakk These are the initial logs. Should I paste the whole log? I cloned the repo on June 27, so I guess I am using the latest version.

[06/27 16:40:56] train_sl INFO: DotMap(setting='SL', is_reg=True, dataset=DotMap(name='cifar10', datadir='../storage', feature='dss', type='image'), dataloader=DotMap(shuffle=True, batch_size=256, pin_memory=True, num_workers=8), model=DotMap(architecture='ResNet50_224', type='pre-defined', numclasses=10), ckpt=DotMap(is_load=False, is_save=True, dir='results/', save_every=20), loss=DotMap(type='CrossEntropyLoss', use_sigmoid=False), optimizer=DotMap(type='sgd', momentum=0.9, lr=0.01, weight_decay=0.0005, nesterov=False), scheduler=DotMap(type='cosine_annealing', T_max=300), dss_args=DotMap(type='GradMatch', fraction=0.3, select_every=5, lam=0.5, selection_type='PerClassPerGradient', v1=True, valid=False, kappa=0, eps=1e-100, linear_layer=True), train_args=DotMap(num_epochs=300, device='cuda', print_every=1, results_dir='results/', print_args=['val_loss', 'val_acc', 'tst_loss', 'tst_acc', 'time'], return_args=[]))
Files already downloaded and verified
Files already downloaded and verified
[06/27 16:41:12] train_sl INFO: Epoch: 1 , Validation Loss: 3.1551918701171875 , Validation Accuracy: 0.1914 , Test Loss: 3.5032728210449218 , Test Accuracy: 0.2142 , Timing: 7.0498366355896
[06/27 16:41:21] train_sl INFO: Epoch: 2 , Validation Loss: 2.387578009033203 , Validation Accuracy: 0.3002 , Test Loss: 2.735808560180664 , Test Accuracy: 0.3253 , Timing: 6.075047492980957
[06/27 16:41:31] train_sl INFO: Epoch: 3 , Validation Loss: 2.139058850097656 , Validation Accuracy: 0.3246 , Test Loss: 2.036042041015625 , Test Accuracy: 0.3344 , Timing: 6.058322191238403
[06/27 16:41:41] train_sl INFO: Epoch: 4 , Validation Loss: 3.5549482177734375 , Validation Accuracy: 0.3576 , Test Loss: 2.480993505859375 , Test Accuracy: 0.3838 , Timing: 5.7214953899383545
[06/27 16:41:50] train_sl INFO: Epoch: 5 , Validation Loss: 3.782627294921875 , Validation Accuracy: 0.3624 , Test Loss: 3.3407586791992188 , Test Accuracy: 0.39 , Timing: 5.925083160400391
[06/27 16:58:59] train_sl INFO: Epoch: 6, GradMatch subset selection finished, takes 1028.8181.

shiyf129 commented 2 years ago

@krishnatejakk I got similar test results using the CIFAR-10 dataset and a ResNet18 model. For one epoch of training, the full dataset took about 50 seconds, while GradMatch and CRAIG took more than 100 seconds. In addition, GradMatch and CRAIG each took about 100 seconds to select the subset within an epoch.

  1. Can we preprocess the whole dataset first to obtain the weighted training subset, and then train directly on that weighted subset? That should shorten the training time. Is there an example of this? (A rough sketch of this idea follows the log excerpts below.)

  2. Is there a faster subset selection method? Thank you.

[Full dataset]: INFO: The length of dataloader: 2250 INFO: Training Timing: 50.17572069168091

[GradMatch]: INFO: The length of dataloader: 225 INFO: GradMatch subset selection finished, takes 99.8966. INFO: Training Timing: 104.97514295578003

[CRAIG]: INFO: The length of dataloader: 225 INFO: subset selection finished, takes 108.4812. INFO: Training Timing: 114.62646007537842
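Regarding question 1 above, a rough sketch of the one-shot idea in plain PyTorch: run subset selection once offline, save the selected indices and weights, and then train only on that fixed weighted subset. The file names and the WeightedSubset wrapper below are hypothetical, and cords' adaptive dataloaders normally re-select periodically during training rather than once up front:

```python
import torch
import torchvision
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms

class WeightedSubset(Dataset):
    """Wrap a base dataset with precomputed subset indices and per-sample weights."""
    def __init__(self, base, indices, weights):
        self.base, self.indices, self.weights = base, indices, weights

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        x, y = self.base[self.indices[i]]
        return x, y, self.weights[i]

# Base training set and a small model/optimizer, mirroring the settings in this thread.
train_set = datasets.CIFAR10(root="../storage", train=True, download=True,
                             transform=transforms.ToTensor())
model = torchvision.models.resnet18(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)

# Hypothetical files holding the output of a single offline GradMatch selection pass:
# a LongTensor of selected sample indices and a FloatTensor of per-sample weights.
indices = torch.load("subset_indices.pt")
weights = torch.load("subset_weights.pt")

loader = DataLoader(WeightedSubset(train_set, indices.tolist(), weights),
                    batch_size=256, shuffle=True, num_workers=8, pin_memory=True)

criterion = torch.nn.CrossEntropyLoss(reduction="none")      # keep per-sample losses
for epoch in range(10):
    for x, y, w in loader:
        x, y, w = x.cuda(), y.cuda(), w.cuda().float()
        optimizer.zero_grad()
        loss = (criterion(model(x), y) * w).mean()           # weight each sample's loss
        loss.backward()
        optimizer.step()
```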

animesh-007 commented 2 years ago

@shiyf129 What is the resolution of the images you are using while training? I am using 224x224.

krishnatejakk commented 2 years ago

@animesh-007 @shiyf129 I am working on the issue. We recently updated the OMP version in the GradMatch code, which improves its performance further; however, the new OMP version is making it slower in this case. I will debug why it is so slow here.

For faster training, one option is to use GradMatchPB (i.e., the per-batch version), or to revert to the previous OMP version in the GradMatch strategy code linked below: https://github.com/decile-team/cords/blob/844f897ea4ed7e2f9c1453888022c281bb2091be/cords/selectionstrategies/SL/gradmatchstrategy.py#L6 In the import statement, remove the _V1 suffix to revert to the previous version of the OMP code.
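For reference, a rough sketch of what those two workarounds might look like. The exact import path and solver function names in Option 1 are assumptions (check the linked gradmatchstrategy.py before editing); the dss_args keys in Option 2 mirror the configuration used elsewhere in this thread:

```python
# Option 1: revert to the previous OMP solver inside gradmatchstrategy.py.
# If line 6 currently imports the newer "_V1" solver, e.g. (names are illustrative):
#     from cords.selectionstrategies.helpers import OrthogonalMP_REG_Parallel_V1
# drop the "_V1" suffix so the older solver is imported instead:
#     from cords.selectionstrategies.helpers import OrthogonalMP_REG_Parallel

# Option 2: switch the config to the per-batch variant, which solves a much smaller
# OMP problem per selection round (per this thread, v1=False also falls back to the
# previous OMP version via the config flag).
dss_args = dict(type="GradMatchPB",
                fraction=0.3,
                select_every=20,        # re-select only every 20 epochs
                lam=0.5,
                selection_type='PerBatch',
                v1=False,
                valid=False,
                eps=1e-100,
                linear_layer=True,
                kappa=0)
```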

shiyf129 commented 2 years ago

> @shiyf129 What is the resolution of the images you are using while training? I am using 224x224.

I use the original CIFAR-10 dataset, with 32x32 images.

shiyf129 commented 2 years ago

@krishnatejakk I tested the GradMatchPB algorithm and set v1=False to use the previous OMP version. I compared the first 10 training epochs of GradMatchPB against full-dataset training. The results show that GradMatchPB takes longer overall (subset selection plus training time per epoch), and its average accuracy is relatively low. Do you know the reason for this?

GradMatchPB configuration:

dss_args=dict(type="GradMatchPB",        # per-batch GradMatch variant
            fraction=0.1,                # train on 10% of the data
            select_every=20,             # re-select the subset every 20 epochs
            lam=0,
            selection_type='PerBatch',
            v1=False,                    # use the previous OMP version
            valid=False,
            eps=1e-100,
            linear_layer=True,
            kappa=0),

GradMatchPB, first 10 training epochs:


| Index | Subset selection time (s) | Training epoch time (s) | Test accuracy |
| --- | --- | --- | --- |
| 1 | 25.85 | 30.91 | 0.3588 |
| 2 | 25.61 | 30.72 | 0.3707 |
| 3 | 25.39 | 31.07 | 0.4201 |
| 4 | 28.71 | 34.43 | 0.4314 |
| 5 | 28.69 | 33.85 | 0.4748 |
| 6 | 25.81 | 31.17 | 0.485 |
| 7 | 29.03 | 34.72 | 0.4881 |
| 8 | 26.78 | 31.85 | 0.511 |
| 9 | 25.82 | 31.45 | 0.537 |
| 10 | 25.4 | 30.47 | 0.5535 |
| Mean | 26.7 | 32.06 | 0.463 |

Full dataset, first 10 training epochs:


| Index | Training epoch time (s) | Test accuracy |
| --- | --- | --- |
| 1 | 51.59 | 0.5279 |
| 2 | 52.13 | 0.6543 |
| 3 | 50.17 | 0.7183 |
| 4 | 51.26 | 0.7495 |
| 5 | 51.62 | 0.7779 |
| 6 | 50.14 | 0.8205 |
| 7 | 47.99 | 0.8026 |
| 8 | 51.54 | 0.8324 |
| 9 | 49.91 | 0.8229 |
| 10 | 52.32 | 0.8423 |
| Mean | 50.867 | 0.7548 |

krishnatejakk commented 2 years ago

@shiyf129 Why is subset selection happening every epoch? We usually set select_every to 20; subset selection takes some time, so you don't need to select a new subset every epoch.

Furthermore, training with a 10% subset should be roughly 10x faster than full-dataset training, and from your logs it doesn't seem that way. Can you check whether, if you create a 10% subset of the dataset and train on it for one epoch, that epoch is about 10x faster than a full-training epoch?
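A quick way to run this sanity check outside of cords: time one epoch on the full CIFAR-10 training set versus a random 10% Subset with identical loader settings. The ResNet18 model and SGD settings below are just placeholders mirroring this thread:

```python
import time
import torch
import torchvision
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

train_set = datasets.CIFAR10(root="../storage", train=True, download=True,
                             transform=transforms.ToTensor())

def one_epoch_seconds(dataset):
    """Time one training epoch over the given dataset with fixed loader settings."""
    loader = DataLoader(dataset, batch_size=256, shuffle=True,
                        num_workers=8, pin_memory=True)
    model = torchvision.models.resnet18(num_classes=10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    start = time.time()
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    torch.cuda.synchronize()
    return time.time() - start

full_time = one_epoch_seconds(train_set)
idx = torch.randperm(len(train_set))[: len(train_set) // 10].tolist()
subset_time = one_epoch_seconds(Subset(train_set, idx))
print(f"full epoch: {full_time:.1f}s, 10% subset epoch: {subset_time:.1f}s")
```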

shiyf129 commented 2 years ago

@krishnatejakk I modified the code to select a subset every 20 epochs. I ran the CIFAR-10 dataset on a ResNet18 model to compare GradMatchPB with full-dataset training: both were trained for 10 minutes, recording the test accuracy every minute. The average test accuracy of GradMatchPB is slightly lower than that of the full dataset. What is the reason for this? (A sketch of this measurement protocol follows the table below.)


|   | Full dataset | GradMatchPB (fraction=0.3) | GradMatchPB (fraction=0.1) |
| --- | --- | --- | --- |
| Average test accuracy | 0.7633 | 0.7515 | 0.6714 |
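For completeness, a compact sketch of the fixed wall-clock-budget protocol described above (train each method for a fixed number of minutes and record test accuracy roughly once per minute); the `train_one_epoch` callable and the 10-minute budget are placeholders:

```python
import time
import torch

@torch.no_grad()
def evaluate(model, test_loader):
    """Plain top-1 accuracy on the test loader."""
    model.eval()
    correct = total = 0
    for x, y in test_loader:
        pred = model(x.cuda()).argmax(dim=1)
        correct += (pred == y.cuda()).sum().item()
        total += y.numel()
    model.train()
    return correct / total

def accuracy_over_time(train_one_epoch, model, test_loader, budget_s=600, eval_every_s=60):
    """Train under a fixed wall-clock budget, recording test accuracy about once per minute."""
    history, start, next_eval = [], time.time(), eval_every_s
    while time.time() - start < budget_s:
        train_one_epoch()                  # caller-supplied: one epoch (full or subset)
        elapsed = time.time() - start
        if elapsed >= next_eval:
            history.append((round(elapsed), evaluate(model, test_loader)))
            next_eval += eval_every_s
    return history
```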