Closed brendanofallon closed 3 years ago
Not sure if it's because the system is already caching the files, but when I benchmarked the train subcommand on a subset of Pregen files on CHPC, this branch took longer (about 375 seconds per epoch) than the current master branch (about 317 seconds per epoch):
(torch) [u6005925@kingspeak2 deepl]$ grep "Epoch " slurm-3334482.with-sample-cache.out
[10-25 09:11:45] train INFO Epoch 0 Secs: 779.54 lr: 0.0010 loss: 849.6832 train acc: 0.9372 val accuracy: 0.9053, val VAF accuracy: 0.7883
[10-25 09:18:12] train INFO Epoch 1 Secs: 374.56 lr: 0.0010 loss: 841.7675 train acc: 0.9372 val accuracy: 0.9069, val VAF accuracy: 0.1982
[10-25 09:24:39] train INFO Epoch 2 Secs: 374.97 lr: 0.0010 loss: 835.5685 train acc: 0.9346 val accuracy: 0.9136, val VAF accuracy: 0.1092
[10-25 09:31:07] train INFO Epoch 3 Secs: 375.79 lr: 0.0010 loss: 833.2199 train acc: 0.9333 val accuracy: 0.9136, val VAF accuracy: 0.0887
[10-25 09:37:34] train INFO Epoch 4 Secs: 375.73 lr: 0.0010 loss: 832.1494 train acc: 0.9205 val accuracy: 0.9150, val VAF accuracy: 0.0873
(torch) [u6005925@kingspeak2 deepl]$ grep "Epoch " slurm-3334511.current-master.out
[10-25 09:50:19] train INFO Epoch 0 Secs: 360.36 lr: 0.0010 loss: 848.7621 train acc: 0.9372 val accuracy: 0.9108, val VAF accuracy: 0.7882
[10-25 09:55:49] train INFO Epoch 1 Secs: 317.88 lr: 0.0010 loss: 840.0670 train acc: 0.9372 val accuracy: 0.9176, val VAF accuracy: 0.0906
[10-25 10:01:18] train INFO Epoch 2 Secs: 316.99 lr: 0.0010 loss: 829.6251 train acc: 0.9372 val accuracy: 0.9300, val VAF accuracy: 0.1037
[10-25 10:06:47] train INFO Epoch 3 Secs: 317.41 lr: 0.0010 loss: 823.0594 train acc: 0.9372 val accuracy: 0.9313, val VAF accuracy: 0.1050
[10-25 10:12:16] train INFO Epoch 4 Secs: 316.94 lr: 0.0010 loss: 820.9989 train acc: 0.9372 val accuracy: 0.9343, val VAF accuracy: 0.1051
[10-25 10:17:45] train INFO Epoch 5 Secs: 317.67 lr: 0.0010 loss: 818.5588 train acc: 0.9372 val accuracy: 0.9365, val VAF accuracy: 0.1056
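For anyone repeating the comparison, here is a rough sketch of how the per-epoch averages could be pulled out of logs like the ones above. The function names are hypothetical and not part of the repo. It assumes each epoch line contains "Epoch N Secs: X" as in the output shown, and it skips epoch 0 by default because the first epoch includes warm-up time in both runs.

```python
import re

# Matches the "Epoch N Secs: X" fragment of each training log line.
EPOCH_RE = re.compile(r"Epoch (\d+) Secs: ([\d.]+)")


def epoch_seconds(log_text):
    """Return {epoch_number: seconds} for every epoch line found in log_text."""
    return {int(epoch): float(secs) for epoch, secs in EPOCH_RE.findall(log_text)}


def steady_state_mean(log_text, skip=1):
    """Mean seconds per epoch, ignoring the first `skip` epochs (warm-up)."""
    secs = epoch_seconds(log_text)
    kept = [s for epoch, s in sorted(secs.items()) if epoch >= skip]
    if not kept:
        raise ValueError("no epochs left after skipping warm-up")
    return sum(kept) / len(kept)
```

Calling `steady_state_mean` on the contents of each slurm output file and dividing the two results gives the relative slowdown between the branch and master.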
Yeah, something funny is going on with CHPC; I'm getting weirdly slow results. Maybe it's using a ton of swap or something? I'll look into it...
Adds a simple caching mechanism for PregenLoader that stores raw file contents (without decompressing) in RAM instead of loading them from disk every time. We still decompress in parallel and don't cache the decompressed data. I think this will be helpful on systems with lots of RAM but not super fast IO (like the kingspeak machines). The max cache size can be set from the command line with the --max-cache-size option, which defaults to 1000 but could probably be pretty big (like 10,000+?) on kingspeak machines.
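To illustrate the idea for reviewers, here is a minimal sketch of a raw-bytes cache. It is not the code in this PR: the class name `RawFileCache` and its methods are hypothetical, it assumes the files are gzip-compressed, and it assumes the cache size counts files rather than bytes. The compressed bytes stay in RAM and decompression happens on every access, so the decompressed data is never cached.

```python
import gzip
from pathlib import Path


class RawFileCache:
    """Hypothetical sketch: keep raw (still compressed) file bytes in RAM."""

    def __init__(self, max_size=1000):
        # Assumption: max_size is a count of files, not a byte budget.
        self.max_size = max_size
        self._raw = {}
        self.disk_reads = 0  # counts actual reads from disk, for inspection

    def get_raw(self, path):
        """Return the compressed bytes, reading from disk only on a miss."""
        key = str(path)
        if key in self._raw:
            return self._raw[key]
        data = Path(path).read_bytes()
        self.disk_reads += 1
        # Once the cache is full, later files are read from disk every time.
        if len(self._raw) < self.max_size:
            self._raw[key] = data
        return data

    def load(self, path):
        """Decompress on every call; only the compressed bytes are cached."""
        return gzip.decompress(self.get_raw(path))
```

Because `load` only decompresses bytes that are already in memory after the first epoch, later epochs trade disk IO for RAM, which is the trade-off described above.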