Closed: samiwilf closed this pull request 1 year ago.
@samiwilf has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request was exported from Phabricator. Differential Revision: D41953342
… training split. Also, fix 2 extra val batches being consumed due to queuing.
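For context on the queuing fix, here is a minimal sketch of the failure mode and the usual remedy (hypothetical helper, assuming a train pipeline that prefetches two batches ahead, as TorchRec's TrainPipelineSparseDist does; this is not the PR's actual diff): if the evaluation loop counts batches itself while the pipeline reads ahead, the pipeline pulls two batches past the intended limit, so the iterator should be bounded before the pipeline ever sees it.

```python
import itertools
from typing import Iterator, Optional, TypeVar

T = TypeVar("T")

def bounded_batches(batch_iter: Iterator[T], limit: Optional[int]) -> Iterator[T]:
    # Bound the iterator up front so a pipeline that prefetches two
    # batches ahead cannot consume extras beyond the intended limit.
    return batch_iter if limit is None else itertools.islice(batch_iter, limit)
```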
Tested by running a short run followed by a long run:

```sh
export TOTAL_TRAINING_SAMPLES=4195197692
export BATCHSIZE=55296
```
```sh
torchx run -s local_cwd dist.ddp -j 1x8 --script dlrm_main.py -- \
    --embedding_dim 128 \
    --dense_arch_layer_sizes 512,256,128 \
    --over_arch_layer_sizes 1024,1024,512,256,1 \
    --in_memory_binary_criteo_path /home/ubuntu/mountpoint/1tb_numpy_contiguous_shuffled \
    --batch_size $((BATCHSIZE / 8)) \
    --test_batch_size 131072 \
    --num_embeddings_per_feature 39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36 \
    --epochs 1 \
    --pin_memory \
    --mmap_mode \
    --validation_freq_within_epoch 50 \
    --learning_rate 24.0 \
    --limit_train_batches 50 \
    --limit_val_batches 50 \
    --limit_test_batches 50 \
    --lr_warmup_steps 2750 \
    --lr_decay_start 49315 \
    --lr_decay_steps 27772 \
    --drop_last_training_batch
```
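The first invocation is the short smoke-test run: `--limit_train_batches`, `--limit_val_batches`, and `--limit_test_batches` cap each phase at 50 batches, and `--batch_size $((BATCHSIZE / 8))` splits the global batch of 55296 evenly across the 8 ranks of the `-j 1x8` job (6912 samples per rank).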
```sh
torchx run -s local_cwd dist.ddp -j 1x8 --script dlrm_main.py -- \
    --embedding_dim 128 \
    --dense_arch_layer_sizes 512,256,128 \
    --over_arch_layer_sizes 1024,1024,512,256,1 \
    --in_memory_binary_criteo_path /home/ubuntu/mountpoint/1tb_numpy_contiguous_shuffled \
    --batch_size $((BATCHSIZE / 8)) \
    --test_batch_size 131072 \
    --num_embeddings_per_feature 39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36 \
    --epochs 1 \
    --pin_memory \
    --mmap_mode \
    --validation_freq_within_epoch $((TOTAL_TRAINING_SAMPLES / (BATCHSIZE * 20))) \
    --learning_rate 24.0 \
    --lr_warmup_steps 2750 \
    --lr_decay_start 49315 \
    --lr_decay_steps 27772 \
    --drop_last_training_batch
```
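In the long run, `--validation_freq_within_epoch` is derived from the exported variables rather than hard-coded. A quick sanity check of the arithmetic (a standalone sketch with the constants copied from the exports above, not code from this PR):

```python
# Verify the values the shell arithmetic expands to in the commands above.
TOTAL_TRAINING_SAMPLES = 4_195_197_692
BATCHSIZE = 55_296

print(BATCHSIZE // 8)                              # 6912 samples per rank
print(TOTAL_TRAINING_SAMPLES // BATCHSIZE)         # 75868 training batches per epoch
print(TOTAL_TRAINING_SAMPLES // (BATCHSIZE * 20))  # 3793 -> validate ~20 times per epoch
```

So validation runs every 3793 training batches, roughly 20 times over the epoch.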