Closed eugene-kharitonov closed 4 years ago
@robertodessi jfyi
Performance measurements:
/1/ 8 GPUs distributed
(egg38) kharitonov@machine:~/work/EGG$ cat egg/zoo/channel/params.json
{
"vocab_size": 5,
"n_features": 1000,
"n_epoch": 5,
"batch_size": 640,
"max_len": 30,
"random_seed": 21,
"batches_per_epoch": 1000,
"probs": "powerlaw",
"sender_cell": "lstm",
"receiver_cell": "lstm",
"sender_entropy_coeff": 1,
"sender_hidden": 100,
"receiver_hidden": 100
}
python -m egg.nest.nest --game=egg.zoo.channel.train --sweep=egg/zoo/channel/params.json --nodes=1 --task=8
runs in 7 minutes (effective batch size 5120).
/2/ 2 GPUs distributed
time python -m torch.distributed.launch --use_env --nproc_per_node=2 egg/zoo/channel/train.py --vocab_size=5 --n_features=1000 --n_epoch=5 --max_len=30 --batch_size=2560 --random_seed=21 --batches_per_epoch=1000 --probs=powerlaw --sender_cell=lstm --receiver_cell=lstm --sender_entropy_coeff=1 --sender_hidden=100 --receiver_hidden=100
runs in 10m 44s (effective batch size 5120).
/3/ 1 GPU non-distributed
time python egg/zoo/channel/train.py --vocab_size=5 --n_features=1000 --n_epoch=5 --max_len=30 --batch_size=5120 --random_seed=21 --batches_per_epoch=1000 --probs=powerlaw --sender_cell=lstm --receiver_cell=lstm --sender_entropy_coeff=1 --sender_hidden=100 --receiver_hidden=100
runs in 15 minutes (batch size 5120).
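To make the three measurements easier to compare, here is a small sketch that recomputes the effective batch size (GPUs times per-GPU batch size) and derives the speedup over the single-GPU run. The timings and batch sizes are copied from the runs above; treating "7 minutes" and "15 minutes" as exact values is an assumption, and the `summarize` helper is a hypothetical name that is not part of EGG.

```python
# Timings and batch sizes copied from the three runs above.
# Assumption: "7 minutes" and "15 minutes" are treated as exact values.
runs = [
    # (label, number of GPUs, per-GPU batch size, wall-clock seconds)
    ("8 GPUs distributed", 8, 640, 7 * 60),
    ("2 GPUs distributed", 2, 2560, 10 * 60 + 44),
    ("1 GPU non-distributed", 1, 5120, 15 * 60),
]

# The single-GPU run is the reference point for speedup.
baseline_seconds = runs[-1][3]


def summarize(n_gpus, per_gpu_batch, seconds, baseline=baseline_seconds):
    """Hypothetical helper: return (effective batch, speedup, efficiency)."""
    effective_batch = n_gpus * per_gpu_batch  # samples per optimizer step
    speedup = baseline / seconds              # relative to the 1-GPU run
    efficiency = speedup / n_gpus             # 1.0 would be linear scaling
    return effective_batch, speedup, efficiency


for label, n_gpus, per_gpu_batch, seconds in runs:
    effective_batch, speedup, efficiency = summarize(n_gpus, per_gpu_batch, seconds)
    print(
        f"{label}: effective batch {effective_batch}, "
        f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}"
    )
```

With these numbers, all three runs use the same effective batch size of 5120, and the speedup over the single-GPU run is roughly 2.1x on 8 GPUs and 1.4x on 2 GPUs, so scaling is clearly sub-linear for this small LSTM model. The figures are only as precise as the rounded timings they come from.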
Distributed data parallel training:
TODO: