facebookresearch / EGG

EGG: Emergence of lanGuage in Games

Distributed training #92

Closed by eugene-kharitonov 4 years ago

eugene-kharitonov commented 4 years ago

Distributed data parallel training:

TODO:

eugene-kharitonov commented 4 years ago

@robertodessi jfyi

eugene-kharitonov commented 4 years ago

Performance measurements:

/1/ 8 GPUs distributed

(egg38) kharitonov@machine:~/work/EGG$ cat egg/zoo/channel/params.json
{
  "vocab_size": 5,
  "n_features": 1000,
  "n_epoch": 5,
  "batch_size": 640,
  "max_len": 30,
  "random_seed": 21,
  "batches_per_epoch": 1000,
  "probs": "powerlaw",
  "sender_cell": "lstm",
  "receiver_cell": "lstm",
  "sender_entropy_coeff": 1,
  "sender_hidden": 100,
  "receiver_hidden": 100
}

python -m egg.nest.nest --game=egg.zoo.channel.train --sweep=egg/zoo/channel/params.json --nodes=1 --task=8

Runs in 7 minutes (effective batch size: 8 × 640 = 5120).

/2/ 2 GPUs distributed

time python -m torch.distributed.launch --use_env --nproc_per_node=2  egg/zoo/channel/train.py --vocab_size=5 --n_features=1000 --n_epoch=5 --max_len=30 --batch_size=2560 --random_seed=21 --batches_per_epoch=1000 --probs=powerlaw --sender_cell=lstm --receiver_cell=lstm --sender_entropy_coeff=1 --sender_hidden=100 --receiver_hidden=100

Runs in 10m 44s (effective batch size: 2 × 2560 = 5120).
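For readers unfamiliar with `--use_env`: with this flag, `torch.distributed.launch` passes the local rank through the `LOCAL_RANK` environment variable instead of a `--local_rank` command-line argument, alongside `RANK` and `WORLD_SIZE`. The sketch below shows one way a training script could read those variables and derive the effective batch size. It is an illustration only, not EGG's actual code: `DistributedContext`, `read_distributed_context`, and `effective_batch_size` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistributedContext:
    """Hypothetical container for the values set by the launcher."""
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (usually the GPU id)
    world_size: int  # total number of processes

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1


def read_distributed_context(env: Mapping[str, str] = os.environ) -> DistributedContext:
    # Without the launcher these variables are absent, so fall back to
    # single-process defaults.
    return DistributedContext(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def effective_batch_size(per_process_batch_size: int, ctx: DistributedContext) -> int:
    # In data-parallel training every process consumes its own batch per step.
    return per_process_batch_size * ctx.world_size


if __name__ == "__main__":
    ctx = read_distributed_context()
    print(ctx, effective_batch_size(2560, ctx))
```

Launched with `--nproc_per_node=2` and `--batch_size=2560`, each of the two processes would see `WORLD_SIZE=2`, which gives the effective batch size of 5120 quoted above.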

/3/ 1 GPU non-distributed

time python egg/zoo/channel/train.py --vocab_size=5 --n_features=1000 --n_epoch=5 --max_len=30 --batch_size=5120 --random_seed=21 --batches_per_epoch=1000 --probs=powerlaw --sender_cell=lstm --receiver_cell=lstm --sender_entropy_coeff=1 --sender_hidden=100 --receiver_hidden=100

Runs in 15 minutes (batch size 5120).
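To put the three timings side by side, the following sketch computes the effective batch size, the speedup over the single-GPU run, and the per-GPU scaling efficiency. The numbers are copied from the measurements above; the 7-minute and 15-minute figures are reported to the minute only, so the ratios are approximate. `RUNS` and `summarize` are names made up for this illustration.

```python
# (label, number of GPUs, per-GPU batch size, wall-clock seconds)
RUNS = [
    ("8 GPUs distributed", 8, 640, 7 * 60),
    ("2 GPUs distributed", 2, 2560, 10 * 60 + 44),
    ("1 GPU non-distributed", 1, 5120, 15 * 60),
]


def summarize(runs):
    """Return one dict per run with speedup relative to the 1-GPU run."""
    baseline_seconds = next(sec for _, n_gpus, _, sec in runs if n_gpus == 1)
    rows = []
    for label, n_gpus, per_gpu_batch, seconds in runs:
        speedup = baseline_seconds / seconds
        rows.append({
            "label": label,
            "effective_batch_size": n_gpus * per_gpu_batch,
            "speedup": speedup,
            # Fraction of ideal linear scaling actually achieved.
            "efficiency": speedup / n_gpus,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RUNS):
        print(f"{row['label']:<24} batch={row['effective_batch_size']} "
              f"speedup={row['speedup']:.2f}x efficiency={row['efficiency']:.2f}")
```

With these inputs, all three runs use the same effective batch size of 5120, and the distributed runs come out at roughly 2.1x (8 GPUs) and 1.4x (2 GPUs) faster than the single-GPU baseline.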