adam-peaston-SC opened 10 months ago
Size the effort of implementing this.
What would it take to get the C++ version cycling? What would it take to get the PyTorch version behaving like the C++ version? Response needed is a ballpark: 1d / 1w / 1m / 1y.
Do not spend more than 1hr on this estimation. May want to discuss with Chris or Michael.
Next step after this will be to go to Translated and get a sense of the value, and the likelihood of us winning this work from them if we get it done.
Initial look: needs a container even to install and run it properly; versioning needs containers as well.
For cycling:
Has some built-in checkpointing (save/load already supported). There is potentially some config you can pass through for how often you want it to save (save-freq); after that, it should be pretty good for cycling.
Probably do need a container to run this, but after that we should be able to train.
[containers integrated into Marian]
Also, metrics need to be matched to theirs.
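The save-frequency config mentioned above could be sketched as a Marian YAML config fragment (hedged: key names are as I recall them from the Marian docs and should be checked; paths and values are illustrative):

```yaml
# Illustrative Marian training config fragment — paths and values are placeholders.
model: model/model.npz        # checkpoint path; Marian also saves optimizer/training state
train-sets:
  - data/train.src
  - data/train.trg
vocabs:
  - model/vocab.src.yml
  - model/vocab.trg.yml
save-freq: 5000               # save a checkpoint every 5000 updates — the key knob for cycling
valid-freq: 5000
disp-freq: 500
overwrite: false              # keep intermediate checkpoints rather than overwriting
```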
Getting containers sorted:
Potentially changing architecture; spend 1-3 days on containers (Calvin, Tim, Fennecs, James).
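A minimal container sketch for building and running Marian (assumptions: the marian-nmt GitHub repo, a CUDA devel base image, and MPI enabled at build time; the base image tag and CMake flags are illustrative):

```dockerfile
# Illustrative only — base image tag and build flags are assumptions, not a tested recipe.
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y \
    git cmake build-essential libboost-all-dev libopenmpi-dev openmpi-bin \
    && rm -rf /var/lib/apt/lists/*

# Build Marian from source with MPI support for multi-node training
RUN git clone https://github.com/marian-nmt/marian /opt/marian \
    && cmake -S /opt/marian -B /opt/marian/build -DUSE_MPI=on \
    && cmake --build /opt/marian/build -j"$(nproc)"

ENV PATH="/opt/marian/build:${PATH}"
```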
Multi-GPU: DONE
Multi-node: close
Will need to integrate mpirun into the ISC before Marian can be run on it.
Facing similar/same problems: hanging on multi-node; multi-GPU still OK.
Tomorrow: a more fine-grained test of MPI (message passing, i.e. how nodes communicate).
Not entirely clear that there will be a successful pathway to getting it running on ISC
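The mpirun integration could look roughly like the wrapper below (a sketch: the host file, rank count, and Marian flags are assumptions; it only prints the launch command so it can be reviewed before being wired into the ISC):

```shell
#!/bin/sh
# Hypothetical multi-node launch wrapper for Marian — dry run: builds and prints the command.
NP="${NP:-8}"                      # total MPI ranks, e.g. 2 nodes x 4 GPUs
HOSTFILE="${HOSTFILE:-hosts.txt}"  # one node per line; exact format depends on the MPI impl
CONFIG="${CONFIG:-train.yml}"      # Marian training config (placeholder name)

CMD="mpirun -np $NP --hostfile $HOSTFILE marian --config $CONFIG --devices 0 1 2 3"
echo "$CMD"
```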
NCCL (NVIDIA Collective Communications Library): handles GPU-to-GPU communication, e.g. how gradients are averaged/summed across devices for distributed training.
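The averaging NCCL performs can be illustrated in plain Python (no GPUs involved; this shows only the all-reduce "mean" semantics, not the NCCL API — all names here are illustrative):

```python
# Plain-Python illustration of all-reduce with a mean op: every rank contributes
# its local gradient vector, and every rank receives the same element-wise
# average back. This is the semantic NCCL implements efficiently over GPUs.

def allreduce_mean(grads_per_rank):
    """grads_per_rank: list of per-rank gradient vectors (lists of floats)."""
    n_ranks = len(grads_per_rank)
    summed = [sum(vals) for vals in zip(*grads_per_rank)]  # element-wise sum across ranks
    averaged = [s / n_ranks for s in summed]               # divide by the rank count
    return [list(averaged) for _ in range(n_ranks)]        # every rank gets the same result

# Two ranks with different local gradients:
result = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
# → [[2.0, 3.0], [2.0, 3.0]]  (both ranks now hold the averaged gradients)
```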
Source / repo
[URL]
Model description
[DESCRIPTION]
Dataset
[DATASET]
Literature benchmark source
[URL]
Literature benchmark performance
[DESCRIPTION] [VALUE/S]
Strong Compute result achieved
[VALUE/S]
Basic training config (as applicable)
Nodes: [N] Epochs: [N] Effective batch size: [N] Learning rate: [L] Optimizer: [OPT]
Logs gist
[URL]