StrongResearch / isc-demos

Deep learning examples for the Instant Super Computer

MarianMT #41

adam-peaston-SC opened this issue 10 months ago

adam-peaston-SC commented 10 months ago

Source / repo

[URL]

Model description

[DESCRIPTION]

Dataset

[DATASET]

Literature benchmark source

[URL]

Literature benchmark performance

[DESCRIPTION] [VALUE/S]

Strong Compute result achieved

[VALUE/S]

Basic training config (as applicable)

Nodes: [N] Epochs: [N] Effective batch size: [N] Learning rate: [L] Optimizer: [OPT]

Logs gist

[URL]

bensand commented 10 months ago

Size the effort on implementing this.

What would it take to get the C++ version cycling? (ballpark)
What would it take to get the PyTorch version behaving like the C++ version?
The response needed is a ballpark: 1d / 1w / 1m / 1y.

Do not spend more than 1hr on this estimation. You may want to discuss with Chris or Michael.

The next step after this will be to go to Translated and get a sense of the value, and the likelihood of us winning this work from them if we get it done.

StrongTanisha commented 10 months ago

Initial look: it needs some container even to install and run it properly; versioning needs containers as well.

for cycling:

It has some built-in checkpointing: save and load are already set up, and there is potentially some config you can pass through for how often you want to save (save-freq). After that, it should be pretty good for cycling. (A sketch of the equivalent save/load pattern on the PyTorch side is included after this list.)


We probably do need a container to run this, but after that, we should be able to train.

[containers integrated into Marian]

also, metrics need to be matched to theirs
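
For comparison, here is a minimal sketch of the equivalent save/load cycling pattern on the PyTorch side (the Hugging Face MarianMT port, not Marian's own C++ checkpointing); the model name, checkpoint path, and save interval are illustrative assumptions, not values from this issue:

```python
# Sketch only: PyTorch-side equivalent of Marian's save-frequency style cycling.
# Model name, paths, and intervals are assumptions for illustration.
import os
import torch
from transformers import MarianMTModel

CKPT = "checkpoint.pt"  # hypothetical checkpoint path
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")  # example model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

start_step = 0
if os.path.exists(CKPT):  # resume: the "load" half of cycling
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"]

SAVE_FREQ = 1000  # analogous to a save-frequency setting, value is an assumption
for step in range(start_step, 10_000):
    # ... forward / backward / optimizer.step() on a real batch would go here ...
    if step % SAVE_FREQ == 0:  # the "save" half of cycling
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            CKPT,
        )
```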

StrongTanisha commented 10 months ago

getting containers sorted:

Potentially changing the architecture. Spend 1-3 days on containers (Calvin, Tim, Fennecs, James).

StrongTanisha commented 10 months ago

Multi-GPU: DONE
Multi-node: close

mpirun will need to be integrated into the ISC before this can be run there.

StrongTanisha commented 10 months ago

Facing similar/same problems: hanging on multi-node. Multi-GPU is still OK.

Tomorrow: a more fine-grained test for MPI (message passing, i.e. how nodes communicate).
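
A minimal sketch of what such a fine-grained MPI test could look like, assuming mpi4py is available inside the container (that availability, and the script name, are assumptions):

```python
# Hypothetical MPI sanity check: verifies basic point-to-point message passing
# between ranks. Run with something like: mpirun -np 2 python mpi_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Rank 0 sends a small payload to every other rank and waits for replies.
    for dst in range(1, size):
        comm.send({"ping": dst}, dest=dst, tag=0)
    for src in range(1, size):
        reply = comm.recv(source=src, tag=1)
        print(f"rank 0 got reply from rank {src}: {reply}")
else:
    msg = comm.recv(source=0, tag=0)
    comm.send({"pong": rank, "echo": msg}, dest=0, tag=1)

comm.Barrier()  # if this hangs, the communicator itself is the problem
```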

It is not entirely clear that there will be a successful pathway to getting it running on the ISC.

NCCL (NVIDIA Collective Communications Library) handles GPU communication, e.g. how model gradients are averaged/summed across devices for distributed training.
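
To make the NCCL piece concrete, a minimal sketch of the averaging it performs, using torch.distributed with the NCCL backend; the launch command, script name, and tensor values are illustrative assumptions, not Marian's internals:

```python
# Sketch of what NCCL is used for here: averaging a tensor (e.g. gradients)
# across GPUs. Launch with e.g.: torchrun --nproc_per_node=2 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL handles GPU<->GPU communication
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Each rank holds a different value; all_reduce sums them across ranks.
    x = torch.full((4,), float(rank), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    x /= dist.get_world_size()  # sum -> average, as in gradient averaging
    print(f"rank {rank} averaged tensor: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```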

StrongTanisha commented 10 months ago