As usual, Lightning is nice but limited. It would be great to have a full PyTorch version with DeepSpeed so that we can experiment with 3D parallelism. Early experiments suggest that GPT-3 doesn't scale much past 32 nodes (on Polaris) using plain PyTorch Lightning + ZeRO-3 or FSDP. Scaling beyond this point likely requires pipeline (model) parallelism or tensor parallelism, neither of which is available through the Lightning interface.
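For reference, a minimal DeepSpeed config sketch for the ZeRO-3 setup outside Lightning (batch size and dtype here are placeholders, not our actual settings). Note that pipeline and tensor parallelism are not set in this JSON; they have to be wired up in code (e.g. DeepSpeed's `PipelineModule`, or a Megatron-style model for tensor parallelism), which is exactly what the Lightning interface doesn't expose:

```json
{
  "train_batch_size": 256,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true
  }
}
```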