guanzhchen closed this issue 2 months ago
Hi, thanks for your awesome work. In my test on 8xA800, why does USP with ulysses_degree=8 and ring_degree=1 use more GPU memory than naive Ulysses?
All2All needs some temporary buffers for async P2P. Could you post the memory difference? In my experience it is very small.
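For reporting the difference, here is a minimal sketch of one way to measure it, assuming a PyTorch setup. The helper names `peak_memory_mib` and `memory_overhead` are hypothetical and not part of this repo; `fn` stands for whatever callable runs one attention forward/backward in your test.

```python
def peak_memory_mib(fn, *args, **kwargs):
    """Run fn once on the current CUDA device and return peak allocated memory in MiB."""
    import torch  # imported lazily so memory_overhead below works without a GPU

    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()  # start from a clean peak counter
    fn(*args, **kwargs)
    torch.cuda.synchronize()              # wait for async kernels and communication
    return torch.cuda.max_memory_allocated() / 2**20


def memory_overhead(baseline_mib, candidate_mib):
    """Return (absolute difference in MiB, relative difference in percent)."""
    diff = candidate_mib - baseline_mib
    return diff, 100.0 * diff / baseline_mib
```

Call `peak_memory_mib` once per configuration on each rank (naive Ulysses as the baseline, USP with ulysses_degree=8 and ring_degree=1 as the candidate), then pass the two numbers to `memory_overhead`. Note that `torch.cuda.max_memory_allocated` only counts tensors managed by PyTorch's caching allocator, so buffers allocated by the communication library itself may not show up; comparing against `nvidia-smi` can help there.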