Gaffey / ExCP

Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking".
Apache License 2.0

The compression process abruptly aborted (8*1b model) #6

Closed NiuMa-1234 closed 2 months ago

NiuMa-1234 commented 3 months ago

Thank you very much for your work! I encountered a problem while compressing an 8×1B MoE model, and I would like to know whether you ran into it on LLMs, and how you solved it. Any reply would be appreciated.

My test on Pythia-410M works very well, but when I switch the model to an 8×1B MoE, the program stops with no warning or error while some keys still have not been compressed. Have you encountered this problem on LLMs such as LLaMA?

(Screenshot of the log output, 2024-07-19 16:41:10)

Gaffey commented 2 months ago

The printed output contains no detailed error description, but the SIGTERM suggests the process was killed for running out of memory. You could try a machine with more memory, or compress different parts of the network in multiple passes so that not all parameters are held in memory at once.
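The multi-pass idea can be sketched roughly as follows; this is an illustrative outline (the toy `state` dict and the half-precision "compression" are stand-ins, not ExCP's actual pipeline):

```python
import torch

def chunked(items, n):
    """Yield successive chunks of n (key, tensor) pairs."""
    items = list(items)
    for i in range(0, len(items), n):
        yield items[i:i + n]

# Toy state dict standing in for a full checkpoint.
state = {f"layer{i}.weight": torch.randn(32, 32) for i in range(10)}

# Compress a few keys at a time so the whole state dict never has to sit
# in memory alongside its compressed copy.
for part in chunked(state.items(), 3):
    for name, t in part:
        state[name] = t.half()  # stand-in for the real per-key compression
```

In a real run, each chunk could also be written to disk and freed before the next chunk is loaded.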

NiuMa-1234 commented 2 months ago

Thank you! I've found the error, which was my own fault: I changed the KMeans device from CPU to GPU but did not specify the GPU id, so the program always exited at that step. Now I'm able to compress the 8×1B model, thank you very much!!! By the way, is a result of only 7× smaller than the original size normal?
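For anyone hitting the same thing: the fix amounts to pinning every tensor in the clustering step to an explicit device index rather than a bare `cuda`. A minimal PyTorch sketch (the 1-D KMeans assignment below is illustrative, not ExCP's actual code):

```python
import torch

# Pick an explicit device index; a bare "cuda" can resolve differently
# across processes and leave tensors on mismatched GPUs.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

weights = torch.randn(1024, device=device)            # values to be clustered
centroids = weights[torch.randperm(1024, device=device)[:16]]

# One assignment step of 1-D KMeans; every tensor lives on `device`,
# so no implicit cross-device copy can occur.
assignments = (weights[:, None] - centroids[None, :]).abs().argmin(dim=1)
```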

Gaffey commented 2 months ago

You may need to tune the parameters for your model, since the weight distribution can be quite different in an MoE model. Adjust prune_alpha and prune_beta in compress_pythia.py and you may get a better compression ratio.
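For intuition on why these knobs move the compression ratio, here is a toy sweep where `alpha` scales a magnitude-based pruning threshold. This is only a stand-in criterion; ExCP's actual rule also involves the optimizer momentum:

```python
import torch

torch.manual_seed(0)
w = torch.randn(100_000)  # stand-in for one expert's weight tensor

# Illustrative only: a larger alpha raises the pruning threshold, zeroing
# more weights and improving the compression ratio (at some accuracy cost).
kept_fractions = {}
for alpha in (0.5, 1.0, 2.0):
    threshold = alpha * w.abs().mean()
    kept_fractions[alpha] = (w.abs() > threshold).float().mean().item()
    print(f"alpha={alpha}: fraction of weights kept = {kept_fractions[alpha]:.3f}")
```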

NiuMa-1234 commented 2 months ago

Understood! Thank you again for your help and kindness! By the way (sorry for so many BTWs), do you think a parallel, batched compression over all keys would be possible? I found that the compression of every key follows the same pattern, so I guess parallel computation might work.

Gaffey commented 2 months ago

The compression of each layer (or each key) in our approach is actually independent, so I think it can be parallelized.
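A minimal sketch of that idea, mapping a worker pool over the keys of a state dict; the int8 quantizer below is a naive stand-in for the real per-key compression step:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def compress_key(item):
    """Stand-in per-key compressor: naive symmetric int8 quantization."""
    name, t = item
    scale = t.abs().max().item() / 127 or 1.0
    q = (t / scale).round().to(torch.int8)
    return name, (q, scale)

# Toy state dict; each key is compressed independently, so the map can
# run in parallel with no coordination between keys.
state = {f"layer{i}.weight": torch.randn(64, 64) for i in range(4)}

with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = dict(pool.map(compress_key, state.items()))
```

With process-based workers (or one GPU stream per worker) the same pattern would apply, at the cost of moving each tensor to its worker.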

NiuMa-1234 commented 2 months ago

Thank you! Looking forward to your next work!