exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚

Context collapse when running Llama3-70B #23

Open matt-pulsipher opened 1 month ago

matt-pulsipher commented 1 month ago

I am seeing contextual collapse after just a few messages when running llama3-70B on a Mac mini cluster. This only appears to affect 70B, and doesn't seem to affect 8B.

The environment has 4 Mac mini nodes, 2x M2 with 16GB of RAM, and 2x M2 Pro with 32GB of RAM. All are running macOS Sonoma 14.4, with Python 3.12.4 installed via brew and added to path. I made sure to pull the latest code from today, with dependencies re-installed as well on each node.

Nodes are started in sequential order, so that the first node is an M2 Pro node and is also used as the endpoint when hitting the API. I notice that if I start them in a different order, the API doesn't return a response, although generation still completes in the terminal.

First prompt:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Write a brief history of the iPhone."}],                
     "temperature": 0.7
   }'

This always times out on the first run, since the model downloads sequentially across the nodes before generation starts. The terminal still shows the following generated text:

Here is a brief history of the iPhone:

**2007: The First Generation**

The first iPhone was announced by Steve Jobs, the co-founder and CEO of Apple, on January 9, 2007, at the Macworld conference in San Francisco. The iPhone was a revolutionary device that combined a mobile phone, an iPod, and an internet communications device into one product. It was released to the public on June 29, 2007.

**2008: The 3G iPhone**

The second generation of iPhone, known as the iPhone 3G, was released in 2008. This device added support for 3G networks, allowing for faster data speeds.

**2009: The iPhone 3GS**

The third generation of iPhone, known as the iPhone 3GS, added video recording capabilities, improved battery life, and a faster processor.

**2010: The iPhone 4**

The fourth generation of iPhone, known as the iPhone 4, featured a new design, a higher resolution display, improved cameras, and a faster processor.

**2011: The iPhone 4S**

The fifth generationofi iPhone, known asiPhone 4S, added Siri, a virtual assistant, improved battery life, improved cameras, a faster processor.

Subsequent API calls return successfully, but seem to keep the context rather than treating the message as a new prompt:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Please continue."}],
     "temperature": 0.7
   }'

{"id": "chatcmpl-1b2c6ebd-87dd-4b58-87aa-5dda0490b205", "object": "chat.completion", "created": 1721249846, "model": "llama-3-70b", "usage": {"prompt_tokens": 13, "completion_tokens": 257, "total_tokens": 270}, "choices": [{"message": {"role": "assistant", "content": "**2012: The iPhone 5**\n\nThe sixth generation of iPhone, known asiPhone 5, featured a larger display, improved cameras, a faster processor, improved battery life, a new Lightning connector.\n\n**2013: The iPhone 5C**\n\nThe seventh generationofiiPhone, known asiPhone 5C, featured a new design, improved cameras, a faster processor, improved battery life, a new Lightning connector.\n\n**2014: The iPhone 6**\n\nThe eighth generationofiiPhone,known asiPhone 6, featured a larger display, improved cameras,a faster processor, improved battery life,a new Lightning connector.\n\n**2015: The iPhone 6 Plus**\n\nThe ninth generationofiiPhone,known asiPhone 6 Plus, featured a new design, improved cameras,a faster processor,improved battery life,a new Lightning connector.\n\n**2017: The iPhone X**\n\nThe tenth generationofiiPhone,knownasiPhone X, featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2018: The iPhone XS**\n\nThe eleventh generationofiiPhone,knownasiPhone XS,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2019: The"}, "logprobs": null, "finish_reason": "stop", "index": 0}]}%  

You can see that the quality of the generation degrades over time, and certain grammatical/spelling errors persist once introduced. Additionally, if I send a very different question, it ignores the new question and continues the context of the previous one:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Write about llamas."}],
     "temperature": 0.7
   }'
{"id": "chatcmpl-8b7b2f5e-6f59-4374-a6af-18e70b6639ef", "object": "chat.completion", "created": 1721249971, "model": "llama-3-70b", "usage": {"prompt_tokens": 15, "completion_tokens": 134, "total_tokens": 149}, "choices": [{"message": {"role": "assistant", "content": "**2020: The iPhone 12**\n\nThe twelfth generationofiiPhone,knownasiPhone 12,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2021: The iPhone 13**\n\nThe thirteenth generationofiiPhone,knownasiPhone 13,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2022: The iPhone 14**\n\nThe fourteenth generationofiiPhone,knownasiPhone 14,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\nAnd so on."}, "logprobs": null, "finish_reason": "stop", "index": 0}]}
AlexCheema commented 1 month ago

First of all, thanks a lot for taking the time to run exo when it's still experimental. Most of all, thank you so much for making an issue - these help more than anything.

This is interesting. I'll need to investigate some more.

The first question to answer is a design question:

  • Do we want to keep the kv cache around between requests? ---> I'm pretty sure this is not how most inference APIs work - you have to send the entire context every time, although on the backend the inference server might keep the cache as an optimisation. Right now, exo keeps the cache around until you explicitly call reset_all via grpc (which isn't supported via the ChatGPT API endpoint).

Your point about the order you start the servers in: this is based on the node-id of each node when it's started. If you don't explicitly set the node-id then it will be randomly generated (this code is here: https://github.com/exo-explore/exo/blob/main/main.py#L16). Then, the ring memory weighted partitioning strategy will sort by node-id alphabetically (the code for that is here: https://github.com/exo-explore/exo/blob/main/exo/topology/ring_memory_weighted_partitioning_strategy.py).

If you want the nodes to start in the same order every time, specify a node-id when starting e.g. python3 main.py --node-id "node1" and python3 main.py --node-id "node2". In this case, node1 will always be first and node2 will be the tail node which you can query via the ChatGPT API endpoint.
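
To illustrate the first point above about the kv cache, here is a minimal client-side sketch (not exo's code; it only assumes the endpoint and model name from the curl examples above) of how most inference APIs expect to be used: the client re-sends the full message history on every request, and any kv-cache reuse stays a server-side optimisation.

import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # endpoint from the examples above
history = []

def ask(content):
    # The client owns the conversation and re-sends all of it with each request.
    history.append({"role": "user", "content": content})
    resp = requests.post(API_URL, json={
        "model": "llama-3-70b",
        "messages": history,       # the entire context, every time
        "temperature": 0.7,
    })
    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("Write a brief history of the iPhone."))
print(ask("Please continue."))  # continues only because the client sent the earlier turns

Under that contract, two back-to-back requests that each contain a single, unrelated message should never influence each other.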

AlexCheema commented 1 month ago

I pushed a quality-of-life improvement so you can use the ChatGPT API endpoint from any node: https://github.com/exo-explore/exo/commit/8a35fd83f6e07b51b62e0dbe49028c9ef5f0455b

matt-pulsipher commented 1 month ago

@AlexCheema Awesome, I'll pull the latest on that.

With regards to the caching optimization mentioned, I believe that maintaining context between requests will break many apps that integrate with the OpenAI API, especially in multi-user contexts. For example, you may have an agent framework that chains different prompts together, with the assumption that each individual prompt's context is separate, to try to solve a problem from different angles. And if a cluster is being used by multiple users, you could have contamination between conversations fairly easily.

What may work is managing a shared set of context caches across the cluster, reused when a prompt begins with previously sent content, though I'm not sure about the implementation details or whether this is feasible.
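
Roughly what I have in mind, purely as an illustration - the names here are made up for the sketch and are not exo's actual code or API:

from dataclasses import dataclass

@dataclass
class CacheEntry:
    token_ids: list   # tokens already folded into this kv cache
    kv: object        # opaque per-layer kv tensors for those tokens

class PrefixCachePool:
    def __init__(self):
        self.entries = []

    def lookup(self, prompt_ids):
        # Return (entry, reused_len) for the longest cached prefix of this prompt,
        # or (None, 0) so the caller starts a fresh, isolated cache.
        best, best_len = None, 0
        for entry in self.entries:
            n = len(entry.token_ids)
            if n > best_len and prompt_ids[:n] == entry.token_ids:
                best, best_len = entry, n
        return best, best_len

That way a prompt only reuses state it literally contains as a prefix, and unrelated conversations (or different users) never bleed into each other.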

AlexCheema commented 1 month ago

Closed by mistake

AlexCheema commented 1 month ago

Can you try this again @matt-pulsipher? I can't reproduce anymore, and I fixed a few things recently.

mzbac commented 1 month ago

> First of all, thanks a lot for taking the time to run exo when it's still experimental. Most of all, thank you so much for making an issue - these help more than anything.
>
> This is interesting. I'll need to investigate some more.
>
> The first question to answer is a design question:
>
>   • Do we want to keep the kv cache around between requests? ---> I'm pretty sure this is not how most inference APIs work - you have to send the entire context every time, although on the backend the inference server might keep the cache as an optimisation. Right now, exo keeps the cache around until you explicitly call reset_all via grpc (which isn't supported via the ChatGPT API endpoint).
>
> Your point about the order you start the servers in: this is based on the node-id of each node when it's started. If you don't explicitly set the node-id then it will be randomly generated (this code is here: https://github.com/exo-explore/exo/blob/main/main.py#L16). Then, the ring memory weighted partitioning strategy will sort by node-id alphabetically (the code for that is here: https://github.com/exo-explore/exo/blob/main/exo/topology/ring_memory_weighted_partitioning_strategy.py).
>
> If you want the nodes to start in the same order every time, specify a node-id when starting e.g. python3 main.py --node-id "node1" and python3 main.py --node-id "node2". In this case, node1 will always be first and node2 will be the tail node which you can query via the ChatGPT API endpoint.

I don't have the full context of the issue, but one thing I noticed is that in current exo's model sharding, decoder layers are renamed based on the sharded layers, which complicates sharing the kv cache between nodes. It would actually be better to just replace the decoder layers with identity blocks if they are not part of the shard. https://github.com/mzbac/mlx_sharding/blob/main/server/model/deepseek_v2.py#L427-L431

In that way, each node would simply initialize its own kv cache and update it based on the sharded layers.
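
For illustration, the shape of the idea is something like this (a framework-agnostic sketch, not the linked deepseek_v2.py code):

class IdentityBlock:
    # Stands in for a decoder layer that lives on another node: it passes the
    # hidden state through and never touches its kv-cache slot.
    def __call__(self, x, cache=None):
        return x

def apply_shard(model, start_layer, end_layer):
    # Keep the original layer indices; only layers in [start_layer, end_layer)
    # stay real, everything else becomes a pass-through.
    for i in range(len(model.layers)):
        if not (start_layer <= i < end_layer):
            model.layers[i] = IdentityBlock()
    return model

Since the layer indices never change, the kv cache keeps the same structure on every node, which is what makes the per-node cache straightforward.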

AlexCheema commented 1 month ago

> > First of all, thanks a lot for taking the time to run exo when it's still experimental. Most of all, thank you so much for making an issue - these help more than anything. This is interesting. I'll need to investigate some more. The first question to answer is a design question:
> >
> >   • Do we want to keep the kv cache around between requests? ---> I'm pretty sure this is not how most inference APIs work - you have to send the entire context every time, although on the backend the inference server might keep the cache as an optimisation. Right now, exo keeps the cache around until you explicitly call reset_all via grpc (which isn't supported via the ChatGPT API endpoint).
> >
> > Your point about the order you start the servers in: this is based on the node-id of each node when it's started. If you don't explicitly set the node-id then it will be randomly generated (this code is here: https://github.com/exo-explore/exo/blob/main/main.py#L16). Then, the ring memory weighted partitioning strategy will sort by node-id alphabetically (the code for that is here: https://github.com/exo-explore/exo/blob/main/exo/topology/ring_memory_weighted_partitioning_strategy.py). If you want the nodes to start in the same order every time, specify a node-id when starting e.g. python3 main.py --node-id "node1" and python3 main.py --node-id "node2". In this case, node1 will always be first and node2 will be the tail node which you can query via the ChatGPT API endpoint.
>
> I don't have the full context of the issue, but one thing I noticed is that in current exo's model sharding, decoder layers are renamed based on the sharded layers, which complicates sharing the kv cache between nodes. It would actually be better to just replace the decoder layers with identity blocks if they are not part of the shard. https://github.com/mzbac/mlx_sharding/blob/main/server/model/deepseek_v2.py#L427-L431
>
> In that way, each node would simply initialize its own kv cache and update it based on the sharded layers.

I like your thinking @mzbac. This is much cleaner. Would you be interested in working on this and making it into a PR?

mzbac commented 1 month ago

> > > First of all, thanks a lot for taking the time to run exo when it's still experimental. Most of all, thank you so much for making an issue - these help more than anything. This is interesting. I'll need to investigate some more. The first question to answer is a design question:
> > >
> > >   • Do we want to keep the kv cache around between requests? ---> I'm pretty sure this is not how most inference APIs work - you have to send the entire context every time, although on the backend the inference server might keep the cache as an optimisation. Right now, exo keeps the cache around until you explicitly call reset_all via grpc (which isn't supported via the ChatGPT API endpoint).
> > >
> > > Your point about the order you start the servers in: this is based on the node-id of each node when it's started. If you don't explicitly set the node-id then it will be randomly generated (this code is here: https://github.com/exo-explore/exo/blob/main/main.py#L16). Then, the ring memory weighted partitioning strategy will sort by node-id alphabetically (the code for that is here: https://github.com/exo-explore/exo/blob/main/exo/topology/ring_memory_weighted_partitioning_strategy.py). If you want the nodes to start in the same order every time, specify a node-id when starting e.g. python3 main.py --node-id "node1" and python3 main.py --node-id "node2". In this case, node1 will always be first and node2 will be the tail node which you can query via the ChatGPT API endpoint.
> >
> > I don't have the full context of the issue, but one thing I noticed is that in current exo's model sharding, decoder layers are renamed based on the sharded layers, which complicates sharing the kv cache between nodes. It would actually be better to just replace the decoder layers with identity blocks if they are not part of the shard. https://github.com/mzbac/mlx_sharding/blob/main/server/model/deepseek_v2.py#L427-L431 In that way, each node would simply initialize its own kv cache and update it based on the sharded layers.
>
> I like your thinking @mzbac. This is much cleaner. Would you be interested in working on this and making it into a PR?

Yeah, this will require some refactoring of the current code. Let me see if I can get it working. Also, I think it would be beneficial in the future to share the cache when a node goes offline so that inference can continue on another node since all the kv caches have the same structure.
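
As a very rough sketch of what that hand-off could look like (the names are illustrative only; none of this exists in exo yet):

def export_cache(kv_cache, shard_layers):
    # Collect this node's populated kv slots, keyed by the original layer index.
    return {i: kv_cache[i] for i in shard_layers if kv_cache.get(i) is not None}

def import_cache(kv_cache, exported):
    # Another node with the same per-layer cache layout can merge the slots in
    # and resume decoding where the failed node left off.
    for layer_idx, entry in exported.items():
        kv_cache[layer_idx] = entry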

AlexCheema commented 1 month ago

Awesome, let me know if you need any help / want to run anything by me!

Also, I love this idea - made an issue for it here: https://github.com/exo-explore/exo/issues/52

matt-pulsipher commented 1 month ago

@AlexCheema It looks like I'm seeing the same behavior in the version pulled this morning, perhaps with a marginal improvement in quality across prompts. The context still seems to be shared between prompts, though it looks like there's an architectural change planned to address this:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Write a brief history of the iPhone."}],
     "temperature": 0.7
   }'
{"detail": "Response generation timed out"}
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Please continue."}],   
     "temperature": 0.7
   }'
{"id": "chatcmpl-ea1cbdb9-41d1-4b10-990a-5aae315ade03", "object": "chat.completion", "created": 1721658604, "model": "llama-3-70b", "system_fingerprint": "exo_0.0.1", "choices": [{"index": 0, "message": {"role": "assistant", "content": "**2012: The iPhone 5**\n\nThe sixth generation of iPhone, known asiPhone 5, featured a larger display, improved cameras, a faster processor, improved battery life, a new Lightning connector.\n\n**2013: The iPhone 5C**\n\nThe seventh generationofiiPhone, known asiPhone 5C, featured a new design, improved cameras, a faster processor, improved battery life, a new Lightning connector.\n\n**2014: The iPhone 6**\n\nThe eighth generationofiiPhone,known asiPhone 6, featured a larger display, improved cameras,a faster processor, improved battery life,a new Lightning connector.\n\n**2015: The iPhone 6 Plus**\n\nThe ninth generationofiiPhone,known asiPhone 6 Plus, featured a new design, improved cameras,a faster processor,improved battery life,a new Lightning connector.\n\n**2017: The iPhone X**\n\nThe tenth generationofiiPhone,knownasiPhone X, featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2018: The iPhone XS**\n\nThe eleventh generationofiiPhone,knownasiPhone XS,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2019: The"}, "logprobs": null, "finish_reason": "length"}], "usage": {"prompt_tokens": 13, "completion_tokens": 257, "total_tokens": 270}}%
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Write about llamas."}],
     "temperature": 0.7
   }'
{"id": "chatcmpl-83f54495-53b3-4249-9ac1-8086362311ee", "object": "chat.completion", "created": 1721658668, "model": "llama-3-70b", "system_fingerprint": "exo_0.0.1", "choices": [{"index": 0, "message": {"role": "assistant", "content": "**2020: The iPhone 12**\n\nThe twelfth generationofiiPhone,knownasiPhone 12,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2021: The iPhone 13**\n\nThe thirteenth generationofiiPhone,knownasiPhone 13,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2022: The iPhone 14**\n\nThe fourteenth generationofiiPhone,knownasiPhone 14,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\nAnd so on."}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 15, "completion_tokens": 134, "total_tokens": 149}}