bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License

Inference timeout on larger input prompts #250

Closed gururise closed 1 year ago

gururise commented 1 year ago

I'm currently using the chatbot example, where the inference session is kept open and tokens are generated one at a time:

    # models[model_name] is a (tokenizer, model) pair; DEVICE, prompt, top_p, temperature,
    # repetition_penalty, max_tokens, stop, kill, flag and output are defined elsewhere.
    with models[model_name][1].inference_session(max_length=512) as sess:
        print(f"Thread Start -> {threading.get_ident()}")
        output[model_name] = ""
        inputs = models[model_name][0](prompt, return_tensors="pt")["input_ids"].to(DEVICE)
        n_input_tokens = inputs.shape[1]
        token_cnt = 0
        done = False
        while not done and not kill.is_set():
            # Generate one token at a time, reusing the same inference session.
            outputs = models[model_name][1].generate(
                inputs,
                max_new_tokens=1,
                do_sample=True,
                top_p=top_p,
                temperature=temperature,
                repetition_penalty=repetition_penalty,
                session=sess
            )
            output[model_name] += models[model_name][0].decode(outputs[0, n_input_tokens:])
            token_cnt += 1
            print("\n[" + str(threading.get_ident()) + "]" + output[model_name], end="", flush=True)

            # Stop if any of the stop words appears in the decoded output so far.
            for stop_word in stop:
                stop_word = codecs.getdecoder("unicode_escape")(stop_word)[0]
                if stop_word != '' and stop_word in output[model_name]:
                    print(f"\nDONE (stop) -> {threading.get_ident()}")
                    done = True
            # Stop when asked to, or when the token budget is exhausted.
            if flag or (token_cnt >= max_tokens):
                print(f"\nDONE (max tokens) -> {threading.get_ident()}")
                done = True
            inputs = None  # The prefix is passed only for the 1st token of the bot's response
            n_input_tokens = 0

When I pass in a small prompt, inference works:

PROMPT

Please answer the following question: Question: What is the capital of Germany? Answer:

Berlin, Germany

A slightly larger prompt always results in timeout errors:

PROMPT

Given a pair of sentences, choose whether the two sentences agree (entailment)/disagree (contradiction) with each other. Possible labels: 1. entailment 2. contradiction Sentence 1: The skier was on the edge of the ramp. Sentence 2: The skier was dressed in winter clothes. Label: entailment Sentence 1: The boy skated down the staircase railing. Sentence 2: The boy is a newbie skater. Label: contradiction Sentence 1: Two middle-aged people stand by a golf hole. Sentence 2: A couple riding in a golf cart. Label:

Feb 03 16:16:37.377 [INFO] Peer 12D3KooWJALV7xRuHLzJHAftZhmSeqz68hywh1oK8oYmW844vWHt did not respond, banning it temporarily
Feb 03 16:16:37.377 [WARN] [/home/gene/dockerx/temp/petals/src/petals/client/inference_session.py.step:311] Caught exception when running inference from block 16 (retry in 0 sec): TimeoutError()
Feb 03 16:16:37.378 [WARN] [/home/gene/dockerx/temp/petals/src/petals/client/routing/sequence_manager.py.make_sequence:109] Remote SequenceManager is still searching for routes, waiting for it to become ready
Feb 03 16:17:10.908 [INFO] Peer 12D3KooWJALV7xRuHLzJHAftZhmSeqz68hywh1oK8oYmW844vWHt did not respond, banning it temporarily
Feb 03 16:17:10.908 [WARN] [/home/gene/dockerx/temp/petals/src/petals/client/inference_session.py.step:311] Caught exception when running inference from block 16 (retry in 1 sec): TimeoutError()
Feb 03 16:17:11.909 [WARN] [/home/gene/dockerx/temp/petals/src/petals/client/routing/sequence_manager.py.make_sequence:109] Remote SequenceManager is still searching for routes, waiting for it to become ready
borzunov commented 1 year ago

Hi @gururise!

Thanks for reporting. One hypothesis is that the default client-side timeout may be too low. Can you please add request_timeout=300 when you create the model (here) and try again? This will set the timeout to 5 minutes instead of the default 30 seconds.

If it helps, we'll consider increasing the default timeout. Otherwise, we'll continue investigating.
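
For reference, a minimal sketch of what this looks like. The checkpoint and model class here are just the usual Petals BLOOM example names, and the exact value is an assumption; substitute whatever you actually load:

    from transformers import BloomTokenizerFast
    from petals import DistributedBloomForCausalLM

    MODEL_NAME = "bigscience/bloom-petals"  # placeholder checkpoint name

    tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
    model = DistributedBloomForCausalLM.from_pretrained(
        MODEL_NAME,
        request_timeout=300,  # 5 minutes instead of the 30-second default
    )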

slush0 commented 1 year ago

Hello, I have been digging into this problem on my side and can confirm it is happening to me, too. Thanks to @gururise for describing it in detail and making the issue reproducible!

Originally I thought it was a networking issue on my side (I discussed this on Discord a few days ago), but later I realized it is something inside Hivemind/Petals itself.

I'm actually hitting these timeout errors from my own node, running the inference script on the same machine. I can confirm this happens even on an under-utilized machine (no CPU load, plenty of RAM) and without any networking issues at the moment of the timeouts.

gururise commented 1 year ago

> Hi @gururise!
>
> Thanks for reporting. One hypothesis is that the default client-side timeout may be too low. Can you please add request_timeout=300 when you create the model (here) and try again? This will set the timeout to 5 minutes instead of the default 30 seconds.
>
> If it helps, we'll consider increasing the default timeout. Otherwise, we'll continue investigating.

Increasing request_timeout to 300 did resolve the issue for the 2nd prompt I gave in my first post.

However, by increasing the prompt to 345 words, I can almost always cause inference to time out. With the prompt below, a request_timeout of 300 is not large enough:

Africa is a vast continent, with 54 countries. Although some confuse the entire continent with being a single country. Africa is home to the largest land animal in the world – the African elephant. Africa is the most centrally-located continent on the planet. Both the equator and the Greenwich Meridian line cross it. The largest African country is Algeria. Africa holds the name of being the biggest oil producer in the world. As well as the fastest animal in the world – the cheetah. The world’s largest desert (Sahara) is also situated in Africa. Africa’s largest island is Madagascar. Africa is home to 25% of the world’s bird species. There are over 2500 kinds of birds found throughout its countries. Africa is the world’s hottest continent, with a town in Ethiopia seeing average temperatures of 33.9 °C throughout the year. Four of the five fastest land animals can be found in East Africa. These are the lion, the gazelle, the wildebeest, and the cheetah. The Sahara desert is currently larger than the entire United States. And it continues to grow each year! Inside the country of South Africa is a smaller, landlocked country called Lesotho. The only African countries that weren’t colonized by Europeans were Ethiopia and Liberia. The smallest country in Africa is the Seychelles, which is also an island. It’s also home to the tallest animal in the world, the giraffe. South Africa, officially the Republic of South Africa (RSA) is the southernmost country in Africa. It has an area of 1,219,090 square km. Its capital is Pretoria and largest city is Johannesburg. Zulu, Xhosa, Afrikaans, English, Tsonga, Swazi, and Venda are some of its official languages. Its official currency is South African rand (ZAR). Six countries that share land borders with South Africa are Botswana, Mozambique, Namibia, Swaziland, Lesotho and Zimbabwe. South Africa is a multiethnic society encompassing a wide variety of cultures, languages, and religions.

Question: Where is the world's largest desert? Answer: The Sahara desert is the world's largest desert. Question: Where was the temperatures of 33.9 °C measured? Answer:

Feb 06 13:00:45.865 [INFO] Peer 12D3KooWRftAHGeKyYmq35tn5Daqiu4D9767xUfZJX7E6LH2yKs9 did not respond, banning it temporarily
Feb 06 13:00:45.865 [WARN] [/home/gene/dockerx/bloom/llmvenv/lib/python3.10/site-packages/petals/client/inference_session.py.step:311] Caught exception when running inference from block 16 (retry in 0 sec): ConnectionResetError(104, 'Connection reset by peer')
Feb 06 13:00:45.910 [WARN] [/home/gene/dockerx/bloom/llmvenv/lib/python3.10/site-packages/petals/client/routing/sequence_manager.py.make_sequence:109] Remote SequenceManager is still searching for routes, waiting for it to become ready
Feb 06 13:00:45.910 [INFO] Peer 12D3KooWPfbFqvns4caiPKEPoChchujBTJqLN6KC8CJBSkPjUDL5 did not respond, banning it temporarily
Feb 06 13:00:45.910 [WARN] [/home/gene/dockerx/bloom/llmvenv/lib/python3.10/site-packages/petals/client/inference_session.py.step:311] Caught exception when running inference from block 54 (retry in 0 sec): ConnectionResetError(104, 'Connection reset by peer')
Feb 06 13:00:45.910 [WARN] [/home/gene/dockerx/bloom/llmvenv/lib/python3.10/site-packages/petals/client/routing/sequence_manager.py.make_sequence:109] Remote SequenceManager is still searching for routes, waiting for it to become ready

I don't know if there is a better solution; however, thank you for pointing out the request_timeout parameter. At least I can probably dynamically compute a value high enough based upon the number of tokens in the input prompt.
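
Something along these lines is what I have in mind. This is only a rough sketch: the scaling constants are made up and would need tuning, and request_timeout still has to be passed when the model is created:

    def pick_request_timeout(prompt, tokenizer, base=60.0, sec_per_token=0.5, cap=1800):
        """Scale the client-side timeout with the number of prompt tokens (constants are guesses)."""
        n_tokens = len(tokenizer(prompt)["input_ids"])
        return int(min(cap, base + n_tokens * sec_per_token))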

borzunov commented 1 year ago

@gururise,

Thanks for trying it out. I guess you can just use a very large request_timeout, e.g., request_timeout=1800 (that's the max supported value). We'll think about using a dynamically computed value by default in future releases.

An alternative is to keep a smaller timeout and process the prompt chunk by chunk. In that case, you may need to fix .generate() so that it works with the max_new_tokens=0 argument, or just implement the inference loop yourself, as we do in Step 5 of our tutorial notebook.
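
Roughly, the chunked variant could look like the sketch below. It assumes .generate() has been patched to accept max_new_tokens=0, which is not guaranteed in the current release; tokenizer, model, DEVICE, and the chunk size are placeholders:

    CHUNK = 64  # tokens per remote call; arbitrary choice
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(DEVICE)
    n = input_ids.shape[1]

    with model.inference_session(max_length=512) as sess:
        # Feed all but the last prompt token in slices, so each remote call
        # stays well under the request timeout.
        for start in range(0, n - 1, CHUNK):
            chunk = input_ids[:, start:min(start + CHUNK, n - 1)]
            model.generate(chunk, max_new_tokens=0, session=sess)  # needs the max_new_tokens=0 fix
        # Ask for the first new token, passing only the final prompt token.
        outputs = model.generate(input_ids[:, n - 1:], max_new_tokens=1, do_sample=True, session=sess)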

gururise commented 1 year ago

> @gururise,
>
> Thanks for trying it out. I guess you can just use a very large request_timeout, e.g., request_timeout=1800 (that's the max supported value). We'll think about using a dynamically computed value by default in future releases.
>
> An alternative is to keep a smaller timeout and process the prompt chunk by chunk. In that case, you may need to fix .generate() so that it works with the max_new_tokens=0 argument, or just implement the inference loop yourself, as we do in Step 5 of our tutorial notebook.

Thanks for the tip! Appreciate it!

borzunov commented 1 year ago

@gururise @slush0 Follow-up: we did increase the default request_timeout from 30 sec to 3 min in #276. This should address TimeoutErrors that happened while running inference with a large prefix or fine-tuning with a large batch.