This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
azure-ai-generative[synthetic] Long-running simulator job fails if AOAI connection reset by peer #34961
Operating System: Linux aml-ci 5.15.0-1040-azure 47~20.04.1-Ubuntu SMP Fri Jun 2 21:38:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python Version: 3.10.13
Describe the bug
A long-running instance of azure.ai.generative.synthetic.simulator.Simulator.simulate_asyc(...) with 1000s of chat templates (mock user profiles) against AOAI GPT-4 dies if the underlying aiohttp-based connection is reset by the peer.
To Reproduce
Difficult to reproduce outside of the customer's environment, and unclear which peer caused the reset (AML CI running this package, or the AOAI GPT-4 endpoint).
Expected behavior
I expected this exception to be caught and logged, and the connection to be retried. Instead, the Simulator.simulate_asyc(...) call terminates early and fails to complete all conversations.
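To make the expected behavior concrete, here is a minimal sketch of catch, log, and retry with exponential backoff. It is not the simulator's actual code: `call_with_retry`, `make_request`, and every parameter name are hypothetical, and it uses only the standard library. It assumes the failing request can be wrapped in a zero-argument coroutine function. To my knowledge aiohttp's ClientOSError derives from OSError, so the default `retry_on=(OSError,)` should also cover the error in the stack trace below, but treat that as an assumption to verify.

```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)


async def call_with_retry(
    make_request,            # hypothetical: zero-argument coroutine function
    *,
    max_attempts=5,
    base_delay=1.0,
    max_delay=30.0,
    retry_on=(OSError,),     # ConnectionResetError is a subclass of OSError
):
    """Await make_request(), retrying on connection-level errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_request()
        except retry_on as exc:
            if attempt == max_attempts:
                # Out of attempts: log and let the caller see the error.
                logger.error("giving up after %d attempts: %r", attempt, exc)
                raise
            # Exponential backoff, capped at max_delay, plus random jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay / 2)
            logger.warning(
                "attempt %d/%d failed (%r); retrying in %.1fs",
                attempt, max_attempts, exc, delay,
            )
            await asyncio.sleep(delay)
```

A caller would wrap the request in a small coroutine function and pass it in, for example `await call_with_retry(lambda: send_chat_request(payload))`, where `send_chat_request` stands in for whatever actually issues the POST.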
Screenshots
Stack trace of the exception that caused Simulator.simulate_asyc(...) to terminate early:
...
File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_conversation/conversation.py", line 61, in simulate_conversation
(first_response, request, _, full_response) = await bots[0].generate_response(
File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_conversation/conversation_bot.py", line 142, in generate_response
response = await self.model.get_conversation_completion(
File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_model_tools/models.py", line 586, in get_conversation_completion
return await self.request_api(
File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_model_tools/models.py", line 495, in request_api
async with session.post(
File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 149, in __aenter__
return await self._do_request()
File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 138, in _do_request
raise e
File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 100, in _do_request
response: ClientResponse = await self._request_func(
File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/client.py", line 605, in _request
await resp.start(conn)
File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 966, in start
message, payload = await protocol.read() # type: ignore[union-attr]
File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/streams.py", line 622, in read
await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
Additional context
We (ISE) are currently building an evaluation framework using this package with a customer to stress-test their new AOAI-based CoPilot chat-bot. Early termination of the simulator means the pipeline to generate their synthetic evaluation data is breaking.
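Until this is handled inside the package, a caller-side pattern that would keep a single failed conversation from ending the whole run looks roughly like the sketch below. It is an illustration only: `run_all` and `conversation_factories` are hypothetical names, and it assumes each conversation can be launched independently as its own coroutine, which may not match how the Simulator schedules work internally.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def run_all(conversation_factories, *, concurrency=4):
    """Run every conversation; collect failures instead of aborting.

    conversation_factories: hypothetical list of zero-argument coroutine
    functions, one per conversation.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(factory):
        # Limit how many conversations hit the endpoint at once.
        async with semaphore:
            return await factory()

    # return_exceptions=True makes gather return raised exceptions as
    # results rather than propagating the first one to the caller.
    results = await asyncio.gather(
        *(guarded(factory) for factory in conversation_factories),
        return_exceptions=True,
    )

    completed, failed = [], []
    for index, result in enumerate(results):
        if isinstance(result, BaseException):
            logger.warning("conversation %d failed: %r", index, result)
            failed.append((index, result))
        else:
            completed.append(result)
    return completed, failed
```

The indices in `failed` can then be used to rerun only the conversations that did not complete, rather than restarting the whole job.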
@xiaolul