azure-ai-generative[synthetic] Long-running simulator job fails if AOAI connection reset by peer

Package Name: azure-ai-generative[synthetic]
Package Version: 1.0.0b8 (built from main)
Operating System: Linux aml-ci 5.15.0-1040-azure 47~20.04.1-Ubuntu SMP Fri Jun 2 21:38:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python Version: 3.10.13

Describe the bug: A long-running call to azure.ai.generative.synthetic.simulator.Simulator.simulate_async(...) with thousands of chat templates (mock user profiles) against AOAI GPT-4 dies if the underlying aiohttp-based connection is reset by peer.

To Reproduce: Difficult to reproduce outside of the customer's environment. It is also unclear which peer caused the reset (the AML CI running this package, or the AOAI GPT-4 endpoint).

Expected behavior: I expected this exception to be caught and logged, and the request to be retried. Instead, the Simulator.simulate_async(...) call terminates early and fails to complete all conversations.
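To illustrate the behavior I had in mind, here is a minimal sketch using only the standard library. It is not the package's implementation: `call_with_retry` and `run_all` are hypothetical names, and the retry counts and delays are arbitrary. It relies on the fact that aiohttp's `ClientOSError` is a subclass of `OSError`, so an `except OSError` clause would also catch the error in the stack trace below.

```python
import asyncio
import logging
import random

logger = logging.getLogger("simulator_retry_sketch")  # hypothetical logger name


async def call_with_retry(make_call, *, attempts=5, base_delay=0.5, max_delay=30.0):
    """Await make_call(), retrying with exponential backoff on OSError.

    make_call is a zero-argument callable returning a fresh awaitable, so
    each retry issues a new request instead of re-awaiting a finished one.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except OSError as exc:  # includes ConnectionResetError (errno 104)
            if attempt == attempts:
                logger.error("Giving up after %d attempts: %r", attempt, exc)
                raise
            # Exponential backoff, capped at max_delay, plus a little jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay / 10)
            logger.warning(
                "Attempt %d/%d failed (%r); retrying in %.2fs",
                attempt, attempts, exc, delay,
            )
            await asyncio.sleep(delay)


async def run_all(conversation_factories, **retry_kwargs):
    """Run every conversation; one failure does not cancel the others.

    Returns a list in input order holding either the result or the final
    exception of each conversation.
    """
    tasks = [call_with_retry(factory, **retry_kwargs) for factory in conversation_factories]
    return await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    state = {"calls": 0}

    async def flaky_conversation():
        # Stand-in for one simulated conversation: fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionResetError(104, "Connection reset by peer")
        return "conversation finished"

    print(asyncio.run(run_all([flaky_conversation], attempts=4, base_delay=0.01)))
```

With `return_exceptions=True`, a conversation that still fails after all retries comes back as an exception object in the result list, so the remaining conversations can finish and the failed ones can be logged or re-queued afterwards.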

Screenshots

Stack trace of the exception that caused Simulator.simulate_async(...) to terminate early:

...
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_conversation/conversation.py", line 61, in simulate_conversation
    (first_response, request, _, full_response) = await bots[0].generate_response(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_conversation/conversation_bot.py", line 142, in generate_response
    response = await self.model.get_conversation_completion(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_model_tools/models.py", line 586, in get_conversation_completion
    return await self.request_api(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_model_tools/models.py", line 495, in request_api
    async with session.post(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 149, in __aenter__
    return await self._do_request()
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 138, in _do_request
    raise e
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 100, in _do_request
    response: ClientResponse = await self._request_func(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/client.py", line 605, in _request
    await resp.start(conn)
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 966, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/streams.py", line 622, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer

Additional context: We (ISE) are working with a customer to build an evaluation framework on this package, in order to stress-test their new AOAI-based CoPilot chat-bot. Early termination of the simulator breaks the pipeline that generates their synthetic evaluation data.

@xiaolul

Azure / azure-sdk-for-python, issue #34961