Open RyanMarten opened 2 hours ago
Also check the logic here....
Now that we write batch_objects.jsonl iteratively, just checking that the file exists doesn't mean we can skip batch submission. Instead we need to resume by reading its metadata to determine which batches have been submitted and which ones haven't.
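A resume check along these lines could look like the sketch below. The record layout is an assumption for illustration (a `metadata.request_file` field linking each batch object back to its request file), not necessarily the actual schema:

```python
import json
from pathlib import Path


def batches_to_submit(request_files, batch_objects_path):
    """Return the request files that still need batch submission.

    Rather than skipping submission whenever batch_objects.jsonl exists,
    read it and collect which request files already have a submitted
    batch, then submit only the remainder.
    """
    submitted = set()
    path = Path(batch_objects_path)
    if path.exists():
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            obj = json.loads(line)
            # Hypothetical field: each record stores its originating
            # request file under metadata.request_file.
            submitted.add(obj.get("metadata", {}).get("request_file"))
    return [f for f in request_files if f not in submitted]
```

With that, a restart would re-submit only the missing batches instead of all of them or none of them.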
I'm seeing a bunch of these timeouts as we try to upload all the batches:

```
(_Completions pid=125780, ip=10.120.0.7) INFO:openai._base_client:Retrying request to /files in 0.839617 seconds
(_Completions pid=125780, ip=10.120.0.7) INFO:openai._base_client:Retrying request to /files in 0.884488 seconds
(_Completions pid=125780, ip=10.120.0.7) INFO:openai._base_client:Retrying request to /files in 0.821620 seconds
```
This is what killed the job:

```
    batch_file_upload = await async_client.files.create(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/resources/files.py", line 443, in create
    return await self._post(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1839, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1533, in request
    return await self._request(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1581, in _request
    return await self._retry_request(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1581, in _request
    return await self._retry_request(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1591, in _request
    raise APITimeoutError(request=request) from err
openai.APITimeoutError: Request timed out.
```
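One way to stop hundreds of simultaneous uploads from timing each other out is to cap in-flight uploads with a semaphore and retry timeouts with jittered exponential backoff. This is a generic sketch, not curator's actual code; the upload callable, concurrency limit, and exception type are placeholders:

```python
import asyncio
import random


async def upload_with_retry(upload, payload, *, retries=5, base_delay=1.0):
    """Retry an async upload on timeout with jittered exponential backoff."""
    for attempt in range(retries):
        try:
            return await upload(payload)
        except asyncio.TimeoutError:
            if attempt == retries - 1:
                raise
            # Backoff grows 1x, 2x, 4x, ... of base_delay, plus jitter.
            await asyncio.sleep(base_delay * 2**attempt + random.random() * base_delay)


async def upload_all(upload, payloads, *, max_concurrency=8, **retry_kwargs):
    """Upload payloads with at most max_concurrency requests in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(payload):
        async with sem:
            return await upload_with_retry(upload, payload, **retry_kwargs)

    return await asyncio.gather(*(guarded(p) for p in payloads))
```

In a real run the placeholder `upload` would wrap `async_client.files.create(...)`, and the OpenAI client's own `timeout`/`max_retries` settings could be raised as well.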
There are 100+ batches successfully submitted and visible in the dashboard. I wonder if we can recover and just use those.
When this happens, there is no batch_objects.jsonl in the cache. It should be written incrementally during batch creation, so that if some of the batch submissions fail we still have a partial record we can use.
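Appending each batch object to disk as soon as it is created would guarantee a partial file survives a mid-run failure. A minimal sketch, where `submit` stands in for whatever call creates one batch and returns a JSON-serializable object:

```python
import json


def submit_batches(request_files, submit, batch_objects_path):
    """Submit each request file and append the resulting batch object to
    batch_objects.jsonl immediately, so a crash partway through still
    leaves a usable partial record on disk.
    """
    batch_objects = []
    # Open in append mode so a resumed run extends the existing file.
    with open(batch_objects_path, "a") as f:
        for request_file in request_files:
            batch_object = submit(request_file)  # placeholder for the real submission call
            f.write(json.dumps(batch_object) + "\n")
            f.flush()  # persist each record before moving on
            batch_objects.append(batch_object)
    return batch_objects
```

If the third submission throws, the first two records are already on disk, which is exactly what the resume logic above the timeout logs needs.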