Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

During prepdocs.py, openai.BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference... #1134

Open Mshz2 opened 9 months ago

Mshz2 commented 9 months ago

This issue is for a:

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run data ingestion with the command bash ./scripts/prepdocs.sh

Any log messages given by the failure

Converting page 27 to image and uploading -> document.png
Batch Completed. Batch size  13 Token count 8077
Traceback (most recent call last):
File "/home/azure-search-openai-demo/./scripts/prepdocs.py", line 310, in <module>
loop.run_until_complete(main(file_strategy, azd_credential, args))
File "/home/anaconda3/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/azure-search-openai-demo/./scripts/prepdocs.py", line 160, in main
await strategy.run(search_info)
File "/home/azure-search-openai-demo/scripts/prepdocslib/filestrategy.py", line 76, in run
await search_manager.update_content(sections, blob_image_embeddings)
File "/home/azure-search-openai-demo/scripts/prepdocslib/searchmanager.py", line 170, in update_content
embeddings = await self.embeddings.create_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azure-search-openai-demo/scripts/prepdocslib/embeddings.py", line 118, in create_embeddings
return await self.create_embedding_batch(texts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azure-search-openai-demo/scripts/prepdocslib/embeddings.py", line 89, in create_embedding_batch
async for attempt in AsyncRetrying(
File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/tenacity/_asyncio.py", line 71, in __anext__
do = self.iter(retry_state=self._retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
return fut.result()
^^^^^^^^^^^^
File "/home/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/azureuser/maintenance/azure-search-openai-demo/scripts/prepdocslib/embeddings.py", line 96, in create_embedding_batch
emb_response = await client.embeddings.create(model=self.open_ai_model_name, input=batch.texts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/openai/resources/embeddings.py", line 198, in create
return await self._post(
^^^^^^^^^^^^^^^^^
File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1542, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1316, in request
return await self._request(
^^^^^^^^^^^^^^^^^^^^
File "/home/azure-search-openai-demo/scripts/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1368, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

Expected/desired behavior

OS and Version?

Ubuntu 20.04

azd version?

azd version 1.5.1

Versions

Python 3.10.13

mattgotteiner commented 9 months ago

Hi @Mshz2,

Thanks for reporting this issue. Are you using the sample data or your own custom data set? Are you using OpenAI embeddings or Azure OpenAI embeddings?

Mshz2 commented 9 months ago

@mattgotteiner Hi, thanks for the reply. I am using my own PDFs. Some of them have more than 100 pages. I'm using Azure OpenAI.

mattgotteiner commented 9 months ago

Thanks. It's possible that a single document is causing this error. We'll have to file a follow-up issue to skip documents that hit this error, and then you can retry.
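A minimal sketch of the skip-and-retry idea, not the repository's actual implementation. The names `ingest_all`, `fake_ingest`, and `EmbeddingRequestError` are hypothetical; the exception class is a stand-in for `openai.BadRequestError` so that the example runs without the openai package installed.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


class EmbeddingRequestError(Exception):
    """Hypothetical stand-in for openai.BadRequestError."""


async def ingest_all(
    paths: List[str],
    ingest_one: Callable[[str], Awaitable[None]],
) -> Tuple[List[str], List[str]]:
    """Ingest each document in turn; record failures instead of aborting the run."""
    succeeded: List[str] = []
    failed: List[str] = []
    for path in paths:
        try:
            await ingest_one(path)
            succeeded.append(path)
        except EmbeddingRequestError as exc:
            # Log and continue so one bad document does not stop the whole batch.
            print(f"Skipping {path}: {exc}")
            failed.append(path)
    return succeeded, failed


async def fake_ingest(path: str) -> None:
    """Hypothetical ingestion step that fails for one specific file."""
    if path == "bad.pdf":
        raise EmbeddingRequestError("'$.input' is invalid")


succeeded, failed = asyncio.run(ingest_all(["a.pdf", "bad.pdf", "b.pdf"], fake_ingest))
```

The `failed` list can then be inspected or passed to a later retry run once the underlying cause is fixed.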

singloudly90 commented 7 months ago

@mattgotteiner I am also having the same issue. Some documents are fine. I'm wondering what kind of document causes this issue:

raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

tchmahe commented 7 months ago

I have the same problem. This may be related to the switch to the new client library azure-ai-documentintelligence==1.0.0b1. With older releases or the local parser it works.

czumbiehl commented 7 months ago

I have the same problem. Did someone find the solution? I can't use some of my PDF files. Changing the version of Azure did not work. Could blank spaces at the beginning of the PDF cause it? I'd appreciate any help. Thanks a lot!

drajinvites82 commented 6 months ago

I have the same issue, and it only happens for some files. I am not sure whether the images or tables in the files are causing it, but I see no specific pattern across the affected files. Was anyone able to resolve the issue? Any tips would be helpful!

drajinvites82 commented 6 months ago

Here is an update on this issue for everyone. The solution below from pamelafox works:

Update: This error is happening when we pass a text of length 0 (an empty string) to the batch embeddings API. The single embedding API is fine with that input, but the batch embedding API is not. (See https://github.com/openai/openai-python/issues/576)

Now, I don't know yet why we have sections that have 0 text in them, as I don't expect that to happen in most cases (possibly for GPT4-vision, but this occurs with vision disabled as well). I'm going to ask @tonybaloney to see if it was related to the recent splitting change.

As another workaround, you can put this code in create_embedding_batch:

    # replace any empty strings with whitespace for now
    batch.texts = [text if text else " " for text in batch.texts]
    emb_response = await client.embeddings.create(model=self.open_ai_model_name, input=batch.texts)

The batch embedding endpoint seems fine with a whitespace string.
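The substitution in the workaround can be tried in isolation, without calling the embeddings API. The following is a minimal sketch; the helper name `replace_empty_texts` is hypothetical and not part of the repository. It also returns the indices of the empty inputs, which may help identify which sections of a document were empty.

```python
from typing import List, Tuple


def replace_empty_texts(texts: List[str], placeholder: str = " ") -> Tuple[List[str], List[int]]:
    """Return a copy of texts with empty strings replaced by placeholder,
    plus the indices of the entries that were empty."""
    empty_indices = [i for i, text in enumerate(texts) if not text]
    cleaned = [text if text else placeholder for text in texts]
    return cleaned, empty_indices


texts = ["first chunk", "", "third chunk", ""]
cleaned, empty_indices = replace_empty_texts(texts)
print(cleaned)        # list that would be sent as `input` to the batch embeddings call
print(empty_indices)  # positions of the sections that had no text
```

Because the original list is left unchanged, the empty sections can still be logged or investigated after the request is sent.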

https://github.com/Azure-Samples/azure-search-openai-demo/issues/1415

tonybaloney commented 6 months ago

A proper fix was merged into the code base about 2 weeks ago. Please pull or download the latest release.