Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Running Prepdocs.sh in ADO Pipeline does not mirror development behaviour (instead of using ADLS as source it uses "local files") #1541

Closed DSOTM-RSA closed 4 months ago

DSOTM-RSA commented 4 months ago

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Set up ADLS as the source for documents, including defining the required (and optional) environment variables: AZURE_ADLS_GEN2_FILESYSTEM, AZURE_ADLS_GEN2_FILESYSTEM_PATH, AZURE_ADLS_GEN2_STORAGE_ACCOUNT, and AZURE_DATALAKE_KEY. Locally, after running either azd auth login or azd auth login --client-id --tenant-id --client-secret, run scripts/prepdocs.sh. This successfully processes the files in ADLS and adds them to the search index. prepdocs.sh is defined as follows:

#!/bin/sh
. ./scripts/loadenv.sh

echo 'Running "prepdocs.py"'

if [ -n "$AZURE_ADLS_GEN2_STORAGE_ACCOUNT" ]; then
  adlsGen2StorageAccountArg="--datalakestorageaccount $AZURE_ADLS_GEN2_STORAGE_ACCOUNT"
  adlsGen2FilesystemPathArg=""
  if [ -n "$AZURE_ADLS_GEN2_FILESYSTEM_PATH" ]; then
    adlsGen2FilesystemPathArg="--datalakepath $AZURE_ADLS_GEN2_FILESYSTEM_PATH"
  fi
  adlsGen2FilesystemArg=""
  if [ -n "$AZURE_ADLS_GEN2_FILESYSTEM" ]; then
    adlsGen2FilesystemArg="--datalakefilesystem $AZURE_ADLS_GEN2_FILESYSTEM"
  fi
  aclArg="--useacls"
fi

if [ -n "$AZURE_SEARCH_ANALYZER_NAME" ]; then
  searchAnalyzerNameArg="--searchanalyzername $AZURE_SEARCH_ANALYZER_NAME"
fi

if [ -n "$AZURE_USE_AUTHENTICATION" ]; then
  aclArg="--useacls"
fi

visionEndpointArg=""
if [ -n "$AZURE_VISION_ENDPOINT" ]; then
  visionEndpointArg="--visionendpoint $AZURE_VISION_ENDPOINT"
fi

keyVaultName=""
if [ -n "$AZURE_KEY_VAULT_NAME" ]; then
  keyVaultName="--keyvaultname $AZURE_KEY_VAULT_NAME"
fi

searchSecretNameArg=""
if [ -n "$AZURE_SEARCH_SECRET_NAME" ]; then
  searchSecretNameArg="--searchsecretname $AZURE_SEARCH_SECRET_NAME"
fi

if [ "$USE_GPT4V" = true ]; then
  searchImagesArg="--searchimages"
fi

if [ "$USE_VECTORS" = false ]; then
  disableVectorsArg="--novectors"
fi

if [ -n "$AZURE_OPENAI_EMB_DIMENSIONS" ]; then
  openAiDimensionsArg="--openaidimensions $AZURE_OPENAI_EMB_DIMENSIONS"
fi

if [ "$USE_LOCAL_PDF_PARSER" = true ]; then
  localPdfParserArg="--localpdfparser"
fi

if [ "$USE_LOCAL_HTML_PARSER" = true ]; then
  localHtmlParserArg="--localhtmlparser"
fi

if [ -n "$AZURE_TENANT_ID" ]; then
  tenantArg="--tenantid $AZURE_TENANT_ID"
fi

if [ -n "$AZURE_DATALAKE_KEY" ]; then
  datalakeArg="--datalakekey $AZURE_DATALAKE_KEY"
fi

if [ -n "$USE_FEATURE_INT_VECTORIZATION" ]; then
  integratedVectorizationArg="--useintvectorization $USE_FEATURE_INT_VECTORIZATION"
fi

./scripts/.venv/bin/python ./scripts/prepdocs.py './data/*' --verbose \
--subscriptionid "$AZURE_SUBSCRIPTION_ID"  \
--storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER" --storageresourcegroup "$AZURE_STORAGE_RESOURCE_GROUP" \
--searchservice "$AZURE_SEARCH_SERVICE" --index "$AZURE_SEARCH_INDEX" \
$searchAnalyzerNameArg $searchSecretNameArg \
--openaihost "$OPENAI_HOST" --openaimodelname "$AZURE_OPENAI_EMB_MODEL_NAME" $openAiDimensionsArg \
--openaiservice "$AZURE_OPENAI_SERVICE" --openaideployment "$AZURE_OPENAI_EMB_DEPLOYMENT"  \
--openaikey "$OPENAI_API_KEY" --openaiorg "$OPENAI_ORGANIZATION" \
--documentintelligenceservice "$AZURE_DOCUMENTINTELLIGENCE_SERVICE" \
$searchImagesArg $visionEndpointArg \
$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg \
$tenantArg $aclArg \
$disableVectorsArg $localPdfParserArg $localHtmlParserArg \
$keyVaultName \
$integratedVectorizationArg \
$datalakeArg
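
In the script above, every ADLS argument is built only when AZURE_ADLS_GEN2_STORAGE_ACCOUNT is non-empty. A minimal sketch of that decision, assuming (as the log line "Using local files: ./data/*" suggests) that prepdocs.py falls back to local files when no data lake storage account is passed; `document_source` is a hypothetical helper and not part of the repository:

```shell
#!/bin/sh
# Hypothetical helper: report which document source prepdocs.sh would select,
# mirroring its `[ -n "$AZURE_ADLS_GEN2_STORAGE_ACCOUNT" ]` check.
document_source() {
  if [ -n "$AZURE_ADLS_GEN2_STORAGE_ACCOUNT" ]; then
    echo "adls"
  else
    echo "local"
  fi
}

echo "Document source: $(document_source)"
```

Running this in both the dev container and the pipeline step would show whether the two environments see the same value for the variable.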


Any log messages given by the failure

2024-04-16T14:47:48.9656027Z ##[debug]Evaluating condition for step: 'Bash'
2024-04-16T14:47:48.9657445Z ##[debug]Evaluating: SucceededNode()
2024-04-16T14:47:48.9657759Z ##[debug]Evaluating SucceededNode:
2024-04-16T14:47:48.9658617Z ##[debug]=> True
2024-04-16T14:47:48.9659044Z ##[debug]Result: True
2024-04-16T14:47:48.9659322Z ##[section]Starting: Bash
2024-04-16T14:47:48.9665442Z ==============================================================================
2024-04-16T14:47:48.9665642Z Task : Bash
2024-04-16T14:47:48.9665703Z Description : Run a Bash script on macOS, Linux, or Windows
2024-04-16T14:47:48.9665788Z Version : 3.237.1
2024-04-16T14:47:48.9665855Z Author : Microsoft Corporation
2024-04-16T14:47:48.9665991Z Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/utility/bash
2024-04-16T14:47:48.9666091Z ==============================================================================
2024-04-16T14:47:49.0638100Z ##[debug]Using node path: /home/vsts/agents/3.238.0/externals/node20_1/bin/node
2024-04-16T14:47:49.1489107Z ##[debug]agent.TempDirectory=/home/vsts/work/_temp
2024-04-16T14:47:49.1502398Z ##[debug]loading inputs and endpoints
2024-04-16T14:47:49.1531128Z ##[debug]loading INPUT_TARGETTYPE
2024-04-16T14:47:49.1531478Z ##[debug]loading INPUT_FILEPATH
2024-04-16T14:47:49.1531763Z ##[debug]loading INPUT_SCRIPT
2024-04-16T14:47:49.1532044Z ##[debug]loading INPUT_WORKINGDIRECTORY
2024-04-16T14:47:49.1532329Z ##[debug]loading INPUT_FAILONSTDERR
2024-04-16T14:47:49.1532623Z ##[debug]loading ENDPOINT_AUTH_SYSTEMVSSCONNECTION
2024-04-16T14:47:49.1532932Z ##[debug]loading ENDPOINT_AUTH_SCHEME_SYSTEMVSSCONNECTION
2024-04-16T14:47:49.1533783Z ##[debug]loading ENDPOINT_AUTH_PARAMETER_SYSTEMVSSCONNECTION_ACCESSTOKEN
2024-04-16T14:47:49.1542346Z ##[debug]loading SECRET_SYSTEM_ACCESSTOKEN
2024-04-16T14:47:49.1545909Z ##[debug]loaded 9
2024-04-16T14:47:49.1551989Z ##[debug]Agent.ProxyUrl=undefined
2024-04-16T14:47:49.1553697Z ##[debug]Agent.CAInfo=undefined
2024-04-16T14:47:49.1554922Z ##[debug]Agent.ClientCert=undefined
2024-04-16T14:47:49.1555347Z ##[debug]Agent.SkipCertValidation=undefined
2024-04-16T14:47:49.1577044Z ##[debug]check path : /home/vsts/work/_tasks/Bash_6c731c3c-3c68-459a-a5c9-bde6e6595b5b/3.237.1/task.json
2024-04-16T14:47:49.1580498Z ##[debug]adding resource file: /home/vsts/work/_tasks/Bash_6c731c3c-3c68-459a-a5c9-bde6e6595b5b/3.237.1/task.json
2024-04-16T14:47:49.1580850Z ##[debug]system.culture=en-US
2024-04-16T14:47:49.1606438Z ##[debug]failOnStderr=false
2024-04-16T14:47:49.1606760Z ##[debug]workingDirectory=/home/vsts/work/1/s
2024-04-16T14:47:49.1607055Z ##[debug]check path : /home/vsts/work/1/s
2024-04-16T14:47:49.1607333Z ##[debug]targetType=filePath
2024-04-16T14:47:49.1607591Z ##[debug]bashEnvValue=undefined
2024-04-16T14:47:49.1607883Z ##[debug]filePath=/home/vsts/work/1/s/scripts/prepdocs.sh
2024-04-16T14:47:49.1608173Z ##[debug]arguments=undefined
2024-04-16T14:47:49.1614601Z Generating script.
2024-04-16T14:47:49.1619607Z ##[debug]which 'bash'
2024-04-16T14:47:49.1626695Z ##[debug]found: '/usr/bin/bash'
2024-04-16T14:47:49.1627913Z ##[debug]Feature flag AZP_75787_ENABLE_NEW_LOGIC_LOG = false
2024-04-16T14:47:49.1628665Z ##[debug]Feature flag AZP_75787_ENABLE_NEW_LOGIC = false
2024-04-16T14:47:49.1629367Z ##[debug]Feature flag AZP_75787_ENABLE_COLLECT = true
2024-04-16T14:47:49.1630168Z ##[debug]Validating file args...
2024-04-16T14:47:49.1632831Z ##[debug]Expanded file args:
2024-04-16T14:47:49.1634990Z Formatted command: exec bash '/home/vsts/work/1/s/scripts/prepdocs.sh'
2024-04-16T14:47:49.1635977Z ##[debug]Agent.Version=3.238.0
2024-04-16T14:47:49.1637221Z ##[debug]agent.tempDirectory=/home/vsts/work/_temp
2024-04-16T14:47:49.1637522Z ##[debug]check path : /home/vsts/work/_temp
2024-04-16T14:47:49.1639528Z ========================== Starting Command Output ===========================
2024-04-16T14:47:49.1640797Z ##[debug]which '/usr/bin/bash'
2024-04-16T14:47:49.1641921Z ##[debug]found: '/usr/bin/bash'
2024-04-16T14:47:49.1643778Z ##[debug]/usr/bin/bash arg: /home/vsts/work/_temp/27cacee6-ef2e-4b32-aa61-76c30d6fc288.sh
2024-04-16T14:47:49.1714232Z ##[debug]exec tool: /usr/bin/bash
2024-04-16T14:47:49.1714707Z ##[debug]arguments:
2024-04-16T14:47:49.1715010Z ##[debug] /home/vsts/work/_temp/27cacee6-ef2e-4b32-aa61-76c30d6fc288.sh
2024-04-16T14:47:49.1715597Z [command]/usr/bin/bash /home/vsts/work/_temp/27cacee6-ef2e-4b32-aa61-76c30d6fc288.sh
2024-04-16T14:47:49.1789773Z Loading azd .env file from current environment...
2024-04-16T14:47:49.2055641Z Creating Python virtual environment "scripts/.venv"...
2024-04-16T14:47:53.0988492Z Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
2024-04-16T14:47:53.9755457Z ##[debug]Agent environment resources - Disk: / Available 20775.00 MB out of 74244.00 MB, Memory: Used 738.00 MB out of 6921.00 MB, CPU: Usage 45.97%
2024-04-16T14:47:58.9776299Z ##[debug]Agent environment resources - Disk: / Available 20719.00 MB out of 74244.00 MB, Memory: Used 778.00 MB out of 6921.00 MB, CPU: Usage 35.53%
2024-04-16T14:48:03.9809889Z ##[debug]Agent environment resources - Disk: / Available 20606.00 MB out of 74244.00 MB, Memory: Used 810.00 MB out of 6921.00 MB, CPU: Usage 29.03%
2024-04-16T14:48:08.9854241Z ##[debug]Agent environment resources - Disk: / Available 20484.00 MB out of 74244.00 MB, Memory: Used 829.00 MB out of 6921.00 MB, CPU: Usage 24.54%
2024-04-16T14:48:13.9901078Z ##[debug]Agent environment resources - Disk: / Available 20338.00 MB out of 74244.00 MB, Memory: Used 875.00 MB out of 6921.00 MB, CPU: Usage 21.26%
2024-04-16T14:48:18.9950834Z ##[debug]Agent environment resources - Disk: / Available 20284.00 MB out of 74244.00 MB, Memory: Used 877.00 MB out of 6921.00 MB, CPU: Usage 18.78%
2024-04-16T14:48:19.8602096Z Running "prepdocs.py"
2024-04-16T14:48:22.7670281Z Using local files: ./data/*
2024-04-16T14:48:22.7671047Z Ensuring search index exists
2024-04-16T14:48:23.3638337Z Traceback (most recent call last):
2024-04-16T14:48:23.3639100Z File "/usr/lib/python3.10/encodings/idna.py", line 163, in encode
2024-04-16T14:48:23.3639793Z raise UnicodeError("label empty or too long")
2024-04-16T14:48:23.3640083Z UnicodeError: label empty or too long
2024-04-16T14:48:23.3640151Z
2024-04-16T14:48:23.3640432Z The above exception was the direct cause of the following exception:
2024-04-16T14:48:23.3640523Z
2024-04-16T14:48:23.3640657Z Traceback (most recent call last):
2024-04-16T14:48:23.3640821Z File "/home/vsts/work/1/s/./scripts/prepdocs.py", line 494, in
2024-04-16T14:48:23.3641040Z loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
2024-04-16T14:48:23.3641286Z File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
2024-04-16T14:48:23.3641452Z return future.result()
2024-04-16T14:48:23.3641625Z File "/home/vsts/work/1/s/./scripts/prepdocs.py", line 223, in main
2024-04-16T14:48:23.3641775Z await strategy.setup()
2024-04-16T14:48:23.3641953Z File "/home/vsts/work/1/s/scripts/prepdocslib/filestrategy.py", line 74, in setup
2024-04-16T14:48:23.3642137Z await search_manager.create_index()
2024-04-16T14:48:23.3642323Z File "/home/vsts/work/1/s/scripts/prepdocslib/searchmanager.py", line 179, in create_index
2024-04-16T14:48:23.3642566Z if self.search_info.index_name not in [name async for name in search_index_client.list_index_names()]:
2024-04-16T14:48:23.3642794Z File "/home/vsts/work/1/s/scripts/prepdocslib/searchmanager.py", line 179, in
2024-04-16T14:48:23.3643039Z if self.search_info.index_name not in [name async for name in search_index_client.list_index_names()]:
2024-04-16T14:48:23.3643625Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/async_paging.py", line 142, in __anext__
2024-04-16T14:48:23.3643835Z return await self.__anext__()
2024-04-16T14:48:23.3644161Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/async_paging.py", line 145, in __anext__
2024-04-16T14:48:23.3644376Z self._page = await self._page_iterator.__anext__()
2024-04-16T14:48:23.3645234Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/async_paging.py", line 94, in __anext__
2024-04-16T14:48:23.3645479Z self._response = await self._get_next(self.continuation_token)
2024-04-16T14:48:23.3645910Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/search/documents/indexes/_generated/aio/operations/_indexes_operations.py", line 292, in get_next
2024-04-16T14:48:23.3646323Z pipeline_response: PipelineResponse = await self._client._pipeline.run( # pylint: disable=protected-access
2024-04-16T14:48:23.3646727Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 221, in run
2024-04-16T14:48:23.3646941Z return await first_node.send(pipeline_request)
2024-04-16T14:48:23.3647295Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
2024-04-16T14:48:23.3647507Z response = await self.next.send(request)
2024-04-16T14:48:23.3647984Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
2024-04-16T14:48:23.3648194Z response = await self.next.send(request)
2024-04-16T14:48:23.3648542Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
2024-04-16T14:48:23.3648751Z response = await self.next.send(request)
2024-04-16T14:48:23.3648906Z [Previous line repeated 2 more times]
2024-04-16T14:48:23.3649269Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/policies/_redirect_async.py", line 73, in send
2024-04-16T14:48:23.3649495Z response = await self.next.send(request)
2024-04-16T14:48:23.3649860Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/policies/_retry_async.py", line 179, in send
2024-04-16T14:48:23.3650083Z response = await self.next.send(request)
2024-04-16T14:48:23.3650819Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/policies/_authentication_async.py", line 100, in send
2024-04-16T14:48:23.3651057Z response = await self.next.send(request)
2024-04-16T14:48:23.3651389Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
2024-04-16T14:48:23.3651614Z response = await self.next.send(request)
2024-04-16T14:48:23.3651944Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
2024-04-16T14:48:23.3652163Z response = await self.next.send(request)
2024-04-16T14:48:23.3652493Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
2024-04-16T14:48:23.3652721Z response = await self.next.send(request)
2024-04-16T14:48:23.3652873Z [Previous line repeated 2 more times]
2024-04-16T14:48:23.3653212Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 106, in send
2024-04-16T14:48:23.3653482Z await self._sender.send(request.http_request, **request.context.options),
2024-04-16T14:48:23.3653858Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 294, in send
2024-04-16T14:48:23.3654098Z result = await self.session.request( # type: ignore
2024-04-16T14:48:23.3654423Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 578, in _request
2024-04-16T14:48:23.3654638Z conn = await self._connector.connect(
2024-04-16T14:48:23.3654956Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/aiohttp/connector.py", line 544, in connect
2024-04-16T14:48:23.3655193Z proto = await self._create_connection(req, traces, timeout)
2024-04-16T14:48:23.3655556Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/aiohttp/connector.py", line 911, in _create_connection
2024-04-16T14:48:23.3655884Z _, proto = await self._create_direct_connection(req, traces, timeout)
2024-04-16T14:48:23.3656288Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/aiohttp/connector.py", line 1173, in _create_direct_connection
2024-04-16T14:48:23.3656510Z hosts = await asyncio.shield(host_resolved)
2024-04-16T14:48:23.3656860Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/aiohttp/connector.py", line 884, in _resolve_host
2024-04-16T14:48:23.3657117Z addrs = await self._resolver.resolve(host, port, family=self._family)
2024-04-16T14:48:23.3657488Z File "/home/vsts/work/1/s/scripts/.venv/lib/python3.10/site-packages/aiohttp/resolver.py", line 33, in resolve
2024-04-16T14:48:23.3657689Z infos = await self._loop.getaddrinfo(
2024-04-16T14:48:23.3657880Z File "/usr/lib/python3.10/asyncio/base_events.py", line 863, in getaddrinfo
2024-04-16T14:48:23.3658062Z return await self.run_in_executor(
2024-04-16T14:48:23.3658310Z File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-04-16T14:48:23.3658508Z result = self.fn(*self.args, **self.kwargs)
2024-04-16T14:48:23.3658677Z File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
2024-04-16T14:48:23.3658888Z for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
2024-04-16T14:48:23.3659227Z UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)
2024-04-16T14:48:23.6651495Z
2024-04-16T14:48:23.6652629Z ##[debug]Exit code 1 received from tool '/usr/bin/bash'
2024-04-16T14:48:23.6653550Z ##[debug]STDIO streams have closed for tool '/usr/bin/bash'
2024-04-16T14:48:23.6694985Z ##[error]Bash exited with code '1'.
2024-04-16T14:48:23.6705510Z ##[debug]Processed: ##vso[task.issue type=error;source=TaskInternal;]Bash exited with code '1'.
2024-04-16T14:48:23.6706247Z ##[debug]task result: Failed
2024-04-16T14:48:23.6707583Z ##[debug]Processed: ##vso[task.complete result=Failed;done=true;]
2024-04-16T14:48:23.6709742Z ##[section]Finishing: Bash

Expected/desired behavior

Rather than entering the local-files branch, prepdocs should process the files in ADLS. The secondary issue (the UnicodeError raised while ensuring the search index exists) is of lower priority if consistent behaviour can be achieved in ADO.
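
The `UnicodeError: label empty or too long` in the log is raised during host name resolution, which is consistent with a host name that contains an empty DNS label. One possible cause, assuming the search endpoint host is built as `<service>.search.windows.net` from AZURE_SEARCH_SERVICE (an assumption, since the issue does not show that code), is that the variable was empty in the pipeline. A small sketch of a check for that case; `search_host` and `has_empty_label` are hypothetical helpers:

```shell
#!/bin/sh
# Hypothetical helper: build the host name a search client would resolve,
# ASSUMING the endpoint follows the pattern <service>.search.windows.net.
search_host() {
  echo "$1.search.windows.net"
}

# Return 0 if the host contains an empty DNS label
# (a leading dot or two consecutive dots), 1 otherwise.
has_empty_label() {
  case "$1" in
    .*|*..*) return 0 ;;
    *) return 1 ;;
  esac
}

host=$(search_host "${AZURE_SEARCH_SERVICE:-}")
if has_empty_label "$host"; then
  echo "Suspicious host name: '$host' (is AZURE_SEARCH_SERVICE set?)" >&2
fi
```

If the warning appears in the pipeline but not locally, the search service name is probably not reaching the script either, independent of the ADLS settings.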

OS and Version?

Windows 10, VSCode Container (locally). ADO Pipelines.

azd version?

1.7.0

pamelafox commented 4 months ago

It looks like the ADO pipeline isn't setting the ADLS2 variables correctly. We probably added both the ADO pipeline and ADLS2 variables at a similar time and missed them in the merge. I'll send a PR that hopefully fixes it. Thanks for filing!
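
Until the pipeline passes those variables through, a preflight check can make the gap visible before prepdocs.py starts. This is a rough sketch, not the fix from the PR: `missing_vars` is a hypothetical helper, and the list of variables it checks is taken from the issue text (which of them are strictly required is an assumption).

```shell
#!/bin/sh
# Hypothetical preflight helper: print the name of every listed
# variable that is unset or empty, one per line.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Variables named in the issue for ADLS ingestion (assumed to be required here).
missing=$(missing_vars AZURE_ADLS_GEN2_STORAGE_ACCOUNT AZURE_ADLS_GEN2_FILESYSTEM AZURE_DATALAKE_KEY)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on a single line.
  echo "Missing ADLS settings:" $missing >&2
fi
```

In a pipeline step, following the warning with `exit 1` would stop the run before it silently falls back to local files.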