aws-samples / aws-genai-llm-chatbot

A modular and comprehensive solution to deploy a Multi-LLM and Multi-RAG powered chatbot (Amazon Bedrock, Anthropic, HuggingFace, OpenAI, Meta, AI21, Cohere, Mistral) using AWS CDK on AWS
https://aws-samples.github.io/aws-genai-llm-chatbot/
MIT No Attribution

Error during upload of files in the Workspace with Aurora Serverless Workspace Engine #185

Closed alexeyshishkin01 closed 8 months ago

alexeyshishkin01 commented 10 months ago

How can I troubleshoot errors during the upload of files to a Workspace that uses the Aurora Serverless Workspace Engine? Currently I cannot upload pdf/json files. I tried both the sentence-transformers/all-MiniLM-L6-v2 and intfloat/multilingual-e5-large models.

alexeyshishkin01 commented 10 months ago

The same happens with the amazon.titan-embed-text-v1 model.

bigadsoleiman commented 10 months ago

We are adding documentation to this repo to facilitate such debugging.

In the meantime: file ingestion is handled by Step Functions, which trigger AWS Batch jobs.

The first step is to look for a state machine whose name is prefixed with RagEnginesDataImportFileImportWorkflow. In there you should see current and past jobs along with their status. To dig into a job's logs, open the Batch job and then its log stream.
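For anyone who prefers a script to the console, the lookup described above can be sketched roughly as follows. This is only a sketch: it assumes a boto3 Step Functions client is passed in, and the function names are made up for illustration. Only the name prefix comes from this thread.

```python
from typing import Any, Dict, List

# Prefix taken from the comment above; the suffix differs per deployment.
PREFIX = "RagEnginesDataImportFileImportWorkflow"


def find_state_machine_arns(sfn_client: Any, prefix: str = PREFIX) -> List[str]:
    """Return ARNs of state machines whose name starts with `prefix`.

    `sfn_client` is assumed to be boto3.client("stepfunctions").
    """
    arns: List[str] = []
    token = None
    while True:
        kwargs = {"nextToken": token} if token else {}
        page = sfn_client.list_state_machines(**kwargs)
        for machine in page.get("stateMachines", []):
            if machine["name"].startswith(prefix):
                arns.append(machine["stateMachineArn"])
        token = page.get("nextToken")
        if not token:
            return arns


def failed_executions(
    sfn_client: Any, state_machine_arn: str, limit: int = 20
) -> List[Dict[str, str]]:
    """List up to `limit` failed executions of one state machine."""
    response = sfn_client.list_executions(
        stateMachineArn=state_machine_arn,
        statusFilter="FAILED",
        maxResults=limit,
    )
    return [
        {
            "name": execution["name"],
            "executionArn": execution["executionArn"],
            "status": execution["status"],
        }
        for execution in response.get("executions", [])
    ]


# Example usage (needs AWS credentials, so it is left commented out):
# import boto3
# sfn = boto3.client("stepfunctions")
# for arn in find_state_machine_arns(sfn):
#     print(arn, failed_executions(sfn, arn))
```

Passing the client in as a parameter keeps the helpers easy to try against a stub before pointing them at a real account.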


alexeyshishkin01 commented 10 months ago

Thanks. Approximately how long will it take to fix this issue?

alexeyshishkin01 commented 10 months ago

just fyi - I used the following pdf-file for my upload tests - https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-ug.pdf

alexeyshishkin01 commented 10 months ago

I have also tested pdf-file upload with OpenSearch, and it also fails. So far, pdf-file upload fails for both Aurora and OpenSearch and works well for Kendra.

Gmreh commented 10 months ago

I experienced the same: most file uploads fail, and even the files that were uploaded successfully do not seem to be available to the models, because the answer to even general questions about the context is "I don't know". I still have to dig deeper, but it would be nice to have more information about the workflow and the reasons for the failures.

Gmreh commented 10 months ago

When the upload succeeds and I ask the models about the content, the model recognizes the pdf but says "I dont have access to the content of the file". I have not found the reason yet; help appreciated. Regards

alexeyshishkin01 commented 10 months ago

Here are my findings:

AWS Console -> Step Functions -> State machines -> 'RagEnginesDataImportFileImportWorkflowFileImportStateMachineXXXXXXX-YYYYYYYYYYYY' -> Executions -> choose your execution.
In the 'Graph View', choose the failed step 'FileImportJob', then click the 'Batch job' link to its right (just under FileImportJob); you'll be transferred to AWS Batch.
In the AWS Batch screen, under 'Job Information', click the 'Log stream name' RagEnginesDataImportFil-xxx; you'll be transferred to AWS CloudWatch.
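The last hop of that path (Batch job -> log stream -> CloudWatch) could also be scripted. The sketch below assumes boto3 Batch and CloudWatch Logs clients are passed in and that the job logs to the default AWS Batch log group `/aws/batch/job`. That log group is an assumption, so check the job's log configuration if nothing comes back. The function name is hypothetical.

```python
from typing import Any, List


def batch_job_log_lines(
    batch_client: Any,
    logs_client: Any,
    job_id: str,
    log_group: str = "/aws/batch/job",  # assumption: default AWS Batch log group
    limit: int = 100,
) -> List[str]:
    """Return the most recent log messages for one AWS Batch job.

    `batch_client` is assumed to be boto3.client("batch") and
    `logs_client` boto3.client("logs").
    """
    jobs = batch_client.describe_jobs(jobs=[job_id]).get("jobs", [])
    if not jobs:
        return []
    # The log stream name is reported on the job's container details.
    stream = jobs[0].get("container", {}).get("logStreamName")
    if not stream:
        return []
    response = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=stream,
        limit=limit,
        startFromHead=False,  # read from the end, where the traceback usually is
    )
    return [event["message"] for event in response.get("events", [])]


# Example usage (needs AWS credentials and a real job ID, so left commented out):
# import boto3
# lines = batch_job_log_lines(boto3.client("batch"), boto3.client("logs"), "<job-id>")
# print("\n".join(lines))
```

The job ID is the one shown on the Batch job page reached from the 'Batch job' link in the failed step.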

1) upload file - https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/cdk.json - the result is failed

from AWS CloudWatch:

2) upload file - https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-ug.pdf - the result is failed

from AWS CloudWatch:

alexeyshishkin01 commented 10 months ago

3) upload file - https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/README.md - the result is success

from AWS CloudWatch:

alexeyshishkin01 commented 10 months ago

I have successfully uploaded the following files:

4. https://raw.githubusercontent.com/zalandoresearch/feidegger/master/data/FEIDEGGER_release_1.2.json

5. https://github.com/datablist/sample-csv-files/raw/main/files/customers/customers-100.csv

6. https://raw.githubusercontent.com/dotnet/docs/main/docs/standard/linq/sample-xml-file-typical-purchase-order.md

ntsaini commented 10 months ago

Just FYI, I had a similar issue with a PDF file. I exported the PDF as a docx and then tried the upload again, and it processed fine. I am not sure exactly what is wrong, but this could be a temporary workaround in case you are running into the same issue.

QuinnGT commented 10 months ago

Yeah, it does the same for me with certain pdf and ppt files. Out of 300ish docs I've uploaded, it's happened on 11 of them. The common error is raised at calculate_intersection_area(element_bbox, annotation["bbox"]): ZeroDivisionError: float division by zero

Since it's using docs = loader.load() from the LangChain S3FileLoader, I don't know that there are many other options in that function. We may need to swap to something else.

alexeyshishkin01 commented 10 months ago

May I ask you to look at https://github.com/aws-samples/aurora-postgresql-pgvector/tree/main/apgpgvector-streamlit ?

I have uploaded pdf-files with it and have had no problems so far, using the same pdf-files that fail here.

QuinnGT commented 9 months ago

I found the error. It's due to using an older version of Unstructured. It was fixed in version 0.10.20

We should be on 0.11.2 anyways, so I'll test the change and push the fix soon.
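To check whether a given environment already has a release at or past the fix, a small version comparison could look like the sketch below. It assumes the package is installed under the distribution name `unstructured` and that versions are plain dotted numbers. The helper names are made up for illustration.

```python
from importlib import metadata
from typing import Optional, Tuple

FIXED_IN = "0.10.20"  # version named in the comment above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.10.20' into (0, 10, 20), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, required: str = FIXED_IN) -> bool:
    """True if `installed` is the same as or newer than `required`."""
    return parse_version(installed) >= parse_version(required)


def installed_unstructured_version() -> Optional[str]:
    """Return the installed version of 'unstructured', or None if it is absent."""
    try:
        return metadata.version("unstructured")
    except metadata.PackageNotFoundError:
        return None


version = installed_unstructured_version()
if version is None:
    print("unstructured is not installed in this environment")
else:
    print(version, "includes the fix" if is_at_least(version) else "predates the fix")
```

Comparing tuples of integers avoids the string-comparison trap where "0.10.20" would sort after "0.10.3".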

alexeyshishkin01 commented 9 months ago

This is really good news! Approximately how long will it take you to implement and test the fix?