Error during upload of files in the Workspace with Aurora Serverless Workspace Engine

alexeyshishkin01 commented 1 year ago

How can I troubleshoot errors during the upload of files to a Workspace with the Aurora Serverless Workspace Engine? Currently I cannot upload pdf/json files. I tried both the sentence-transformers/all-MiniLM-L6-v2 and intfloat/multilingual-e5-large models.

alexeyshishkin01 commented 1 year ago

the same situation with amazon.titan-embed-text-v1 model

bigadsoleiman commented 1 year ago

We are adding documentation to this repo to facilitate such debugging.

In the meantime: file ingestion is handled by Step Functions, which triggers AWS Batch jobs.

The first step is to look for a Step Functions state machine prefixed with RagEnginesDataImportFileImportWorkflow. In there you should see current and past jobs along with their status. To dig into a job's logs, open the Batch job and its log stream.
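The lookup described above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names are illustrative and not part of this repository. It finds state machines by the prefix mentioned above and lists their failed executions.

```python
PREFIX = "RagEnginesDataImportFileImportWorkflow"  # prefix quoted in this thread


def find_state_machines(sfn, prefix=PREFIX):
    """Return ARNs of state machines whose name starts with `prefix`."""
    arns, kwargs = [], {}
    while True:
        page = sfn.list_state_machines(**kwargs)
        arns += [
            m["stateMachineArn"]
            for m in page["stateMachines"]
            if m["name"].startswith(prefix)
        ]
        token = page.get("nextToken")
        if not token:
            return arns
        kwargs = {"nextToken": token}  # fetch the next page


def failed_executions(sfn, state_machine_arn, limit=10):
    """Return (name, executionArn) pairs for up to `limit` failed executions."""
    resp = sfn.list_executions(
        stateMachineArn=state_machine_arn,
        statusFilter="FAILED",
        maxResults=limit,
    )
    return [(e["name"], e["executionArn"]) for e in resp["executions"]]


def main():
    """Hypothetical entry point; needs boto3, credentials and a region."""
    import boto3

    client = boto3.client("stepfunctions")
    for arn in find_state_machines(client):
        for name, execution_arn in failed_executions(client, arn):
            print(name, execution_arn)
```

Calling main() prints one line per failed execution; the functions take the client as a parameter so they can be exercised with a stub.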

alexeyshishkin01 commented 1 year ago

thanks, approximately how long will it take to make a fix for this issue ?

alexeyshishkin01 commented 1 year ago

just fyi - I used the following pdf-file for my upload tests - https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-ug.pdf

alexeyshishkin01 commented 1 year ago

I have also tested pdf-file upload with OpenSearch, and it also fails. So far, pdf-file upload fails for both Aurora and OpenSearch and works well for Kendra.

Gmreh commented 1 year ago

I experienced the same: most file uploads fail, and even the files that were uploaded successfully do not seem to be available to the models, because the answer to even general questions about the content is "I don't know". I still have to dig into it, but it would be nice to have more info about the workflow and the reasons for the failures.

Gmreh commented 1 year ago

When the upload succeeds and I ask the models about the content, they recognize the pdf but say "I dont have access to the content of the file". I have not found the reason yet; help appreciated. Regards

alexeyshishkin01 commented 1 year ago

Here are my findings:

AWS Console -> Step Functions -> State machines -> 'RagEnginesDataImportFileImportWorkflowFileImportStateMachineXXXXXXX-YYYYYYYYYYYY' -> Executions -> choose your Execution.

In the 'Graph View', choose the failed step 'FileImportJob', then click the 'Batch job' link to the right of it (just under FileImportJob); you'll be transferred to AWS Batch.

In the AWS Batch screen, under 'Job Information', click the 'Log stream name' RagEnginesDataImportFil-xxx; you'll be transferred to AWS CloudWatch.
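The last two hops of this path (Batch job to CloudWatch log) can be sketched in Python as well. This assumes boto3 with configured credentials, and that the job logs to the default AWS Batch log group /aws/batch/job; both the log group and the helper names are assumptions, not taken from this repository.

```python
LOG_GROUP = "/aws/batch/job"  # assumption: default AWS Batch log group


def log_stream_for_job(batch, job_id):
    """Return the CloudWatch log stream name of a Batch job, or None."""
    jobs = batch.describe_jobs(jobs=[job_id])["jobs"]
    if not jobs:
        return None
    return jobs[0].get("container", {}).get("logStreamName")


def tail_messages(logs, stream_name, log_group=LOG_GROUP, limit=50):
    """Return up to `limit` of the most recent messages in the stream."""
    resp = logs.get_log_events(
        logGroupName=log_group,
        logStreamName=stream_name,
        limit=limit,
        startFromHead=False,  # read from the end, where the traceback is
    )
    return [event["message"] for event in resp["events"]]


def main(job_id):
    """Hypothetical entry point; needs boto3, credentials and a region."""
    import boto3

    stream = log_stream_for_job(boto3.client("batch"), job_id)
    if stream is None:
        print("no log stream found for", job_id)
        return
    for message in tail_messages(boto3.client("logs"), stream):
        print(message)
```

The job ID is the one shown on the AWS Batch job page reached from the 'Batch job' link.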

1) upload file - https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/cdk.json - the result is failed

from AWS CloudWatch:

Detected a JSON file that does not conform to the Unstructured schema. partition_json currently only processes serialized Unstructured output.
File "/app/main.py", line 81, in
File "/app/main.py", line 64, in main
raise error
File "/home/notebook-user/.local/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 86, in load
File "/home/notebook-user/unstructured/partition/auto.py", line 371, in partition
raise ValueError(
ValueError: Detected a JSON file that does not conform to the Unstructured schema. partition_json currently only processes serialized Unstructured output.

2) upload file - https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-ug.pdf - the result is failed

from AWS CloudWatch:

File "/app/main.py", line 81, in
File "/app/main.py", line 64, in main
raise error
File "/home/notebook-user/.local/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 86, in load
File "/home/notebook-user/.local/lib/python3.10/site-packages/langchain/document_loaders/s3_file.py", line 126, in _get_elements
File "/home/notebook-user/unstructured/partition/auto.py", line 316, in partition
File "/home/notebook-user/unstructured/documents/elements.py", line 280, in wrapper
File "/home/notebook-user/unstructured/file_utils/filetype.py", line 551, in wrapper
File "/home/notebook-user/unstructured/chunking/title.py", line 257, in wrapper
File "/home/notebook-user/unstructured/partition/pdf.py", line 155, in partition_pdf
File "/home/notebook-user/unstructured/partition/pdf.py", line 252, in partition_pdf_or_image
File "/home/notebook-user/unstructured/partition/pdf.py", line 178, in extractable_elements
File "/home/notebook-user/unstructured/utils.py", line 159, in wrapper
File "/home/notebook-user/unstructured/partition/pdf.py", line 430, in _partition_pdf_with_pdfminer
File "/home/notebook-user/unstructured/partition/pdf.py", line 509, in _process_pdfminer_pages
File "/home/notebook-user/unstructured/partition/pdf.py", line 1064, in check_annotations_within_element
calculate_intersection_area(element_bbox, annotation["bbox"])
ZeroDivisionError: float division by zero
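To illustrate how a bounding-box overlap check can end in "float division by zero", here is a small self-contained sketch. It is not the Unstructured implementation; the function names and the (x1, y1, x2, y2) box layout are assumptions made for illustration. The division fails when the annotation box has zero area, for example a degenerate link annotation, unless that case is guarded.

```python
def box_area(bbox):
    """Area of an (x1, y1, x2, y2) box; empty or inverted boxes count as 0."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_ratio(element_bbox, annotation_bbox):
    """Fraction of the annotation box that lies inside the element box."""
    overlap = (
        max(element_bbox[0], annotation_bbox[0]),
        max(element_bbox[1], annotation_bbox[1]),
        min(element_bbox[2], annotation_bbox[2]),
        min(element_bbox[3], annotation_bbox[3]),
    )
    annotation_area = box_area(annotation_bbox)
    if annotation_area == 0:
        # Without this guard the division below raises ZeroDivisionError.
        return 0.0
    return box_area(overlap) / annotation_area
```

With the guard, a zero-width annotation such as (3, 3, 3, 8) simply yields a ratio of 0.0 instead of aborting the whole import.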

alexeyshishkin01 commented 1 year ago

3) upload file - https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/README.md - the result is success

from AWS CloudWatch:

loader: <langchain.document_loaders.s3_file.S3FileLoader object at 0x7fa653753130>
{'ResponseMetadata': {'RequestId': 'B5MSQAC934U49HQPVM47P82LPRVV4KQNSO5AEMVJF66Q9ASUAAJG', 'HTTPStatusCode': 200, 'HTTPHeaders': {'server': 'Server', 'date': 'Wed, 08 Nov 2023 14:52:46 GMT', 'content-type': 'application/x-amz-json-1.0', 'content-length': '2', 'connection': 'keep-alive', 'x-amzn-requestid': 'B5MSQAC934U49HQPVM47P82LPRVV4KQNSO5AEMVJF66Q9ASUAAJG', 'x-amz-crc32': '2745614147'}, 'RetryAttempts': 0}}
{'ResponseMetadata': {'RequestId': 'N94FAV0KJQOIFK91K8IMFAODK3VV4KQNSO5AEMVJF66Q9ASUAAJG', 'HTTPStatusCode': 200, 'HTTPHeaders': {'server': 'Server', 'date': 'Wed, 08 Nov 2023 14:52:46 GMT', 'content-type': 'application/x-amz-json-1.0', 'content-length': '2', 'connection': 'keep-alive', 'x-amzn-requestid': 'N94FAV0KJQOIFK91K8IMFAODK3VV4KQNSO5AEMVJF66Q9ASUAAJG', 'x-amz-crc32': '2745614147'}, 'RetryAttempts': 0}}

alexeyshishkin01 commented 1 year ago

I have successfully uploaded the following files:

4. https://raw.githubusercontent.com/zalandoresearch/feidegger/master/data/FEIDEGGER_release_1.2.json

5. https://github.com/datablist/sample-csv-files/raw/main/files/customers/customers-100.csv

6. https://raw.githubusercontent.com/dotnet/docs/main/docs/standard/linq/sample-xml-file-typical-purchase-order.md

ntsaini commented 1 year ago

Just FYI, I had a similar issue with a PDF file. I exported the PDF as a docx and then tried to upload it, and it processed fine. Not sure exactly what is wrong, but this could be a temporary workaround in case you are running into the same issue.

QuinnGT commented 1 year ago

Yeah it does the same for me with certain pdf and ppt files. Out of 300ish docs I've uploaded, it's happened on 11 of them. The common error is calculate_intersection_area(element_bbox, annotation["bbox"]) ZeroDivisionError: float division by zero

Since it's using docs = loader.load() from langchain's S3FileLoader, I don't think there are many other options in that function. We may need to swap to something else.
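One way to "swap to something else" only when needed is to try loaders in order and fall back on failure. The sketch below is generic and hypothetical: load_with_fallback and the loader callables are illustrative names, not langchain or repository APIs. Each loader is any callable that takes a source and returns documents.

```python
def load_with_fallback(source, loaders):
    """Try each (name, loader) pair in order.

    Returns (name, documents) from the first loader that succeeds and
    raises RuntimeError listing every failure if none does.
    """
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(source)
        except Exception as exc:  # keep going, but remember why it failed
            errors.append(f"{name}: {type(exc).__name__}: {exc}")
    raise RuntimeError(f"all loaders failed for {source!r}: " + "; ".join(errors))
```

In the import job, the first entry could wrap the existing loader.load() call and the second a different PDF parser, so only the few problematic files take the slower path.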

alexeyshishkin01 commented 1 year ago

May I ask you to look at the https://github.com/aws-samples/aurora-postgresql-pgvector/tree/main/apgpgvector-streamlit ?

I have uploaded pdf-files with it and so far have had no problems with the same pdf-files that fail here.

QuinnGT commented 11 months ago

I found the error. It's due to using an older version of Unstructured. It was fixed in version 0.10.20

We should be on 0.11.2 anyways, so I'll test the change and push the fix soon.
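To check whether a given environment already has the fix, the installed version can be compared against 0.10.20. This is a minimal sketch using only the standard library; the helper names are illustrative, and the simple parser only looks at the leading numeric components of the version string.

```python
from importlib import metadata

FIXED_IN = (0, 10, 20)  # version named in this thread as containing the fix


def parse_version(text):
    """Turn '0.11.2' into (0, 11, 2); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(version_text, fixed_in=FIXED_IN):
    """True if the version is at least the one that contains the fix."""
    return parse_version(version_text) >= fixed_in


def installed_unstructured_version():
    """Return the installed 'unstructured' version string, or None."""
    try:
        return metadata.version("unstructured")
    except metadata.PackageNotFoundError:
        return None
```

Running has_fix(installed_unstructured_version()) inside the Batch container image, after checking for None, would show whether the image still ships an affected version.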

alexeyshishkin01 commented 11 months ago

this is really good news ! how long can it take for you to get the fix and test it (approximately) ?

aws-samples / aws-genai-llm-chatbot

Error during upload of files in the Workspace with Aurora Serverless Workspace Engine #185