The same situation occurs with the amazon.titan-embed-text-v1 model.
We are adding documentation to this repo to facilitate such debugging.
In the meantime, file ingestion is handled via a Step Functions state machine that triggers AWS Batch jobs.
The first step is to look for a state machine prefixed with RagEnginesDataImportFileImportWorkflow - there you should see current and past jobs along with their statuses. To dig into a job's logs, you can open the Batch job and its log stream; a sketch of doing this programmatically follows.
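A minimal boto3 sketch of that lookup, assuming default credentials and region; the name prefix mirrors the console search above, while the exact suffix varies per deployment:

```python
import boto3

sfn = boto3.client("stepfunctions")

# Find the file-import state machine by its name prefix
# (pagination omitted for brevity).
machines = sfn.list_state_machines()["stateMachines"]
import_machines = [
    m for m in machines
    if m["name"].startswith("RagEnginesDataImportFileImportWorkflow")
]

for machine in import_machines:
    # List recent executions - current and past jobs with their status.
    executions = sfn.list_executions(
        stateMachineArn=machine["stateMachineArn"],
        maxResults=20,
    )["executions"]
    for ex in executions:
        print(ex["name"], ex["status"], ex["startDate"])
```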
Thanks! Approximately how long will it take to fix this issue?
Just FYI - I used the following PDF file for my upload tests: https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-ug.pdf
I have also tested PDF file uploads with OpenSearch - they fail as well. So far, PDF file uploads fail for both Aurora and OpenSearch and work well with Kendra.
I experienced the same: most file uploads fail, and even the files that were uploaded successfully do not seem to be actually available to the models, because the answer to even general questions about the context is "I don't know". I still have to dig deeper, but it would be nice to have more information about the workflow and the reasons for the failures.
When the upload process is successful and I ask the models about the content, the model recognizes the PDF but says "I don't have access to the content of the file". I have not found the reason yet; help appreciated. Regards
Here are my findings:
1. AWS Console -> Step Functions -> State machines -> 'RagEnginesDataImportFileImportWorkflowFileImportStateMachineXXXXXXX-YYYYYYYYYYYY' -> Executions -> choose your execution.
2. In the 'Graph view', choose the failed step 'FileImportJob', then click the 'Batch job' link to the right of it (just under FileImportJob); you'll be transferred to AWS Batch.
3. On the AWS Batch screen, under 'Job information', click on the 'Log stream name' RagEnginesDataImportFil-xxx and you'll be transferred to AWS CloudWatch.
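The same path can be scripted instead of clicked through; a sketch assuming the default AWS Batch log group `/aws/batch/job` and a job ID copied from the failed FileImportJob step:

```python
import boto3

batch = boto3.client("batch")
logs = boto3.client("logs")

# Job ID taken from the failed FileImportJob step (hypothetical placeholder).
job_id = "00000000-0000-0000-0000-000000000000"

# Look up the Batch job to get its CloudWatch log stream name.
job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
log_stream = job["container"]["logStreamName"]

# Fetch the log events; AWS Batch writes to /aws/batch/job by default.
events = logs.get_log_events(
    logGroupName="/aws/batch/job",
    logStreamName=log_stream,
    startFromHead=True,
)["events"]

for event in events:
    print(event["message"])
```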
1. Upload file https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/cdk.json - the result is failed.
From AWS CloudWatch:
2. Upload file https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-ug.pdf - the result is failed.
From AWS CloudWatch:
3. Upload file https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/README.md - the result is success.
From AWS CloudWatch:
I have successfully uploaded the following files:
4. https://raw.githubusercontent.com/zalandoresearch/feidegger/master/data/FEIDEGGER_release_1.2.json
5. https://github.com/datablist/sample-csv-files/raw/main/files/customers/customers-100.csv
Just FYI, I had a similar issue with a PDF file: I exported the PDF as a .docx and then uploaded that. It processed fine. I'm not sure exactly what is wrong, but this could be a temporary workaround in case you are running into the same issue.
Yeah, it does the same for me with certain PDF and PPT files. Out of the 300-ish docs I've uploaded, it has happened on 11 of them. The common error is:

```
calculate_intersection_area(element_bbox, annotation["bbox"])
ZeroDivisionError: float division by zero
```

Since it's using `docs = loader.load()` from LangChain's S3FileLoader, I don't know that there are many other options in that function. It may need to be swapped for something else - one possible fallback is sketched below.
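For illustration, a minimal sketch of such a swap, with hypothetical bucket/key names; `PyPDFLoader` (which parses with pypdf rather than `unstructured`, so it sidesteps the bounding-box math that divides by zero) is just one candidate replacement, not the project's actual fix:

```python
import tempfile

import boto3
# On newer LangChain versions these live in langchain_community.document_loaders.
from langchain.document_loaders import PyPDFLoader, S3FileLoader

bucket = "my-workspace-bucket"  # hypothetical names for illustration
key = "uploads/bedrock-ug.pdf"

try:
    # Default path: S3FileLoader delegates parsing to `unstructured`,
    # which is where the ZeroDivisionError above is raised.
    docs = S3FileLoader(bucket, key).load()
except ZeroDivisionError:
    # Fallback: download the object and parse it with pypdf instead
    # (requires the pypdf package).
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        boto3.client("s3").download_file(bucket, key, tmp.name)
        docs = PyPDFLoader(tmp.name).load()

print(f"Loaded {len(docs)} document chunks")
```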
May I ask you to look at https://github.com/aws-samples/aurora-postgresql-pgvector/tree/main/apgpgvector-streamlit ?
I have been uploading PDF files with it and so far have had no problems with the same PDF files that fail here.
This is really good news! Approximately how long will it take for you to implement and test the fix?
How can I troubleshoot errors during the upload of files to a Workspace with the Aurora Serverless workspace engine? Currently I cannot upload PDF/JSON files. I tried both the sentence-transformers/all-MiniLM-L6-v2 and intfloat/multilingual-e5-large models.