aws-samples / aws-genai-conversational-rag-reference

AWS Generative AI Conversational RAG Reference (Galileo)
https://aws-samples.github.io/aws-genai-conversational-rag-reference
Apache License 2.0
73 stars, 24 forks

[BUG] Failed in embedding sample dataset in vector store #55

Closed sociengineer closed 1 year ago

sociengineer commented 1 year ago

Describe the bug

The SageMaker Processing Job fails to embed the sample dataset into the vector store when it runs in the Pipeline state machine.

Log from the Processing job:

INSERT INTO all_mpnet_base_v2_768 (id, source_location, document, cmetadata, embeddings) VALUES \n('42af8ad7-db34-40ff-9241-e4241358ec54','s3://dev-galileo-corpusnested-processeddatabucket4e25d-30b26d922em9/cases/30/case1236.txt','According to the State and NMWA, NEPA requires BLM to complete a supplemental EIS specifically analyzing the likely environmental effects of Alternative A-modified before adopting that alternative as the new management plan for the area, and its failure to do so was arbitrary and capricious.   An agency must prepare a supplemental assessment if “[t]he agency makes substantial changes in the proposed action that are relevant to environmental concerns.” 24  40 C.F.R. § 1502.9(c)(1)(i) (emphases added).   When “the relevant environmental impacts have already been considered” earlier in the NEPA process, no supplement is required.  Friends of Marolt Park v. U.S. Dep''t of Transp., 382 F.3d 1088, 1096-97 (10th Cir.2004).   In a guide to NEPA published in the Federal Register, the CEQ states that a supplement is unnecessary when the new alternative is “qualitatively within the spectrum of alternatives that were discussed in the draft” and is only a “minor variation” from those alternatives.','{\"example\":\"True\",\"category_id\":\"30\",\"domain\":\"Legal\",\"original_source_url\":\"https://osf.io/qvg8s/files/osfstorage\",\"category\":\"Environmental Law\",\"asset_key_prefix\":\"/cases/30/\",\"collection\":\"casefiles\",\"original_location\":\"https://osf.io/8mjcy#preprocessed_cases[cases_29404]/30\",\"original_source\":\"OSF: SigmaLaw - Large Legal Text Corpus and Word 
Embeddings\",\"source_location\":\"s3://dev-galileo-corpusnested-processeddatabucket4e25d-30b26d922em9/cases/30/case1236.txt\",\"section_index\":71}',array[0.005580055993050337,0.04951924830675125,0.007071380503475666,-0.03756583854556084,-0.017889730632305145,0.009061774238944054,-0.051643624901771545,0.0003611240827012807,-0.09720514714717865,-0.04086209461092949,0.037306345999240875,0.06428683549165726,0.01733892597... (2000 of 17567251)"

Expected Behavior

Successfully embedded sample dataset in the vector store.

Current Behavior

{
  "error": "States.TaskFailed",
  "cause": "{\"SdkResponseMetadata\":null,\"SdkHttpMetadata\":null,\"ProcessingInputs\":[{\"InputName\":\"corpus\",\"AppManaged\":false,\"S3Input\":{\"S3Uri\":\"s3://dev-galileo-corpusnested-processeddatabucket4e25d-30b26d922em9/\",\"LocalPath\":\"/opt/ml/processing/input_data\",\"S3DataType\":\"S3_PREFIX\",\"S3InputMode\":\"FILE\",\"S3DataDistributionType\":\"SHARDEDBYS3KEY\",\"S3CompressionType\":null},\"DatasetDefinition\":null}],\"ProcessingOutputConfig\":null,\"ProcessingJobName\":\"5d795bfb-0f59-4b07-8d10-f8e18510af68\",\"ProcessingResources\":{\"ClusterConfig\":{\"InstanceCount\":5,\"InstanceType\":\"ml.g4dn.2xlarge\",\"VolumeSizeInGB\":10,\"VolumeKmsKeyId\":null}},\"StoppingCondition\":{\"MaxRuntimeInSeconds\":30000},\"AppSpecification\":{\"ImageUri\":\"************.dkr.ecr.us-east-1.amazonaws.com/cdk-hnb659fds-container-assets-************-us-east-1:4a721916c12c47cdeb2f88f6791fb3894f52ad3fe8e712a736bc758dab84699a\",\"ContainerEntrypoint\":null,\"ContainerArguments\":[\"sagemaker\"]},\"Environment\":{\"POWERTOOLS_SERVICE_NAME\":\"Dev-Galileo\",\"RDS_PGVECTOR_PROXY_ENDPOINT\":\"primaryproxy.proxy-cel3yq4w8faz.us-east-1.rds.amazonaws.com\",\"INDEXING_BUCKET\":\"dev-galileo-corpusnested-processeddatabucket4e25d-30b26d922em9\",\"POWERTOOLS_METRICS_NAMESPACE\":\"Dev-Galileo\",\"RDS_PGVECTOR_TLS_ENABLED\":\"0\",\"PGSSLMODE\":\"prefer\",\"AWS_DEFAULT_REGION\":\"us-east-1\",\"RDS_PGVECTOR_IAM_AUTH\":\"0\",\"INDEXING_CACHE_TABLE\":\"Dev-Galileo-CorpusNestedStackCorpusNestedStackResource5645BA6D-T8LDHTNTB5HO-CacheTableC1E6DF7E-178KHXNUHOGUB\",\"RDS_PGVECTOR_STORE_SECRET\":\"VectorStoreClusterSecretC38-HIw3qmky1Mqn\",\"INDEXING_SKIP_DELTA_CHECK\":\"1\",\"LOG_LEVEL\":\"DEBUG\"},\"NetworkConfig\":{\"EnableInterContainerTrafficEncryption\":false,\"EnableNetworkIsolation\":false,\"VpcConfig\":{\"SecurityGroupIds\":[\"sg-064463e6a4913145f\"],\"Subnets\":[\"subnet-0273fde3f55611e0e\",\"subnet-065a9a4b4452e5d12\",\"subnet-0427d48feba1cfa07\"]}},\"RoleArn\":\"arn:aws:iam::****
********:role/Dev-Galileo-CorpusNestedS-PipelineProcessingRole66-1PG29W5DIA4TP\",\"ExperimentConfig\":{\"ExperimentName\":null,\"TrialName\":null,\"TrialComponentDisplayName\":null,\"RunName\":null},\"ProcessingJobArn\":\"arn:aws:sagemaker:us-east-1:************:processing-job/5d795bfb-0f59-4b07-8d10-f8e18510af68\",\"ProcessingJobStatus\":\"Stopped\",\"ExitMessage\":null,\"FailureReason\":null,\"ProcessingEndTime\":1696555857000,\"ProcessingStartTime\":1696525515000,\"LastModifiedTime\":1696555886000,\"CreationTime\":1696525220000,\"LastModifiedBy\":null,\"CreatedBy\":null,\"MonitoringScheduleArn\":null,\"AutoMLJobArn\":null,\"TrainingJobArn\":null,\"Tags\":{\"AWS_STEP_FUNCTIONS_EXECUTION_ARN\":\"arn:aws:states:us-east-1:************:execution:PipelineStateMachine1814BD05-IIOD7bcaQzY8:0bbb5c3c-a393-46a5-9468-f9e34f9c8146\",\"MANAGED_BY_AWS\":\"STARTED_BY_STEP_FUNCTIONS\"}}"
}
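
The `cause` field above is itself a JSON-encoded string, which makes it hard to read at a glance. Here is a minimal sketch, assuming the payload has the shape shown above, that decodes it and compares the job's actual runtime with its `MaxRuntimeInSeconds`. The helper name `summarize_processing_failure` is hypothetical, and the sample is trimmed to the four fields it needs. With the values from this report, the job ran for 30342 seconds against a 30000-second limit. That would be consistent with the `Stopped` status, though the payload itself gives no `FailureReason`.

```python
import json


def summarize_processing_failure(error_payload: dict) -> dict:
    """Decode the nested `cause` JSON of a States.TaskFailed payload.

    Hypothetical helper; assumes the field names seen in the report above.
    """
    cause = json.loads(error_payload["cause"])

    # Timestamps in the payload are epoch milliseconds.
    runtime_s = (cause["ProcessingEndTime"] - cause["ProcessingStartTime"]) / 1000
    max_runtime_s = cause["StoppingCondition"]["MaxRuntimeInSeconds"]

    return {
        "status": cause["ProcessingJobStatus"],
        "runtime_seconds": runtime_s,
        "max_runtime_seconds": max_runtime_s,
        "exceeded_max_runtime": runtime_s >= max_runtime_s,
    }


# Trimmed payload containing only values copied from the report above.
sample = {
    "error": "States.TaskFailed",
    "cause": json.dumps(
        {
            "ProcessingJobStatus": "Stopped",
            "StoppingCondition": {"MaxRuntimeInSeconds": 30000},
            "ProcessingStartTime": 1696525515000,
            "ProcessingEndTime": 1696555857000,
        }
    ),
}

summary = summarize_processing_failure(sample)
print(summary)
```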

Reproduction Steps

Deploy the sample dataset stack.

Possible Solution

No response

Additional Information/Context

No response

Environment details (OS name and version, etc.)

No response

sociengineer commented 1 year ago

When I send a message, there is no response.

sociengineer commented 1 year ago

@JeremyJonas Re-executing the Step Functions state machine got the processing job through, but the VectorStoreIndexTasK step then failed:

The cause could not be determined because Lambda did not return an error type. Returned payload: {"errorMessage":"2023-10-23T17:23:38.340Z ed6714b3-dba5-4f6a-b3a1-1ab0d150e4a0 Task timed out after 900.02 seconds"}
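
The 900.02 seconds in that message lines up with Lambda's 15-minute maximum timeout, which suggests the function hit the hard limit, not a lower configured one. This small sketch, assuming the message format shown above, extracts the elapsed time and checks it against that limit. The function name is hypothetical.

```python
import re

# Lambda's maximum configurable timeout is 15 minutes.
LAMBDA_MAX_TIMEOUT_S = 900


def lambda_timeout_seconds(error_message: str):
    """Return the elapsed seconds from a Lambda timeout message, or None."""
    match = re.search(r"Task timed out after ([\d.]+) seconds", error_message)
    return float(match.group(1)) if match else None


# Message copied from the returned payload above.
msg = (
    "2023-10-23T17:23:38.340Z ed6714b3-dba5-4f6a-b3a1-1ab0d150e4a0 "
    "Task timed out after 900.02 seconds"
)

elapsed = lambda_timeout_seconds(msg)
hit_hard_limit = elapsed is not None and elapsed >= LAMBDA_MAX_TIMEOUT_S
print(elapsed, hit_hard_limit)
```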

JeremyJonas commented 1 year ago

Fixed by #66

@sociengineer - Changed the indexing processing job from a Lambda function (15-minute maximum) to an ECS task (no timeout) so that the long-running RDS indexing task completes as part of the pipeline step function.
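
For readers unfamiliar with the pattern, the sketch below shows roughly what a Step Functions task state looks like when it runs an ECS task and waits for it to finish, using the `ecs:runTask.sync` service integration. It is an illustration only, not the actual change made in #66. The function name, ARNs, subnet and security group IDs are hypothetical placeholders, and Fargate is assumed as the launch type.

```python
import json


def ecs_index_task_state(cluster_arn, task_definition_arn, subnets,
                         security_groups, next_state=None):
    """Build an Amazon States Language task state that runs an ECS task.

    Hypothetical helper; the `.sync` suffix makes Step Functions wait for
    the ECS task to stop before moving on.
    """
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "LaunchType": "FARGATE",  # assumption: Fargate, not EC2
            "Cluster": cluster_arn,
            "TaskDefinition": task_definition_arn,
            "NetworkConfiguration": {
                "AwsvpcConfiguration": {
                    "Subnets": subnets,
                    "SecurityGroups": security_groups,
                }
            },
        },
    }
    # A state must either name its successor or end the execution.
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


# Placeholder identifiers, not taken from the deployment in this issue.
state = ecs_index_task_state(
    cluster_arn="arn:aws:ecs:us-east-1:123456789012:cluster/example-cluster",
    task_definition_arn="arn:aws:ecs:us-east-1:123456789012:task-definition/example-indexing:1",
    subnets=["subnet-00000000000000000"],
    security_groups=["sg-00000000000000000"],
)
print(json.dumps(state, indent=2))
```

Unlike a Lambda invocation, a state built this way is not bound by the 15-minute Lambda limit, so the task's duration is governed by the ECS task itself and any `TimeoutSeconds` set on the state.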