aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

An error occurred (ValidationError) when calling the PutRecord operation: Resource Not Found: Amazon SageMaker can't find a FeatureGroup #2606

Closed oonisim closed 10 months ago

oonisim commented 3 years ago

**Describe the bug**
The Feature Group ingest method sporadically fails to insert some records and throws the exception "Failed to ingest row 0: An error occurred (ValidationError) when calling the PutRecord operation: Resource Not Found: Amazon SageMaker can't find a FeatureGroup with name", although the FeatureGroup exists. Sometimes one record fails, sometimes several, and there is no specific pattern.

Querying from the Feature Group returns records, so the FeatureGroup is there and records have been inserted.

feature_store_query.run(
    query_string=query_string,
    output_location=feature_group_query_uri,
)

feature_store_query.wait()
feature_store_query.as_dataframe().head()
-----
review_id | star_rating | review_date
-- | -- | --
RM4XTAWT3FV8S | 1 | 2015-08-03T00:00:00Z
RM4XTAWT3FV8S | 1 | 2015-08-03T00:00:00Z
RM4XTAWT3FV8S | 1 | 2015-08-03T00:00:00Z
R195KUJIQS3UR7 | 5 | 2015-08-03T00:00:00Z
R195KUJIQS3UR7 | 5 | 2015-08-03T00:00:00Z

**To reproduce**

  1. Open SageMaker Studio in us-east-1 in a non-VPC deployment.
  2. Run the code below; feature_group.ingest(...) raises the exception in SageMaker Studio.
    
    import time
    import json
    import multiprocessing

    import sagemaker
    from sagemaker.session import Session
    from sagemaker import get_execution_role

    NUM_CPUS = multiprocessing.cpu_count()
    role = get_execution_role()
    session = sagemaker.Session()
    region = session.boto_region_name
    bucket = session.default_bucket()

    from sagemaker.feature_store.feature_definition import (
        FeatureDefinition,
        FeatureTypeEnum,
    )

    feature_definitions = [
        FeatureDefinition(feature_name="review_id", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="review_date", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="star_rating", feature_type=FeatureTypeEnum.INTEGRAL),
    ]

    feature_group_prefix = "sagemaker-feature-group"
    feature_group_name = "amazon-product-review"
    feature_group_offline_uri = f"s3://{bucket}/{feature_group_prefix}/{feature_group_name}/features"
    feature_group_query_uri = f"s3://{bucket}/{feature_group_prefix}/{feature_group_name}/queries"

    record_identifier_feature_name = "review_id"
    event_time_feature_name = "review_date"

    from sagemaker.feature_store.feature_group import FeatureGroup

    feature_group = FeatureGroup(
        name=feature_group_name,
        feature_definitions=feature_definitions,
        sagemaker_session=session
    )

    def wait_for_feature_group_creation_complete(feature_group):
        status = feature_group.describe().get("FeatureGroupStatus")
        print("Waiting for Feature Group Creation")
        print("Feature Group status: {}".format(status))
        while status == "Creating":
            time.sleep(5)
            status = feature_group.describe().get("FeatureGroupStatus")
            print("Feature Group status: {}".format(status))

        if status != "Created":
            print("Feature Group creation failed. Status: {}".format(status))
            raise RuntimeError(f"Failed to create feature group {feature_group.name}")
        else:
            print(f"FeatureGroup {feature_group.name} successfully created.")

    try:
        print("Creating Feature Group with role {}...".format(role))
        response = feature_group.create(
            s3_uri=feature_group_offline_uri,
            record_identifier_name=record_identifier_feature_name,
            event_time_feature_name=event_time_feature_name,
            role_arn=role,
            enable_online_store=True,
        )

        print("Waiting for new Feature Group to become available...")
        wait_for_feature_group_creation_complete(feature_group)
        feature_group.describe()

        print("Creating Feature Group. Completed.")

    except Exception as e:
        raise RuntimeError("Feature Group creation failed: {}".format(e)) from e

    client = session.boto_session.client(
        "sagemaker",
        region_name=region
    )
    client.list_feature_groups()
    feature_group.describe()

    import pandas as pd
    import s3fs

    amazon_product_review_bucket = "amazon-reviews-pds"
    generator = pd.read_csv(
        f"s3://{amazon_product_review_bucket}/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
        header=0,
        usecols=["review_id", "star_rating", "review_date"],
        parse_dates=["review_date"],
        sep='\t',
        compression="gzip",
        chunksize=1024 * (NUM_CPUS - 1) * 3
    )

    df = next(generator)
    df.dropna(inplace=True)
    df['review_date'] = df['review_date'].dt.strftime('%Y-%m-%dT%H:%M:%SZ')
    df.head()

    feature_group.describe()
    feature_group.ingest(  # <---------- Cause the error
        data_frame=df,
        max_processes=1,
        max_workers=3,
        wait=True
    )

    feature_store_query = feature_group.athena_query()
    feature_store_table = feature_store_query.table_name

    query_string = """
    SELECT review_id, star_rating, review_date
    FROM "{}"
    LIMIT 10
    """.format(
        feature_store_table
    )

    print("Running " + query_string)

    feature_store_query.run(
        query_string=query_string,
        output_location=feature_group_query_uri,
    )

    feature_store_query.wait()
    feature_store_query.as_dataframe()


**Expected behavior**
All the records get inserted successfully.

**Screenshots or logs**

```
Failed to ingest row 0: An error occurred (ValidationError) when calling the PutRecord operation: Resource Not Found: Amazon SageMaker can't find a FeatureGroup with name [amazon-product-review].
Failed to ingest row 0 to 1024

IngestionError                            Traceback (most recent call last)
in
      4     max_workers=3,
      5     # timeout=3,
----> 6     wait=True
      7 )

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in ingest(self, data_frame, max_workers, max_processes, wait, timeout)
    596         )
    597
--> 598         manager.run(data_frame=data_frame, wait=wait, timeout=timeout)
    599
    600         return manager

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in run(self, data_frame, wait, timeout)
    347                 if timeout is reached.
    348         """
--> 349         self._run_multi_process(data_frame=data_frame, wait=wait, timeout=timeout)
    350
    351

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in _run_multi_process(self, data_frame, wait, timeout)
    290
    291         if wait:
--> 292             self.wait(timeout=timeout)
    293
    294     def _run_multi_threaded(self, data_frame: DataFrame, row_offset=0, timeout=None) -> List[int]:

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in wait(self, timeout)
    259             raise IngestionError(
    260                 self._failed_indices,
--> 261                 f"Failed to ingest some data into FeatureGroup {self.feature_group_name}",
    262             )
    263

IngestionError: [0] -> Failed to ingest some data into FeatureGroup amazon-product-review
```

```
Failed to ingest row 0: An error occurred (ValidationError) when calling the PutRecord operation: Resource Not Found: Amazon SageMaker can't find a FeatureGroup with name [amazon-product-review].
Failed to ingest row 0 to 1024
---------------------------------------------------------------------------
IngestionError                            Traceback (most recent call last)
in
      4     max_workers=3,
      5     timeout=30,
----> 6     wait=True
      7 )

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in ingest(self, data_frame, max_workers, max_processes, wait, timeout)
    596         )
    597
--> 598         manager.run(data_frame=data_frame, wait=wait, timeout=timeout)
    599
    600         return manager

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in run(self, data_frame, wait, timeout)
    347                 if timeout is reached.
    348         """
--> 349         self._run_multi_process(data_frame=data_frame, wait=wait, timeout=timeout)
    350
    351

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in _run_multi_process(self, data_frame, wait, timeout)
    290
    291         if wait:
--> 292             self.wait(timeout=timeout)
    293
    294     def _run_multi_threaded(self, data_frame: DataFrame, row_offset=0, timeout=None) -> List[int]:

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in wait(self, timeout)
    259             raise IngestionError(
    260                 self._failed_indices,
--> 261                 f"Failed to ingest some data into FeatureGroup {self.feature_group_name}",
    262             )
    263

IngestionError: [0] -> Failed to ingest some data into FeatureGroup amazon-product-review
```

**System information**
A description of your system. Please provide:
- **SageMaker Python SDK version**: '2.49.1'
- **Framework name (eg. PyTorch) or algorithm (eg. KMeans)**: NA
- **Framework version**:
- **Python version**: 3.7.10 (default, Jun 4 2021, 14:48:32) [GCC 7.5.0]
- **CPU or GPU**: CPU (SageMaker Studio DataScience kernel)
- **Custom Docker image (Y/N)**: N

**Additional context**
Add any other context about the problem here.

liyunrui commented 2 years ago

I got the same issue. Has it been solved?

Pooja-Karangale commented 1 year ago

I am facing the same issue. Is there any lead on a solution? Can someone help me with another way of ingesting data into a feature group?
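
One possible alternative to FeatureGroup.ingest is to write records one at a time through the PutRecord API, which is the operation named in the error message. The sketch below assumes a boto3 "sagemaker-featurestore-runtime" client is passed in; the helper names row_to_record and put_rows are hypothetical and not part of the SDK. It illustrates the approach only and is not a confirmed fix for the error reported here.

```python
from typing import Any, Dict, Iterable, List


def row_to_record(row: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert one row (feature name -> value) into the PutRecord 'Record' shape.

    Missing values (None) are skipped rather than sent as strings.
    """
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None
    ]


def put_rows(client, feature_group_name: str, rows: Iterable[Dict[str, Any]]) -> List[int]:
    """Write rows one by one and return the positions of the rows that failed.

    `client` is assumed to be boto3.client("sagemaker-featurestore-runtime").
    """
    failed = []
    for position, row in enumerate(rows):
        try:
            client.put_record(
                FeatureGroupName=feature_group_name,
                Record=row_to_record(row),
            )
        except Exception as exc:  # kept broad for the sketch; narrow it in real code
            print(f"Row {position} failed: {exc}")
            failed.append(position)
    return failed
```

With the DataFrame from the reproduction steps, a call could look like put_rows(client, "amazon-product-review", df.to_dict("records")). Returning the failed positions keeps the decision about what to do with them (retry, log, abort) with the caller.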

psnilesh commented 1 year ago

This looks like a service issue, not an SDK issue.

@Pooja-Karangale would you have some IDs of failed requests, along with the time and region, that can help us debug?

EDIT: Looks like boto3 won't log request IDs even for exceptional cases unless verbose logging is enabled. If you have request IDs, please share them here. Otherwise, open a case with AWS support where you can share more sensitive details like Account ID that'll help with the investigation.
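
A minimal sketch of how a request ID might be captured without verbose logging, assuming the failing call is made directly so that the botocore ClientError reaches the caller (FeatureGroup.ingest only prints the failure). botocore attaches the parsed service response to the exception as `response`; the helper name error_code_and_request_id is hypothetical.

```python
from typing import Any, Optional, Tuple


def error_code_and_request_id(exc: BaseException) -> Tuple[Optional[str], Optional[str]]:
    """Pull the error code and request ID out of a botocore ClientError-style exception.

    botocore's ClientError carries the parsed service response on `exc.response`;
    an exception without that attribute yields (None, None).
    """
    response: Any = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code")
    request_id = response.get("ResponseMetadata", {}).get("RequestId")
    return code, request_id


# Hypothetical usage around a direct PutRecord call (names are assumptions):
# try:
#     client.put_record(FeatureGroupName=name, Record=record)
# except botocore.exceptions.ClientError as exc:
#     print(error_code_and_request_id(exc))
```

Recording the timestamp and region next to the returned pair would give the details requested above.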

OmarDispatch commented 1 year ago

Does anyone know if this was fixed or if there is a workaround? I have the same problem.
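
The thread does not record a confirmed fix. Since the failures are described as sporadic, one generic approach is to retry the failing call with exponential backoff. The sketch below assumes the error is transient, which is unverified; call_with_retries and its parameters are hypothetical names, not part of the SDK.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying with exponential backoff on the given exceptions."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error unchanged
            sleep(base_delay * 2 ** (attempt - 1))  # delay doubles after each failure
    raise ValueError("max_attempts must be at least 1")
```

The operation is passed as a zero-argument callable, for example a lambda wrapping a single PutRecord call, so the helper stays independent of any particular client. Narrowing retry_on to the specific exception type avoids retrying errors that will never succeed.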

jiapinw commented 1 year ago

If you see this issue consistently, please share the request IDs, region and a timeframe when you encountered this issue. Alternatively, you can open a case with AWS support to provide more sensitive details.

mufaddal-rohawala commented 10 months ago

Thank you for opening this issue. Closing this issue as per the above comment. Please feel free to reopen if you continue to see this issue with the latest sagemaker version.