aws / sagemaker-feature-store-spark


TypeError: 'JavaPackage' object is not callable when calling FeatureStoreManager #20

Open francescocamussoni opened 1 year ago

francescocamussoni commented 1 year ago

Hello, I have code that is pretty similar to the example provided by AWS: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-ingestion-spark-connector-setup.html

This is my Java version:

java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)

from pyspark.sql import SparkSession
from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager
import feature_store_pyspark

spark = SparkSession.builder \
                    .getOrCreate()

df = spark.createDataFrame(orders_data).toDF(*orders_data.columns)

# Initialize FeatureStoreManager with a role arn if your feature group is created by another account
feature_store_manager= FeatureStoreManager('<MY_ROLE_ARN>')

# Load the feature definitions from input schema. The feature definitions can be used to create a feature group
feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)

But I get this error:

23/08/24 13:42:11 DEBUG FileSystem: Loading filesystems
23/08/24 13:42:11 DEBUG FileSystem: file:// = class org.apache.hadoop.fs.LocalFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hadoop-client-api-3.3.4.jar
23/08/24 13:42:11 DEBUG FileSystem: viewfs:// = class org.apache.hadoop.fs.viewfs.ViewFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hadoop-client-api-3.3.4.jar
23/08/24 13:42:11 DEBUG FileSystem: har:// = class org.apache.hadoop.fs.HarFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hadoop-client-api-3.3.4.jar
23/08/24 13:42:11 DEBUG FileSystem: http:// = class org.apache.hadoop.fs.http.HttpFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hadoop-client-api-3.3.4.jar
23/08/24 13:42:11 DEBUG FileSystem: https:// = class org.apache.hadoop.fs.http.HttpsFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hadoop-client-api-3.3.4.jar
23/08/24 13:42:11 DEBUG FileSystem: hdfs:// = class org.apache.hadoop.hdfs.DistributedFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hadoop-client-api-3.3.4.jar
23/08/24 13:42:11 DEBUG FileSystem: webhdfs:// = class org.apache.hadoop.hdfs.web.WebHdfsFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hadoop-client-api-3.3.4.jar
23/08/24 13:42:11 DEBUG FileSystem: swebhdfs:// = class org.apache.hadoop.hdfs.web.SWebHdfsFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hadoop-client-api-3.3.4.jar
23/08/24 13:42:11 DEBUG FileSystem: nullscan:// = class org.apache.hadoop.hive.ql.io.NullScanFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hive-exec-2.3.9-core.jar
23/08/24 13:42:11 DEBUG FileSystem: file:// = class org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem from /usr/local/lib/python3.8/site-packages/pyspark/jars/hive-exec-2.3.9-core.jar
23/08/24 13:42:11 DEBUG FileSystem: Looking for FS supporting file
23/08/24 13:42:11 DEBUG FileSystem: looking for configuration option fs.file.impl
23/08/24 13:42:11 DEBUG FileSystem: Looking in service filesystems for implementation class
23/08/24 13:42:11 DEBUG FileSystem: FS for file is class org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem
23/08/24 13:42:11 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
23/08/24 13:42:11 DEBUG SharedState: Applying other initial session options to HadoopConf: spark.jars.packages -> hadoop-aws-3.2.1-javadoc.jar,hadoop-common-3.2.1.jar
23/08/24 13:42:11 DEBUG SharedState: Applying other initial session options to HadoopConf: spark.jars -> /usr/local/lib/python3.8/site-packages/feature_store_pyspark/jars/sagemaker-feature-store-spark-sdk.jar
23/08/24 13:42:11 DEBUG FileSystem: Starting: Acquiring creator semaphore for file:/home/sagemaker-user/featurestore/aws/spark-warehouse
23/08/24 13:42:11 DEBUG FileSystem: Acquiring creator semaphore for file:/home/sagemaker-user/featurestore/aws/spark-warehouse: duration 0:00.001s
23/08/24 13:42:11 DEBUG FileSystem: Starting: Creating FS file:/home/sagemaker-user/featurestore/aws/spark-warehouse
23/08/24 13:42:11 DEBUG FileSystem: Looking for FS supporting file
23/08/24 13:42:11 DEBUG FileSystem: looking for configuration option fs.file.impl
23/08/24 13:42:11 DEBUG FileSystem: Looking in service filesystems for implementation class
23/08/24 13:42:11 DEBUG FileSystem: FS for file is class org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem
23/08/24 13:42:11 DEBUG FileSystem: Creating FS file:/home/sagemaker-user/featurestore/aws/spark-warehouse: duration 0:00.006s
23/08/24 13:42:11 INFO SharedState: Warehouse path is 'file:/home/sagemaker-user/featurestore/aws/spark-warehouse'.
23/08/24 13:42:11 DEBUG FsUrlStreamHandlerFactory: Creating handler for protocol jar
23/08/24 13:42:11 DEBUG FileSystem: Looking for FS supporting jar
23/08/24 13:42:11 DEBUG FileSystem: looking for configuration option fs.jar.impl
23/08/24 13:42:11 DEBUG FsUrlStreamHandlerFactory: Creating handler for protocol file
23/08/24 13:42:11 DEBUG FileSystem: Looking for FS supporting file
23/08/24 13:42:11 DEBUG FileSystem: looking for configuration option fs.file.impl
23/08/24 13:42:11 DEBUG FileSystem: Looking in service filesystems for implementation class
23/08/24 13:42:11 DEBUG FileSystem: FS for file is class org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem
23/08/24 13:42:11 DEBUG FsUrlStreamHandlerFactory: Found implementation of file: class org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem
23/08/24 13:42:11 DEBUG FsUrlStreamHandlerFactory: Using handler for protocol file
23/08/24 13:42:11 DEBUG FileSystem: Looking in service filesystems for implementation class
23/08/24 13:42:11 DEBUG FsUrlStreamHandlerFactory: Unknown protocol jar, delegating to default implementation
23/08/24 13:42:13 DEBUG ClosureCleaner: Cleaning indylambda closure: $anonfun$pythonToJava$1
23/08/24 13:42:13 DEBUG ClosureCleaner:  +++ indylambda closure ($anonfun$pythonToJava$1) is now cleaned +++
23/08/24 13:42:13 DEBUG ClosureCleaner: Cleaning indylambda closure: $anonfun$toJavaArray$1
23/08/24 13:42:13 DEBUG ClosureCleaner:  +++ indylambda closure ($anonfun$toJavaArray$1) is now cleaned +++
23/08/24 13:42:13 DEBUG ClosureCleaner: Cleaning indylambda closure: $anonfun$applySchemaToPythonRDD$1
23/08/24 13:42:13 DEBUG ClosureCleaner:  +++ indylambda closure ($anonfun$applySchemaToPythonRDD$1) is now cleaned +++
23/08/24 13:42:13 DEBUG CatalystSqlParser: Parsing command: spark_grouping_id
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [23], in <cell line: 11>()
      8 df = spark.createDataFrame(orders_data).toDF(*orders_data.columns)
     10 # Initialize FeatureStoreManager with a role arn if your feature group is created by another account
---> 11 feature_store_manager= FeatureStoreManager('<MY_ROLE_ARN>')
     13 # Load the feature definitions from input schema. The feature definitions can be used to create a feature group
     14 feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)

File /usr/local/lib/python3.8/site-packages/feature_store_pyspark/FeatureStoreManager.py:32, in FeatureStoreManager.__init__(self, assume_role_arn)
     30 def __init__(self, assume_role_arn: string = None):
     31     super(FeatureStoreManager, self).__init__()
---> 32     self._java_obj = self._new_java_obj(FeatureStoreManager._wrapped_class, assume_role_arn)

File /usr/local/lib/python3.8/site-packages/feature_store_pyspark/wrapper.py:45, in SageMakerFeatureStoreJavaWrapper._new_java_obj(self, java_class, *args)
     42 for arg in args:
     43     java_args.append(self._py2j(arg))
---> 45 return JavaWrapper._new_java_obj(java_class, *java_args)

File /usr/local/lib/python3.8/site-packages/pyspark/ml/wrapper.py:86, in JavaWrapper._new_java_obj(java_class, *args)
     84     java_obj = getattr(java_obj, name)
     85 java_args = [_py2java(sc, arg) for arg in args]
---> 86 return java_obj(*java_args)

TypeError: 'JavaPackage' object is not callable

I've also tried specifying the jars that I have locally:

from pyspark.sql import SparkSession
from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager
import feature_store_pyspark

extra_jars = ",".join(feature_store_pyspark.classpath_jars())

spark = SparkSession.builder \
    .config("spark.jars", extra_jars) \
    .config("spark.jars.packages", "hadoop-aws-3.2.1-javadoc.jar,hadoop-common-3.2.1.jar").getOrCreate()

df = spark.createDataFrame(orders_data).toDF(*orders_data.columns)

# Initialize FeatureStoreManager with a role arn if your feature group is created by another account
feature_store_manager= FeatureStoreManager('<MY_ROLE_ARN>')

# Load the feature definitions from input schema. The feature definitions can be used to create a feature group
feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)

But I still end up with the same error:

23/08/24 13:55:09 DEBUG ClosureCleaner: Cleaning indylambda closure: $anonfun$pythonToJava$1
23/08/24 13:55:09 DEBUG ClosureCleaner:  +++ indylambda closure ($anonfun$pythonToJava$1) is now cleaned +++
23/08/24 13:55:09 DEBUG ClosureCleaner: Cleaning indylambda closure: $anonfun$toJavaArray$1
23/08/24 13:55:09 DEBUG ClosureCleaner:  +++ indylambda closure ($anonfun$toJavaArray$1) is now cleaned +++
23/08/24 13:55:09 DEBUG ClosureCleaner: Cleaning indylambda closure: $anonfun$applySchemaToPythonRDD$1
23/08/24 13:55:09 DEBUG ClosureCleaner:  +++ indylambda closure ($anonfun$applySchemaToPythonRDD$1) is now cleaned +++
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [25], in <cell line: 14>()
     11 df = spark.createDataFrame(orders_data).toDF(*orders_data.columns)
     13 # Initialize FeatureStoreManager with a role arn if your feature group is created by another account
---> 14 feature_store_manager= FeatureStoreManager('<MY_ROLE_ARN>')
     16 # Load the feature definitions from input schema. The feature definitions can be used to create a feature group
     17 feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)

File /usr/local/lib/python3.8/site-packages/feature_store_pyspark/FeatureStoreManager.py:32, in FeatureStoreManager.__init__(self, assume_role_arn)
     30 def __init__(self, assume_role_arn: string = None):
     31     super(FeatureStoreManager, self).__init__()
---> 32     self._java_obj = self._new_java_obj(FeatureStoreManager._wrapped_class, assume_role_arn)

File /usr/local/lib/python3.8/site-packages/feature_store_pyspark/wrapper.py:45, in SageMakerFeatureStoreJavaWrapper._new_java_obj(self, java_class, *args)
     42 for arg in args:
     43     java_args.append(self._py2j(arg))
---> 45 return JavaWrapper._new_java_obj(java_class, *java_args)

File /usr/local/lib/python3.8/site-packages/pyspark/ml/wrapper.py:86, in JavaWrapper._new_java_obj(java_class, *args)
     84     java_obj = getattr(java_obj, name)
     85 java_args = [_py2java(sc, arg) for arg in args]
---> 86 return java_obj(*java_args)

TypeError: 'JavaPackage' object is not callable

Note that I've replaced the real ARN with '<MY_ROLE_ARN>' for this post :)
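
For context on the error itself: the TypeError is raised inside JavaWrapper._new_java_obj (see the traceback), and py4j produces it when the fully qualified Java class behind FeatureStoreManager._wrapped_class cannot be found on the driver JVM's classpath, so the name resolves to a bare JavaPackage instead of a constructable class. A minimal sketch of a check along those lines (it mirrors the getattr chain in pyspark/ml/wrapper.py shown above; the exact types printed are an assumption on my side):

from pyspark.sql import SparkSession
import feature_store_pyspark
from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager

# Start the session with the SDK jars explicitly on the classpath.
spark = SparkSession.builder \
    .config("spark.jars", ",".join(feature_store_pyspark.classpath_jars())) \
    .getOrCreate()

# Walk the fully qualified class name the wrapper tries to instantiate,
# the same way JavaWrapper._new_java_obj does in the traceback above.
java_obj = spark.sparkContext._jvm
for name in FeatureStoreManager._wrapped_class.split("."):
    java_obj = getattr(java_obj, name)

# A py4j JavaClass here means the SDK jar is visible to the JVM;
# a JavaPackage reproduces the "'JavaPackage' object is not callable" failure.
print(type(java_obj))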

MikelMoli commented 2 months ago

Had the same error and solved it by installing sagemaker-feature-store-pyspark-3.1 like this:

python -m pip install sagemaker-feature-store-pyspark-3.1 --no-binary :all: --no-cache-dir

Previously it wasn't working because I had installed it through a requirements.txt file, which didn't include the --no-binary :all: --no-cache-dir part.
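
A quick way to sanity-check the install afterwards (a minimal sketch, assuming classpath_jars() lists the bundled SDK jars, as in the snippets above) is to confirm the jar files actually exist on disk before building the session:

import os
import feature_store_pyspark

# The package is expected to ship the SDK jar(s) it puts on the Spark classpath;
# missing files here would explain the JavaPackage error above.
for jar in feature_store_pyspark.classpath_jars():
    print(jar, "exists" if os.path.exists(jar) else "MISSING")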