audienceproject / spark-dynamodb

Plug-and-play implementation of an Apache Spark custom data source for AWS DynamoDB.
Apache License 2.0

Infer schema in Python #70

Closed · rbabu7 closed this issue 4 years ago

rbabu7 commented 4 years ago

I don't see an option to provide a schema in PySpark, while the same option exists in Scala. Please let me know how to provide a class when reading data using PySpark.

jacobfi commented 4 years ago

Hello!

Thank you for using the library.

Currently the solution would be to specify the schema manually in Python as documented here: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

Example:

from pyspark.sql.types import StructType, StructField, StringType

fields = [StructField("someField", StringType(), True), StructField("someOtherField", StringType(), True)]
schema = StructType(fields)

dynamoDf = (spark.read
    .option("tableName", "SomeTableName")
    .schema(schema)  # <-- Here we specify the schema manually
    .format("dynamodb")
    .load())
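
As a quick sanity check (a minimal sketch, reusing the placeholder table and field names from above), you can verify that the manual schema took effect:

# Should list someField and someOtherField as nullable strings
dynamoDf.printSchema()
dynamoDf.select("someField").show(5)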

Does this solve your problem?

Thanks, Jacob

rbabu7 commented 4 years ago

Thank you Jacob, this works for me. Also, is there a way to specify the GSI when querying DynamoDB, or is the library intelligent enough to figure it out on its own based on the predicate provided?

jacobfi commented 4 years ago

Hi rbabu7

To read from a GSI you can use the following option:

dynamoDf = (spark.read
    .option("tableName", "SomeTableName")
    .option("indexName", "YourIndexName")  # <-- Here we specify the GSI
    .schema(schema)
    .format("dynamodb")
    .load())

Can we close the issue?

Thanks, Jacob

jacobfi commented 4 years ago

Unfortunately it is not yet intelligent enough to use Query over Scan, even for a global secondary index. See #62
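
To illustrate the current behavior (a hedged sketch; "someField" and "someValue" are placeholders): even a predicate on one of the index's key attributes is evaluated as part of a Scan of the index, not translated into a DynamoDB Query:

# The connector reads the GSI with a Scan and applies this predicate as a
# filter (pushed down where possible); it does not issue a DynamoDB Query.
filtered = dynamoDf.filter(dynamoDf["someField"] == "someValue")
filtered.show()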

rbabu7 commented 4 years ago

Thank you for your support