audienceproject / spark-dynamodb

Plug-and-play implementation of an Apache Spark custom data source for AWS DynamoDB.
Apache License 2.0
175 stars, 90 forks

DynamoDB's write mode is batch mode, but option "update" is "true" #99

Open eltbus opened 3 years ago

eltbus commented 3 years ago

I manually add a field with the current timestamp to use as a TTL in DynamoDB. Sometimes all the other fields in a row are unchanged, but I'd still like to extend the TTL.

To do so, I tried using PySpark's append mode with .option('update', 'true'), but items don't seem to get updated. So I tried PySpark's overwrite mode:

(df.write
    .mode('overwrite')
    .option('update', 'true')
    .option('tableName', 'MY_TABLE')
    .option('region', 'eu-west-1')
    .format('dynamodb')
    .save())

Sadly this does not work and raises the following error.

Traceback (most recent call last):
  File "/tmp/PYSPARK_FILE.py", line 80, in <module>
    main(spark)
  File "/tmp/PYSPARK_FILE.py", line 51, in main
    (df.write
  File "/home/spark/spark-3.0.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 825, in save
  File "/home/spark/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/home/spark/spark-3.0.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Table MY_TABLE does not support truncate in batch mode.;;
OverwriteByExpression RelationV2[<fields>] MY_TABLE, true, Map(update -> true, tableName -> MY_TABLE, region -> eu-west-1), true

This left me wondering... why does it say "batch mode"? Is it not accepting DynamoDB's updateItem mode?

Additional packages used:

eltbus commented 3 years ago

Never mind... this is correctly updating (overwriting) the item:

(df.write
    .mode('append')
    .option('tableName', 'MY_TABLE')
    .option('region', 'eu-west-1')
    .format('dynamodb')
    .save())

I probably have an error somewhere else.

But I guess I still don't know if .option('update', 'true') works, or if Batch mode is always used.
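
For reference, here is a minimal plain-Python sketch of the two DynamoDB write semantics in question. It does not call DynamoDB or spark-dynamodb; the toy table, the "id" key, and the helper names (ttl_epoch, put_item, update_item) are all hypothetical. It assumes DynamoDB's documented behaviour that PutItem replaces the whole item while UpdateItem only touches the attributes it is given, and that the TTL attribute is a number in epoch seconds. Whether .option('update', 'true') actually switches this connector to UpdateItem is an assumption to check against the project README.

```python
from typing import Any, Dict

Item = Dict[str, Any]
Table = Dict[str, Item]  # toy in-memory table keyed by a hypothetical hash key "id"

SECONDS_PER_DAY = 86_400


def ttl_epoch(now_epoch: int, days: int) -> int:
    """Expiry time as epoch seconds, the format DynamoDB TTL expects."""
    return now_epoch + days * SECONDS_PER_DAY


def put_item(table: Table, item: Item) -> None:
    """PutItem-style write: the stored item is replaced wholesale."""
    table[item["id"]] = dict(item)


def update_item(table: Table, item: Item) -> None:
    """UpdateItem-style write: supplied attributes are set, others are kept."""
    table.setdefault(item["id"], {}).update(item)


now = 1_700_000_000  # fixed "current time" so the example is deterministic
original = {"id": "a", "payload": "x", "extra": "old", "ttl": ttl_epoch(now, 1)}
# Same key and payload, but a later TTL and no "extra" attribute.
refreshed = {"id": "a", "payload": "x", "ttl": ttl_epoch(now, 7)}

put_table: Table = {"a": dict(original)}
update_table: Table = {"a": dict(original)}

put_item(put_table, refreshed)        # "extra" disappears, TTL is extended
update_item(update_table, refreshed)  # "extra" survives, TTL is extended
```

In both cases the TTL is extended, which matches the observation that plain append mode is enough when every row carries all of its fields. The two only differ when the DataFrame omits attributes that already exist on the stored item. As for the error message, "batch mode" there appears to be Spark's own batch-versus-streaming wording from its capability check on overwrite, not a reference to DynamoDB's BatchWriteItem.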