apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[BUG] MOR Table Hard Deletes Create issue with Athena Querying RT Tables #7430

Closed: soumilshah1995 closed this issue 1 year ago

soumilshah1995 commented 1 year ago

Hello all, hope you are doing well. I want to report this as a bug on MOR tables with Apache Hudi. Here is what happened: I created an Apache Hudi MOR table and performed appends and updates, and things worked perfectly. I saw two tables created, RO (read optimized) and RT (real time), the latter having all the latest commits, and I was able to query both tables using Athena. Things broke when I started performing hard deletes. I inserted some sample data into the data lake (emp 0, 1, 3, 4, 5, 6, 7) and then deleted emp_4. The delete was successful, but I am no longer able to query my RT table using Athena; it shows the following error.

Test 1 (NO DELETES DONE SO FAR): APPEND AND UPDATES

[screenshots]

Deletes

[screenshot]

Code

try:
    import os
    import sys
    import uuid

    import pyspark
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, asc, desc
    from awsglue.utils import getResolvedOptions
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.context import GlueContext

    from faker import Faker

    print("All modules are loaded .....")

except Exception as e:
    print("Some modules are missing {} ".format(e))

# ----------------------------------------------------------------------------------------
#                 Settings
# -----------------------------------------------------------------------------------------

database_name1 = "hudidb"
table_name = "hudi_table"
base_s3_path = "s3a://glue-learn-begineers"
final_base_path = "{base_s3_path}/{table_name}".format(
    base_s3_path=base_s3_path, table_name=table_name
)

# ----------------------------------------------------------------------------------------------------
global faker
faker = Faker()

class DataGenerator(object):

    @staticmethod
    def get_data():
        return [
            (
                x,
                faker.name(),
                faker.random_element(elements=('IT', 'HR', 'Sales', 'Marketing')),
                faker.random_element(elements=('CA', 'NY', 'TX', 'FL', 'IL', 'RJ')),
                faker.random_int(min=10000, max=150000),
                faker.random_int(min=18, max=60),
                faker.random_int(min=0, max=100000),
                faker.unix_time()
            ) for x in range(5)
        ]

def create_spark_session():
    spark = SparkSession \
        .builder \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .getOrCreate()
    return spark

spark = create_spark_session()
sc = spark.sparkContext
glueContext = GlueContext(sc)

"""
CHOOSE ONE 
"hoodie.datasource.write.storage.type": "MERGE_ON_READ",
"hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
"""

hudi_options = {
    'hoodie.table.name': table_name,
    "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
    'hoodie.datasource.write.recordkey.field': 'emp_id',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',

    'hoodie.datasource.hive_sync.enable': 'true',
    "hoodie.datasource.hive_sync.mode":"hms",
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': database_name1,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',

}

# ====================================================
"""Create Spark Data Frame """
# ====================================================
# data = DataGenerator.get_data()
#
# columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
# df = spark.createDataFrame(data=data, schema=columns)
# df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)

# ====================================================
"""APPEND """
# ====================================================

# impleDataUpd = [
#     (6, "This is APPEND", "Sales", "RJ", 81000, 30, 23000, 827307999),
#     (7, "This is APPEND", "Engineering", "RJ", 79000, 53, 15000, 1627694678),
# ]
#
# columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
# usr_up_df = spark.createDataFrame(data=impleDataUpd, schema=columns)
# usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)
#

# ====================================================
"""UPDATE """
# ====================================================
# impleDataUpd = [
#     (3, "this is update on data lake", "Sales", "RJ", 81000, 30, 23000, 827307999),
# ]
# columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
# usr_up_df = spark.createDataFrame(data=impleDataUpd, schema=columns)
# usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)
#

# ====================================================
"""HARD DELETE """
# ====================================================

hudi_hard_delete_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'emp_id',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'delete',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

print("\n")
hard_delete_df = spark.sql("SELECT * FROM hudidb.hudi_table_rt where emp_id='4' ")
hard_delete_df.show()  # show() prints the rows itself; wrapping it in print() would just print None
print("\n")
hard_delete_df.write.format("hudi").options(**hudi_hard_delete_options).mode("append").save(final_base_path)
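As a sanity check on the Spark side, here is a minimal sketch (not part of the original job; it reuses spark, col, and final_base_path from the script above) that reads the table back through the Hudi datasource to confirm the record is gone:

# Sketch: read the table back via the Hudi datasource and check that emp_id 4 is gone.
check_df = spark.read.format("hudi").load(final_base_path)
check_df.filter(col("emp_id") == 4).show()  # expect zero rows after the hard delete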

If I am doing something wrong, I am happy to change it. I think it is the hard deletes: after performing one, I can no longer query the data on Athena.

Error GENERIC_INTERNAL_ERROR: org/objenesis/strategy/InstantiatorStrategy

Glue Version 4

[screenshot]

Steps for how I configured the Glue job:

https://drive.google.com/file/d/1mkED3AUZBARsgeRCzk0K0XMvSyajo7mQ/view

codope commented 1 year ago

@soumilshah1995 Your steps look good to me. Do you have the full stacktrace? org/objenesis/strategy/InstantiatorStrategy is used during Kryo (de)serialization. It will happen in the path of reading/writing a delete block in a MOR table, but let's confirm through the stacktrace. If it's the usual ClassNotFound issue, then it isn't what I am suspecting.
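To see where that delete block lands on storage, here is a minimal sketch (assuming the Spark session and final_base_path from the reproduction script above): in a MOR table, a hard delete is written as a delete block inside a .log file rather than a new Parquet base file, so listing the table files after the delete should show a new log file.

# Sketch: recursively list files under the table path via the Hadoop FileSystem API.
path = sc._jvm.org.apache.hadoop.fs.Path(final_base_path)
fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
files = fs.listFiles(path, True)  # True = recursive
while files.hasNext():
    name = files.next().getPath().getName()
    if ".log." in name:
        print(name)  # log files carry the MOR delete block(s) until compaction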

soumilshah1995 commented 1 year ago

Well, I am not sure exactly which logs you need. The deletes and the job were successful, but something is breaking, due to which I cannot query the data.

# WARNING: Unable to get Instrumentation. Dynamic Attach failed. You may add this JAR as -javaagent manually, or supply -Djdk.attach.allowAttachSelf
# WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]
xushiyan commented 1 year ago

@lokeshj1703 let's try to reproduce this. @soumilshah1995 is this observed on the master version, or 0.12.1?

soumilshah1995 commented 1 year ago

Hi, the Glue version used is 4.0. Glue 4.0 natively supports Hudi; I am not aware of which Hudi version Glue uses behind the scenes. Here are the steps in a PDF: https://drive.google.com/file/d/1mkED3AUZBARsgeRCzk0K0XMvSyajo7mQ/view https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html
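For context, native Hudi support in Glue 4.0 is switched on through the --datalake-formats job parameter (per the AWS docs linked above). A minimal boto3 sketch of creating such a job follows; the job name, role ARN, and script location are placeholders:

# Sketch: create a Glue 4.0 job with native Hudi support enabled.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="hudi-mor-demo",                                # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",   # placeholder role ARN
    GlueVersion="4.0",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},  # placeholder
    DefaultArguments={"--datalake-formats": "hudi"},     # enables native Hudi support
)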

soumilshah1995 commented 1 year ago

Any updates on this issue?

soumilshah1995 commented 1 year ago

Any updates, @nsivabalan?

lokeshj1703 commented 1 year ago

@soumilshah1995 I am looking into the issue. Will be able to provide an update this week.

soumilshah1995 commented 1 year ago

> @soumilshah1995 I am looking into the issue. Will be able to provide an update this week.

Copy that :D

lokeshj1703 commented 1 year ago

@soumilshah1995 I tried out the steps you posted and ran into the same issue with Athena after hard deletes.

hard_delete_df = spark.sql("SELECT * FROM $DB.$TABLE_rt")
hard_delete_df.show()

But on executing the above query in the script, I was able to verify the results after the hard delete. The result below is from the output logs available after the script runs.

+-------------------+--------------------+------------------+----------------------+--------------------+------+--------------------+-----------+-----+------+---+-----+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|emp_id|       employee_name| department|state|salary|age|bonus|        ts|
+-------------------+--------------------+------------------+----------------------+--------------------+------+--------------------+-----------+-----+------+---+-----+----------+
|  20221220072524704|20221220072524704...|                 1|                      |1440cbb6-2a3d-44d...|     1|        Justin Potts|         HR|   FL| 29051| 50|23794| 182319089|
|  20221220072524704|20221220072524704...|                 9|                      |1440cbb6-2a3d-44d...|     9|      Jonathan Curry|      Sales|   CA| 30139| 38|32531| 553278391|
|  20221220072524704|20221220072524704...|                 5|                      |1440cbb6-2a3d-44d...|     5|      Crystal Bailey|         HR|   CA| 67670| 33|73686| 979521992|
|  20221220072524704|20221220072524704...|                 0|                      |1440cbb6-2a3d-44d...|     0|       Ronald Knight|         IT|   CA|144358| 40|37355| 638291101|
|  20221220072524704|20221220072524704...|                 8|                      |1440cbb6-2a3d-44d...|     8|       Stephen Hayes|         IT|   RJ| 94273| 60|24935| 187328660|
|  20221220072524704|20221220072524704...|                 7|                      |1440cbb6-2a3d-44d...|     7|      Victor Delgado|         HR|   CA| 53309| 56|16874| 629223298|
|  20221220072630012|20221220072630012...|                 3|                      |1440cbb6-2a3d-44d...|     3|this is update on...|      Sales|   RJ| 81000| 30|23000| 827307999|
|  20221220072524704|20221220072524704...|                 2|                      |1440cbb6-2a3d-44d...|     2|        Linda Hughes|      Sales|   RJ| 93822| 36|67099| 715415893|
|  20221220072524704|20221220072524704...|                 6|                      |1440cbb6-2a3d-44d...|     6|        Jerry Thomas|  Marketing|   TX| 72604| 42|61980| 211331663|
|  20221220072614391|20221220072614391...|                11|                      |1440cbb6-2a3d-44d...|    11|                 xxx|      Sales|   RJ| 81000| 30|23000| 827307999|
|  20221220072614391|20221220072614391...|                12|                      |1440cbb6-2a3d-44d...|    12|            x change|Engineering|   RJ| 79000| 53|15000|1627694678|
|  20221220072630012|20221220072630012...|                 3|                      |1440cbb6-2a3d-44d...|     3|this is update on...|      Sales|   RJ| 81000| 30|23000| 827307999|
+-------------------+--------------------+------------------+----------------------+--------------------+------+--------------------+-----------+-----+------+---+-----+----------+

It seems like the issue is with query run using Athena.

lokeshj1703 commented 1 year ago

@soumilshah1995 Can you check the Athena query issue with AWS support?

soumilshah1995 commented 1 year ago

> @soumilshah1995 Can you check the Athena query issue with AWS support?

Now that we know it is an issue, can we have someone resolve it? This is really important stuff: if people want to do hard deletes, then after performing a hard delete it breaks on MOR tables.

@nsivabalan @bhasudha it looks like others have also confirmed this issue.

lokeshj1703 commented 1 year ago

@soumilshah1995 AWS uses its own fork of Hudi and Presto for Athena.

jar tf hudi-presto-bundle-0.12.1.jar | grep -i InstantiatorStrategy
org/apache/hudi/com/esotericsoftware/kryo/Kryo$DefaultInstantiatorStrategy$1.class
org/apache/hudi/com/esotericsoftware/kryo/Kryo$DefaultInstantiatorStrategy$2.class
org/apache/hudi/com/esotericsoftware/kryo/Kryo$DefaultInstantiatorStrategy.class
org/apache/hudi/org/objenesis/strategy/InstantiatorStrategy.class
org/apache/hudi/org/objenesis/strategy/SingleInstantiatorStrategy.class
org/apache/hudi/org/objenesis/strategy/StdInstantiatorStrategy.class
org/apache/hudi/org/objenesis/strategy/BaseInstantiatorStrategy.class
org/apache/hudi/org/objenesis/strategy/SerializingInstantiatorStrategy.class

As you can see, the InstantiatorStrategy classes are shaded in the Hudi Presto bundle. But the error shown by Athena, GENERIC_INTERNAL_ERROR: org/objenesis/strategy/InstantiatorStrategy, refers to an unshaded class.
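One way to double-check that against the bundle itself is a small sketch like the following (the jar path is illustrative; download the bundle locally first):

# Sketch: list which objenesis classes the bundle actually ships.
import zipfile

with zipfile.ZipFile("hudi-presto-bundle-0.12.1.jar") as jar:
    names = jar.namelist()
print("unshaded:", [n for n in names if n.startswith("org/objenesis/")])
print("shaded:", [n for n in names if n.startswith("org/apache/hudi/org/objenesis/")])
# An empty "unshaded" list means a runtime reference to org/objenesis/strategy/...
# cannot be resolved from this bundle, matching the Athena error above.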

soumilshah1995 commented 1 year ago

So what would be the solution here?

soumilshah1995 commented 1 year ago

Any updates here?

lokeshj1703 commented 1 year ago

@soumilshah1995 We are still trying to root-cause this with AWS. Will update here.

soumilshah1995 commented 1 year ago

Roger that, captain :D By the way, I am a big-time Hudi lover ;D

soumilshah1995 commented 1 year ago

Any updates?

soumilshah1995 commented 1 year ago

Hi, do we have any updates for this task?

soumilshah1995 commented 1 year ago

Someone on YouTube asked me the same question as well.

[screenshot]

Do we have an update?

Witekkq commented 1 year ago

I have checked it with hudi-spark3.3-bundle_2.12-0.12.2.jar and the same problem appears, so the RT table is not usable anymore, while the row can still be read in the RO table.

soumilshah1995 commented 1 year ago

Any updates on this issue, @nsivabalan?

lokeshj1703 commented 1 year ago

@soumilshah1995 @Witekkq We are discussing this with the AWS team, and it may take some time to get to a resolution here.

yihua commented 1 year ago

Hi @soumilshah1995 would you mind creating an AWS support issue for this? That will accelerate the resolution from AWS Athena.

soumilshah1995 commented 1 year ago

Sure, I will tell my company's sysops to create a support ticket :D

yihua commented 1 year ago

> Sure, I will tell my company's sysops to create a support ticket :D

Appreciate that! Let us know the AWS support ticket number once it's filed. cc @umehrot2

xushiyan commented 1 year ago

> Sure, I will tell my company's sysops to create a support ticket :D

> Appreciate that! Let us know the AWS support ticket number once it's filed. cc @umehrot2

@soumilshah1995 do you have the support ticket number? cc @umehrot2

soumilshah1995 commented 1 year ago

Hi, I think we should ask @yihua; I assume he already has a ticket. I did send an email to an AWS rep, and they said they will look into it. I assume that if many people file tickets, then they can prioritize this issue.

rahil-c commented 1 year ago

@soumilshah1995 What was the version of the hudi table?

rahil-c commented 1 year ago

@soumilshah1995 What version of the Athena query engine are you using, and what version of Hudi is the table you wrote to? It seems that the latest version of Hudi that Athena is using is 0.10.1 for query engine v3. Can you try creating a Hudi table with 0.10.1 and make sure that the query engine Athena uses is v3? You can add a new workgroup with a v3 engine in the Athena console. If you need help, feel free to ping me on the Hudi Slack.
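If it helps, here is a minimal boto3 sketch of the workgroup step (the workgroup name and results bucket are placeholders; the console flow described above achieves the same thing):

# Sketch: create an Athena workgroup pinned to engine version 3.
import boto3

athena = boto3.client("athena")
athena.create_work_group(
    Name="hudi-engine-v3",  # placeholder workgroup name
    Configuration={
        "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"},
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
    },
)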

soumilshah1995 commented 1 year ago

I will retry this weekend and let you know if that works :D Stay tuned.

soumilshah1995 commented 1 year ago

Test Results

Based on Amazon's recommendation, I have changed to Athena engine 3.

[screenshots]

Attempting the hard delete now.

[screenshots]

Conclusion

Upgraded to Athena engine 3.

Hudi marketplace connector 0.10.1.