@n3nash @bvaradar
@soumilshah1995: This looks like the correct package is not added to the classpath. Can you check whether org.apache.hudi.keygen.SimpleKeyGenerator is present in the jar being passed?
Hi @soumilshah1995, the bottom exception complains that the property hoodie.datasource.write.partitionpath.field was not found. You need to specify the partition path field with --hoodie-conf hoodie.datasource.write.partitionpath.field=<partition_column>.
Does this look okay to you @bvaradar @yihua?
If I am missing anything, can you please point it out? I am new to DeltaStreamer :D
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--source-ordering-field replicadmstimestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3://sql-server-dms-demo/hudi/public/sales \
--target-table invoice \
--payload-class org.apache.hudi.common.model.AWSDmsAvroPayload \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
--hoodie-conf hoodie.datasource.write.recordkey.field=invoiceid \
--hoodie-conf hoodie.datasource.write.partitionpath.field=destinationstate \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://sql-server-dms-demo/raw/public/sales/
Please correct me if I'm doing something incorrectly. I've attached base files for your reference. I see hudi folders being created, but I don't see any base files (Parquet files) being created. Any idea why?
Also, I have just two files on S3, which I generated via DMS, and I am trying out DeltaStreamer.
Links to base files: https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link
@yihua @bvaradar I just fired the job.
As I was explaining, even after 10 minutes I don't see Hudi base files. I do see the metadata folder created by Hudi, but base files are not generated, and there are only two files on S3. It should not take this long, right? Am I doing something wrong here :D
Do you see any exceptions in the logs? Also, are you writing to this base path for the first time? Is your .hoodie folder empty? @soumilshah1995
@yihua @pratyakshsharma First of all, I want to thank you for taking time out of your busy schedules to assist me with this. Yes, I cancelled the job; I am performing this lab from my own AWS account and wanted to avoid charges. The job had been in a running state for the past 10 minutes and I could not see any base files. Yes, I am writing to the base path for the first time. Also, there are only two small files to process, so I assume it should not take this long.
If you can tell me what I am missing or how to resolve the error, that would be great. Please let me know if you need any further details; I am happy to provide them. Attaching a screenshot of S3 which shows the hudi folder was created but the base files were not.
Links to the parquet files: https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link
Even after running for 10 minutes I don't see base files.
I have stopped the job and tried to run it again; now I am seeing the following:
The issue has been resolved. Daniel Ford was very kind to help on Slack. I will post a detailed video about DeltaStreamer on my YouTube channel.
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
--master yarn \
--deploy-mode client \
--executor-memory 1g \
--driver-memory 2g \
/usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--op UPSERT \
--source-ordering-field replicadmstimestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3://sql-server-dms-demo/hudi/public/sales \
--target-table invoice \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
--hoodie-conf hoodie.datasource.write.recordkey.field=invoiceid \
--hoodie-conf hoodie.datasource.write.partitionpath.field=destinationstate \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://sql-server-dms-demo/raw/public/sales \
--hoodie-conf hoodie.datasource.write.precombine.field=replicadmstimestamp
Do you mind sharing what the issue was? @soumilshah1995
Umm, sure. I think I was missing some configs.
These are the configs that worked for me:
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
--master yarn \
--deploy-mode client \
--executor-memory 1g \
/usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--op UPSERT \
--enable-sync \
--source-ordering-field replicadmstimestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3://delta-streamer-demo-hudi/raw/public/sales \
--target-table invoice \
--payload-class org.apache.hudi.common.model.AWSDmsAvroPayload \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
--hoodie-conf hoodie.datasource.write.recordkey.field=invoiceid \
--hoodie-conf hoodie.datasource.write.partitionpath.field=destinationstate \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://delta-streamer-demo-hudi/raw/public/sales \
--hoodie-conf hoodie.datasource.write.precombine.field=replicadmstimestamp \
--hoodie-conf hoodie.database.name=hudidb_raw \
--hoodie-conf hoodie.datasource.hive_sync.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.database=hudidb_raw \
--hoodie-conf hoodie.datasource.hive_sync.table=tbl_invoices \
--hoodie-conf hoodie.datasource.hive_sync.partition_fields=destinationstate
@soumilshah1995 Just got to this again. So it looks like the preCombine field config (hoodie.datasource.write.precombine.field) was missing in your original configs, causing the job failure. If you have the driver logs, you should see an exception because of this.
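For context, the precombine field is what Hudi uses to decide which record wins when two incoming records share the same record key, which appears to be why the write failed without it here. A conceptual Python sketch of that behaviour (not Hudi's actual implementation), using hypothetical sample records:
# Conceptual only: among records with the same key, keep the one with the
# largest precombine value (here replicadmstimestamp).
records = [
    {"invoiceid": 101, "destinationstate": "NY", "replicadmstimestamp": "2023-03-30 10:00:00"},
    {"invoiceid": 101, "destinationstate": "NJ", "replicadmstimestamp": "2023-03-30 10:05:00"},
]
latest = {}
for r in records:
    key = r["invoiceid"]
    if key not in latest or r["replicadmstimestamp"] > latest[key]["replicadmstimestamp"]:
        latest[key] = r
print(latest[101])  # the 10:05 record wins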
You are right, I was missing that.
Video Tutorials
Part 1: Project Overview: https://www.youtube.com/watch?v=D9L0NLSqC1s
Part 2: Aurora Setup: https://youtu.be/HR5A6iGb4LE
Part 3: https://youtu.be/rnyj5gkIPKA
Part 4: https://youtu.be/J1xvPIcDIaQ
How to set up an EMR cluster with VPC: https://www.youtube.com/watch?v=-e1-Zsk17Ss&t=4s
PDF guide: https://drive.google.com/file/d/1Hj_gyZ8o-wFf4tqTYZXOHNJz2f3ARepn/view
rds.logical_replication 1
wal_sender_timeout 300000
Once done, apply these parameters to your database and reboot it.
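For reference, here is a minimal boto3 sketch of how these two parameters could be set on an Aurora PostgreSQL cluster parameter group; the parameter group name is a placeholder, and rds.logical_replication is a static parameter, so it only takes effect after the reboot.
import boto3

# Sketch: apply the logical-replication parameters to an Aurora PostgreSQL
# cluster parameter group. "my-aurora-postgres-params" is a placeholder name.
rds = boto3.client("rds")
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-aurora-postgres-params",
    Parameters=[
        # Static parameter: requires a reboot to take effect.
        {"ParameterName": "rds.logical_replication", "ParameterValue": "1", "ApplyMethod": "pending-reboot"},
        {"ParameterName": "wal_sender_timeout", "ParameterValue": "300000", "ApplyMethod": "pending-reboot"},
    ],
)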
Run python run.py
try:
import os
import logging
from functools import wraps
from abc import ABC, abstractmethod
from enum import Enum
from logging import StreamHandler
import uuid
from random import randint
import datetime  # used below as datetime.date(...)
import sqlalchemy as db
from faker import Faker
import random
import psycopg2
import psycopg2.extras as extras
from dotenv import load_dotenv
load_dotenv(".env")
except Exception as e:
raise Exception("Error: {} ".format(e))
class Logging:
"""
This class is used for logging data to Datadog and to the console.
"""
def __init__(self, service_name, ddsource, logger_name="demoapp"):
self.service_name = service_name
self.ddsource = ddsource
self.logger_name = logger_name
format = "[%(asctime)s] %(name)s %(levelname)s %(message)s"
self.logger = logging.getLogger(self.logger_name)
formatter = logging.Formatter(format, )
if logging.getLogger().hasHandlers():
logging.getLogger().setLevel(logging.INFO)
else:
logging.basicConfig(level=logging.INFO)
global logger
logger = Logging(service_name="database-common-module", ddsource="database-common-module",
logger_name="database-common-module")
def error_handling_with_logging(argument=None):
def real_decorator(function):
@wraps(function)
def wrapper(self, *args, **kwargs):
function_name = function.__name__
response = None
try:
if kwargs == {}:
response = function(self)
else:
response = function(self, **kwargs)
except Exception as e:
response = {
"status": -1,
"error": {"message": str(e), "function_name": function.__name__},
}
logger.logger.info(response)
return response
return wrapper
return real_decorator
class DatabaseInterface(ABC):
@abstractmethod
def get_data(self, query):
"""
For given query fetch the data
:param query: Str
:return: Dict
"""
def execute_non_query(self, query):
"""
Inserts data into SQL Server
:param query: Str
:return: Dict
"""
def insert_many(self, query, data):
"""
Insert Many items into database
:param query: str
:param data: tuple
:return: Dict
"""
def get_data_batch(self, batch_size=10, query=""):
"""
Gets data into batches
:param batch_size: INT
:param query: STR
:return: DICT
"""
def get_table(self, table_name=""):
"""
Gets the table from database
:param table_name: STR
:return: OBJECT
"""
class Settings(object):
"""settings class"""
def __init__(
self,
port="",
server="",
username="",
password="",
timeout=100,
database_name="",
connection_string="",
collection_name="",
**kwargs,
):
self.port = port
self.server = server
self.username = username
self.password = password
self.timeout = timeout
self.database_name = database_name
self.connection_string = connection_string
self.collection_name = collection_name
class DatabaseAurora(DatabaseInterface):
"""Aurora database class"""
def __init__(self, data_base_settings):
self.data_base_settings = data_base_settings
self.client = db.create_engine(
"postgresql://{username}:{password}@{server}:{port}/{database}".format(
username=self.data_base_settings.username,
password=self.data_base_settings.password,
server=self.data_base_settings.server,
port=self.data_base_settings.port,
database=self.data_base_settings.database_name
)
)
self.metadata = db.MetaData()
logger.logger.info("Auroradb connection established successfully.")
@error_handling_with_logging()
def get_data(self, query):
self.query = query
cursor = self.client.connect()
response = cursor.execute(self.query)
result = response.fetchall()
columns = list(response.keys())
data = [dict(zip(columns, item)) for item in result]
cursor.close()
return {"statusCode": 200, "data": data}
@error_handling_with_logging()
def execute_non_query(self, query):
self.query = query
cursor = self.client.connect()
cursor.execute(self.query)
cursor.close()
return {"statusCode": 200, "data": True}
@error_handling_with_logging()
def insert_many(self, query, data):
self.query = query
print(data)
cursor = self.client.connect()
cursor.execute(self.query, data)
cursor.close()
return {"statusCode": 200, "data": True}
@error_handling_with_logging()
def get_data_batch(self, batch_size=10, query=""):
self.query = query
cursor = self.client.connect()
response = cursor.execute(self.query)
columns = list(response.keys())
while True:
result = response.fetchmany(batch_size)
if not result:
break
else:
items = [dict(zip(columns, data)) for data in result]
yield items
@error_handling_with_logging()
def get_table(self, table_name=""):
table = db.Table(table_name, self.metadata,
autoload=True,
autoload_with=self.client)
return {"statusCode": 200, "table": table}
class DatabaseAuroraPycopg(DatabaseInterface):
"""Aurora database class"""
def __init__(self, data_base_settings):
self.data_base_settings = data_base_settings
self.client = psycopg2.connect(
host=self.data_base_settings.server,
port=self.data_base_settings.port,
database=self.data_base_settings.database_name,
user=self.data_base_settings.username,
password=self.data_base_settings.password,
)
@error_handling_with_logging()
def get_data(self, query):
self.query = query
cursor = self.client.cursor()
cursor.execute(self.query)
result = cursor.fetchall()
columns = [column[0] for column in cursor.description]
data = [dict(zip(columns, item)) for item in result]
cursor.close()
_ = {"statusCode": 200, "data": data}
return _
@error_handling_with_logging()
def execute(self, query, data):
self.query = query
cursor = self.client.cursor()
cursor.execute(self.query, data)
self.client.commit()
cursor.close()
return {"statusCode": 200, "data": True}
@error_handling_with_logging()
def get_data_batch(self, batch_size=10, query=""):
self.query = query
cursor = self.client.cursor()
cursor.execute(self.query)
columns = [column[0] for column in cursor.description]
while True:
result = cursor.fetchmany(batch_size)
if not result:
break
else:
items = [dict(zip(columns, data)) for data in result]
yield items
@error_handling_with_logging()
def insert_many(self, query, data):
self.query = query
cursor = self.client.cursor()
extras.execute_batch(cursor, self.query, data)
self.client.commit()
cursor.close()
return {"statusCode": 200, "data": True}
class Connector(Enum):
ON_AURORA_PYCOPG = DatabaseAurora(
data_base_settings=Settings(
port="5432",
server="XXXXXXXXX",
username="postgres",
password="postgres",
database_name="postgres",
)
)
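# Creates the public.sales table if needed, sets REPLICA IDENTITY FULL so DMS CDC captures
# full row images, and inserts 100 randomly generated orders.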
def main():
helper = Connector.ON_AURORA_PYCOPG.value
import time
states = ("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN",
"IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA",
"WA", "WV", "WI", "WY")
shipping_types = ("Free", "3-Day", "2-Day")
product_categories = ("Garden", "Kitchen", "Office", "Household")
referrals = ("Other", "Friend/Colleague", "Repeat Customer", "Online Ad")
try:
query = """
CREATE TABLE public.sales (
invoiceid INTEGER,
itemid INTEGER,
category TEXT,
price NUMERIC(10,2),
quantity INTEGER,
orderdate DATE,
destinationstate TEXT,
shippingtype TEXT,
referral TEXT
);
"""
helper.execute_non_query(query=query,)
time.sleep(2)
except Exception as e:
print("Error",e)
try:
query = """
ALTER TABLE public.sales REPLICA IDENTITY FULL
"""
helper.execute_non_query(query=query)
time.sleep(2)
except Exception as e:
pass
for i in range(0, 100):
item_id = random.randint(1, 100)
state = states[random.randint(0, len(states) - 1)]
shipping_type = shipping_types[random.randint(0, len(shipping_types) - 1)]
product_category = product_categories[random.randint(0, len(product_categories) - 1)]
quantity = random.randint(1, 4)
referral = referrals[random.randint(0, len(referrals) - 1)]
price = random.randint(1, 100)
order_date = datetime.date(2016, random.randint(1, 12), random.randint(1, 28)).isoformat()
invoiceid = random.randint(1, 20000)
data_order = (invoiceid, item_id, product_category, price, quantity, order_date, state, shipping_type, referral)
query = """INSERT INTO public.sales
(
invoiceid, itemid, category, price, quantity, orderdate, destinationstate,shippingtype, referral
)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"""
helper.insert_many(query=query, data=data_order)
main()
{ "CsvRowDelimiter": "\n", "CsvDelimiter": ",", "BucketFolder": "raw", "BucketName": "XXXXXXXXXXXXX", "CompressionType": "NONE", "DataFormat": "parquet", "EnableStatistics": true, "IncludeOpForFullLoad": true, "TimestampColumnName": "replicadmstimestamp", "DatePartitionEnabled": false }
#### Note: add this as well in the Extra connection attributes
![image](https://user-images.githubusercontent.com/39345855/228972148-10726c19-678b-4d77-a607-77fd7eebb105.png)
parquetVersion=PARQUET_2_0;
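If you prefer to create the S3 target endpoint programmatically, here is a rough boto3 sketch using the same settings as above; the endpoint identifier, bucket name, and service-access role ARN are placeholders, and the parquetVersion extra connection attribute maps to ParquetVersion in S3Settings.
import boto3

# Sketch only: create the DMS S3 target endpoint with the settings shown above.
# EndpointIdentifier, BucketName, and ServiceAccessRoleArn are placeholders.
dms = boto3.client("dms")
dms.create_endpoint(
    EndpointIdentifier="dms-s3-target-sales",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        "BucketName": "XXXXXXXXXXXXX",
        "BucketFolder": "raw",
        "CsvRowDelimiter": "\n",
        "CsvDelimiter": ",",
        "CompressionType": "none",
        "DataFormat": "parquet",
        "ParquetVersion": "parquet-2-0",  # equivalent of parquetVersion=PARQUET_2_0
        "EnableStatistics": True,
        "IncludeOpForFullLoad": True,
        "TimestampColumnName": "replicadmstimestamp",
        "DatePartitionEnabled": False,
    },
)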
### Step 4: Create the task and add the following settings
{ "rules": [ { "rule-type": "selection", "rule-id": "861743510", "rule-name": "861743510", "object-locator": { "schema-name": "public", "table-name": "sales" }, "rule-action": "include", "filters": [] } ] }
# Create the EMR cluster and fire the DeltaStreamer job
* Note: you can download the parquet files from https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link
* These are sample data files generated by DMS; you can copy them directly into your S3 bucket for learning purposes.
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
--master yarn \
--deploy-mode client \
--executor-memory 1g \
/usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--op UPSERT \
--enable-sync \
--source-ordering-field replicadmstimestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3://delta-streamer-demo-hudi/hudi/public/invoice \
--target-table invoice \
--payload-class org.apache.hudi.common.model.AWSDmsAvroPayload \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
--hoodie-conf hoodie.datasource.write.recordkey.field=invoiceid \
--hoodie-conf hoodie.datasource.write.partitionpath.field=destinationstate \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://delta-streamer-demo-hudi/raw/public/sales \
--hoodie-conf hoodie.datasource.write.precombine.field=replicadmstimestamp \
--hoodie-conf hoodie.database.name=hudidb_raw \
--hoodie-conf hoodie.datasource.hive_sync.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.database=hudidb_raw \
--hoodie-conf hoodie.datasource.hive_sync.table=tbl_invoices \
--hoodie-conf hoodie.datasource.hive_sync.partition_fields=destinationstate
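For anyone who wants to sanity-check the record key, partition path, and precombine settings without spinning up DeltaStreamer, a rough PySpark sketch of an equivalent Hudi write (not the exact DeltaStreamer code path; Hive sync and the DMS payload class are omitted, and the S3 paths simply mirror the command above):
from pyspark.sql import SparkSession

# Sketch: write the DMS parquet files to the Hudi table using the same
# key/partition/precombine configs as the DeltaStreamer command above.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

df = spark.read.parquet("s3://delta-streamer-demo-hudi/raw/public/sales")

hudi_options = {
    "hoodie.table.name": "invoice",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "invoiceid",
    "hoodie.datasource.write.partitionpath.field": "destinationstate",
    "hoodie.datasource.write.precombine.field": "replicadmstimestamp",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://delta-streamer-demo-hudi/hudi/public/invoice"))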
Hello all, firstly, thank you very much for all the help from the community. I want to mention that I am new to DeltaStreamer; I have worked a lot with Glue jobs, and I want to experiment with DeltaStreamer so I can make videos and teach the community.
I have set up a complete pipeline from AWS Aurora Postgres > DMS > S3, and I have an EMR 6.9 cluster with Spark 3.
Attaching links to the sample parquet files and a sample of what the JSON looks like.
Link to data files: https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link
Here is how I submit the job:
Error I get:
References
https://cwiki.apache.org/confluence/display/HUDI/2020/01/20/Change+Capture+Using+AWS+Database+Migration+Service+and+Hudi
https://aws.amazon.com/blogs/big-data/apply-record-level-changes-from-relational-databases-to-amazon-s3-data-lake-using-apache-hudi-on-amazon-emr-and-aws-database-migration-service/
https://github.com/apache/hudi/issues/2406