Closed soumilshah1995 closed 10 months ago
In a previous test I used plain Spark with Parquet and launched the Thrift server with a similar technique, and it worked fine.
Thrift server
```bash
spark-submit \
  --master 'local[*]' \
  --conf spark.executor.extraJavaOptions=-Duser.timezone=Etc/UTC \
  --conf spark.eventLog.enabled=false \
  --conf spark.sql.warehouse.dir=file:///Users/soumilshah/Desktop/soumil/sparkwarehouse \
  --packages 'org.apache.spark:spark-sql_2.12:3.4.0,org.apache.spark:spark-hive_2.12:3.4.0' \
  --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
  --name "Thrift JDBC/ODBC Server" \
  --executor-memory 512m
```
This was using only Spark and Parquet:
```sql
{{ config(
    materialized='incremental',
    file_format='parquet',
    incremental_strategy='insert_overwrite',
) }}

SELECT
    date(d) AS id,
    d AS full_date,
    EXTRACT (YEAR FROM d) AS YEAR,
    EXTRACT (WEEK FROM d) AS year_week,
    EXTRACT (DOY FROM d) AS year_day,  -- DOY = day of year; DAY would return the day of the month
    EXTRACT (YEAR FROM d) AS fiscal_year,
    EXTRACT (QUARTER FROM d) AS fiscal_qtr,
    EXTRACT (MONTH FROM d) AS MONTH,
    date_format(d, 'MMMM') AS month_name,
    EXTRACT (DOW FROM d) AS week_day,
    date_format(d, 'EEEE') AS day_name,
    -- 1 for Monday to Friday, 0 for Saturday and Sunday
    (CASE WHEN date_format(d, 'EEEE') NOT IN ('Sunday', 'Saturday') THEN 1 ELSE 0 END) AS day_is_weekday
FROM (
    SELECT EXPLODE(months) AS d
    FROM (SELECT SEQUENCE (TO_DATE('2000-01-01'), TO_DATE('2023-01-01'), INTERVAL 1 DAY) AS months)
)
```
dbt_project.yml
```yaml
name: 'sparkdbt'
version: '1.0.0'
config-version: 2
profile: 'sparkdbt'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
clean-targets:
  - "target"
  - "dbt_packages"
models:
  sparkdbt:
    core:
      +enabled: true
      +materialized: table
      dim_date:
        +materialized: table
```
This works fine: I was able to see the table and query it using Beeline and DBeaver as well. I am seeing the issue only with Hudi tables, and I am not sure which config I am missing here.
I have also read the example provided at https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md
but could not get it to work for some reason.
@soumilshah1995 I have seen this error before. Can you put your Hudi bundle jars in Spark's `jars` folder and set the Spark configurations in `spark-defaults.conf`? At runtime, dbt does not pick up the dependencies and configs that you passed on the command line when starting the Thrift server.
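A minimal sketch of that suggestion, assuming `SPARK_HOME` points at the Spark install used for the Thrift server. Spark reads its defaults from `$SPARK_HOME/conf/spark-defaults.conf`, and the three properties below are the same ones passed as `--conf` flags elsewhere in this thread. `HUDI_BUNDLE_JAR` and the scratch-directory fallback are hypothetical names added only for illustration.

```bash
# Assumption: SPARK_HOME is the Spark install that runs the Thrift server.
# The fallback to a scratch directory only exists so the sketch can be dry-run.
SPARK_HOME="${SPARK_HOME:-$PWD/spark-home-sketch}"
mkdir -p "$SPARK_HOME/conf" "$SPARK_HOME/jars"

# Hypothetical variable: path to the Hudi Spark bundle jar you downloaded.
HUDI_BUNDLE_JAR="${HUDI_BUNDLE_JAR:-}"
if [ -n "$HUDI_BUNDLE_JAR" ] && [ -f "$HUDI_BUNDLE_JAR" ]; then
  cp "$HUDI_BUNDLE_JAR" "$SPARK_HOME/jars/"
fi

# Append the Hudi-related settings so every Spark session picks them up.
# Note: this appends, so running it twice duplicates the lines.
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.sql.extensions             org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.hudi.catalog.HoodieCatalog
EOF
```

After this, the Thrift server would need a restart so that it starts with the new defaults. Whether this alone resolves the `dbt run` failure is not confirmed in this thread.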
I explored two pathways in addressing this challenge. The first route involved initiating the process using pure Hive and Spark SQL, coupled with Thrift Server. However, when attempting to run dbt, I encountered a specific issue.
The second route, a more intricate and time-consuming approach, required the installation of Apache Derby and Spark. Despite my efforts, an odd complication arose: when executing "dbt run," it led to the unexpected crash of both Thrift Server and Apache Derby.
```bash
# Create virtual environment for DBT
python -m venv dbt-env
source dbt-env/bin/activate
# Install required packages
pip install dbt-core
pip install dbt-spark
pip install 'dbt-spark[PyHive]'
# Navigate to DBT directory
cd ~/.dbt/
# Set Java environment variable
export JAVA_HOME=/opt/homebrew/Cellar/openjdk@11/11.0.21/libexec/openjdk.jdk/Contents/Home
```
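For reference, `~/.dbt/` is where dbt looks for `profiles.yml` by default. The actual profile used here is not shown in the thread, so the following is only a sketch of what a Thrift-based `sparkdbt` profile could look like. The host, port, schema, and thread count are assumptions (a local Thrift server on the default HiveServer2 port 10000).

```bash
# Assumption: dbt reads profiles from ~/.dbt unless DBT_PROFILES_DIR is set.
PROFILES_DIR="${DBT_PROFILES_DIR:-$HOME/.dbt}"
mkdir -p "$PROFILES_DIR"

# Only write the file if none exists, so an existing profile is not clobbered.
if [ ! -f "$PROFILES_DIR/profiles.yml" ]; then
cat > "$PROFILES_DIR/profiles.yml" <<'EOF'
sparkdbt:            # must match `profile:` in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost   # assumed: Thrift server runs on this machine
      port: 10000       # assumed: default HiveServer2 port
      schema: default   # assumed schema name
      threads: 1
EOF
fi
```

With the Thrift server running, `dbt debug` should then report whether the connection works.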
# Step 2: Download and Run Apache Derby
```bash
export DERBY_VERSION=10.14.2.0
# Download into the current directory (-P is a wget option, not a curl one)
curl -O https://archive.apache.org/dist/db/derby/db-derby-$DERBY_VERSION/db-derby-$DERBY_VERSION-bin.tar.gz
tar -xf db-derby-$DERBY_VERSION-bin.tar.gz
export DERBY_HOME=/Users/soumilshah/Desktop/soumil/dbt/db-derby-10.14.2.0-bin
echo $DERBY_HOME
# Clean up the downloaded archive
rm db-derby-$DERBY_VERSION-bin.tar.gz
# Start the Derby network server
$DERBY_HOME/bin/startNetworkServer -h localhost
```
# Step 3: Install Apache Spark
```bash
# Specify Spark version
export SPARK_VERSION=3.2.3
# Download and extract Spark
curl -O https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz
tar -xf spark-$SPARK_VERSION-bin-hadoop2.7.tgz
# Set Spark home
export SPARK_HOME=/Users/soumilshah/Desktop/soumil/dbt/spark-3.2.3-bin-hadoop2.7
echo $SPARK_HOME
# Clean up downloaded files
rm spark-3.2.3-bin-hadoop2.7.tgz
```
# Step 4: Copy JAR files
```bash
# Copy JAR files to Spark JARS directory
cp /Users/soumilshah/Desktop/myjar/*.jar $SPARK_HOME/jars/
```
# Step 5: Spark Submit Configuration
```bash
# Submit Spark job
spark-submit \
  --master 'local[*]' \
  --conf spark.executor.extraJavaOptions=-Duser.timezone=Etc/UTC \
  --conf spark.eventLog.enabled=false \
  --conf spark.sql.warehouse.dir=file:///Users/soumilshah/Desktop/soumil/dbt \
  --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
  --packages 'org.apache.spark:spark-sql_2.12:3.2.3,org.apache.spark:spark-hive_2.12:3.2.3,org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0' \
  --name "Thrift JDBC/ODBC Server" \
  --executor-memory 5g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf hive.metastore.warehouse.dir=/Users/soumilshah/Desktop/soumil/dbt \
  --conf hive.metastore.schema.verification=false \
  --conf datanucleus.schema.autoCreateAll=true \
  --conf javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver \
  --conf 'javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/MyDatabase;create=true'
```
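Since the Thrift server takes a while to come up, it can help to confirm that it is accepting connections before running dbt against it. The helper below is a hypothetical sketch, not part of Spark or dbt. It assumes bash (it relies on bash's `/dev/tcp` redirection) and that the server listens on the default HiveServer2 port 10000.

```bash
# wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: polls once per second until a TCP connection succeeds.
# Returns 0 when the port accepts connections, 1 if the timeout expires first.
wait_for_port() {
  local host=$1 port=$2 timeout=${3:-60} elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # bash-only: opening /dev/tcp/HOST/PORT succeeds once something is listening
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

A possible use would be `wait_for_port localhost 10000 120 && dbt debug`, followed by `dbt run` once the connection check passes. This only confirms the port is open; it does not explain why the server crashes during `dbt run`.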
Question: why do I have to use Apache Derby? When I simply use Spark SQL and the Hive Thrift server, I am able to create Hudi tables through Beeline, so why does it fail on `dbt run`?
I will be creating YouTube videos on this, which should help everyone.
Hey community,
I hope you're doing well. I recently launched a Thrift server using Spark, incorporating the Hudi library. The server runs smoothly, and I can interact with it using Beeline to query data successfully.
BEELINE
Works fine
Inserted data
DBT debug
Directory
schema.yml
hudi_insert_overwrite_table.sql
dbt_project.yml
DBT run
Any insights or guidance on resolving this issue would be greatly appreciated! If you have any experience with integrating Hudi into Spark Thrift Server and overcoming similar challenges, your expertise would be invaluable.
Thanks in advance for your help!
Regards Soumil