delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

Help to provide an example to show how to store images #1012

Closed: journey-wang closed this issue 2 years ago

journey-wang commented 2 years ago

Hello Everyone,

Could you show a full Python example of how to store many small images in a Delta table, and how to read them back out?

Best regards.

jodwyer commented 2 years ago

@journey-wang, I will put together an example around this!

journey-wang commented 2 years ago

@jodwyer Thanks very much for your help. To make it friendly for newcomers, please also show how to run the example .py file on their own PC.

jodwyer commented 2 years ago

Yes, thanks, that's exactly what I was thinking!

journey-wang commented 2 years ago

Hi @jodwyer, any update? Best regards.

jodb commented 2 years ago

@journey-wang I haven't had a chance to work on this yet, but I plan to this week.

jodwyer commented 2 years ago

@journey-wang here is an example I put together. I'll put something more formal together later, but I want to get the example to you first to see if it's what you are looking for:

```python
import pyspark
import pyspark.sql.functions as fn
from delta import configure_spark_with_delta_pip

# Build a Spark session with the Delta Lake SQL extension and catalog enabled
builder = (
    pyspark.sql.SparkSession.builder
    .appName("quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Flowers dataset from the TensorFlow team - https://www.tensorflow.org/datasets/catalog/tf_flowers
imagePath = "/path/to/flower_photos/"
deltaPath = "/path/to/write/flower_photos_delta_table/"

# Read every .jpg under imagePath (including subfolders) as binary files.
# Each row has the columns: path, modificationTime, length, content
images = (
    spark.read.format("binaryFile")
    .option("recursiveFileLookup", "true")
    .option("pathGlobFilter", "*.jpg")
    .load(imagePath)
)

# The path ends in <flowerType>/<filename>, so extract both parts with substring_index.
# Spark DataFrames are immutable: each withColumn returns a new DataFrame,
# which we rebind to the same name `images`.
images = images.withColumn("flowerType_filename", fn.substring_index(images.path, "/", -2))
images = images.withColumn("flowerType", fn.substring_index(images.flowerType_filename, "/", 1))
images = images.withColumn("filename", fn.substring_index(images.flowerType_filename, "/", -1))
images = images.drop("flowerType_filename")
images.show()

# Select the columns we want to write out and spread them over 4 partitions
df = images.select("path", "content", "flowerType", "filename").repartition(4)
df.show()

# Write the Delta table to deltaPath; "overwrite" replaces any table already there
df.write.format("delta").mode("overwrite").save(deltaPath)

# Read back the Delta table that was just written
dfDelta = spark.read.format("delta").load(deltaPath)
dfDelta.show()
```

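The example above stops at loading the table, so here is a minimal sketch of getting the images back out as files. It assumes the `flowerType`, `filename`, and `content` columns written above. The helper name `export_images` and the output directory are hypothetical; they are not part of Delta Lake or of the PR. The helper only indexes rows by column name, so it works on collected Spark `Row` objects and on plain dicts alike.

```python
import os
import tempfile


def export_images(rows, out_dir):
    """Write each row's binary `content` to out_dir/<flowerType>/<filename>.

    `rows` can be collected Spark Row objects or dicts; both support row["col"].
    Returns the list of file paths that were written.
    """
    written = []
    for row in rows:
        target_dir = os.path.join(out_dir, row["flowerType"])
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, row["filename"])
        with open(target, "wb") as f:
            # Spark returns binary columns as bytearray; bytes() accepts both
            f.write(bytes(row["content"]))
        written.append(target)
    return written


# With the Delta table from the example above (hypothetical output path):
#   rows = dfDelta.select("flowerType", "filename", "content").limit(10).collect()
#   export_images(rows, "/path/to/exported_images")
# collect() pulls the rows onto the driver, so keep the limit small for large tables.

if __name__ == "__main__":
    # Stand-in row so the helper can be tried without a Spark session
    demo_rows = [{"flowerType": "roses", "filename": "demo.jpg", "content": bytearray(b"\xff\xd8\xff")}]
    print(export_images(demo_rows, tempfile.mkdtemp()))
```

To try the whole thing on a local machine, the usual route is to install the `pyspark` and `delta-spark` packages with pip (in mutually compatible versions, with a Java runtime available) and run the script with `python`. Treat that as an assumption about your setup and check it against the Delta Lake quickstart.
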
jodwyer commented 2 years ago

This issue is resolved in PR #1067

journey-wang commented 2 years ago

Thanks very much @jodwyer

jodb commented 2 years ago

@journey-wang glad to help! We are working on the PR to get this example in the codebase and it should be there soon. Have a great week!