delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Apache License 2.0

Help to provide an example to show how to store images #1012

Closed journey-wang closed 2 years ago

journey-wang commented 2 years ago

Hello Everyone,

Could you show a full Python example of how to store many small pictures in a Delta table, and how to read them back out?

Best regards.

jodwyer commented 2 years ago

@journey-wang, I will put together an example around this!

journey-wang commented 2 years ago

@jodwyer Thanks very much for your help. To make it friendly for newcomers, please also show how to run the example .py file on their own PC.

jodwyer commented 2 years ago

Yes, thanks, that's exactly what I was thinking!

journey-wang commented 2 years ago

Hi @jodwyer, any update? Best regards.

jodb commented 2 years ago

@journey-wang I haven't had a chance to work on this yet but I plan to this week.

jodwyer commented 2 years ago

@journey-wang here is an example I put together. I'll put something more formal together later, but I want to get the example to you first to see if it's what you are looking for:

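# To run locally: pip install delta-spark (Java is also required), save this as a .py file, and run it with python
import pyspark
import pyspark.sql.functions as fn
from delta import configure_spark_with_delta_pip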

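builder = (
    pyspark.sql.SparkSession.builder
    # Standard Delta Lake session settings, as given in the Delta Lake quickstart
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
)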

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Flowers dataset from the TensorFlow team
imagePath = "/path/to/flower_photos/"
deltaPath = "/path/to/write/flower_photos_delta_table/"

# read the images from the flowers dataset
images = (
    spark.read.format("binaryFile")
    .option("recursiveFileLookup", "true")
    .option("pathGlobFilter", "*.jpg")
    .load(imagePath)
)

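# Knowing the file path, extract the flower type and filename using substring_index,
# e.g. a path ending in "roses/123.jpg" (hypothetical file) gives flowerType "roses" and filename "123.jpg".
# Spark DataFrames are immutable: each withColumn returns a new DataFrame, which we rebind to the name `images`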
images = images.withColumn("flowerType_filename", fn.substring_index(images.path, "/", -2))
images = images.withColumn("flowerType", fn.substring_index(images.flowerType_filename, "/", 1))
images = images.withColumn("filename", fn.substring_index(images.flowerType_filename, "/", -1))
images = images.drop("flowerType_filename")

# Select the columns we want to write out
df = images.select("path", "content", "flowerType", "filename").repartition(4)

# Write out the delta table to the given path, this will overwrite any table that is currently there
df.write.format("delta").mode("overwrite").save(deltaPath)

# Reads the delta table that was just written
dfDelta = spark.read.format("delta").load(deltaPath)
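
To cover the "read them back out" part of the question, here is a minimal sketch of turning the stored bytes back into image files. It is not part of the example above: `export_images` and the output directory are hypothetical names, and it only assumes each record exposes the `flowerType`, `filename`, and `content` columns written to the table.

```python
from pathlib import Path


def export_images(rows, out_dir):
    """Write each record's image bytes to out_dir/<flowerType>/<filename>.

    `rows` is any iterable of records that support row["column"] lookups,
    such as the pyspark Row objects yielded by DataFrame.toLocalIterator().
    Returns the list of paths that were written.
    """
    written = []
    for row in rows:
        target_dir = Path(out_dir) / row["flowerType"]
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / row["filename"]
        # The binaryFile source keeps the raw file bytes in the "content" column
        target.write_bytes(bytes(row["content"]))
        written.append(target)
    return written


# Hypothetical usage with the table read above (the export path is a placeholder):
# export_images(
#     dfDelta.select("flowerType", "filename", "content").toLocalIterator(),
#     "/path/to/export/",
# )
```

Streaming rows through `toLocalIterator()` rather than `collect()` avoids pulling every image into driver memory at once, which seems the safer default when the table holds many small pictures.
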

jodwyer commented 2 years ago

This issue is resolved in PR #1067

journey-wang commented 2 years ago

Thanks very much @jodwyer

jodb commented 2 years ago

@journey-wang glad to help! We are working on the PR to get this example in the codebase and it should be there soon. Have a great week!