delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Cannot write to Minio with deltalake.write_deltalake or Polars #2894

Open rwhaling opened 2 hours ago

rwhaling commented 2 hours ago

Environment

Delta-rs version: 0.20.0

Binding: Python

Environment:


Bug

What happened: Running MinIO locally via docker-compose (.yml spec below), I attempted to write a 20-row PyArrow table via the write_deltalake function and got the opaque error message:

Generic S3 error: Error after 0 retries in 71.583µs, max_retries:10, retry_timeout:180s, source:builder error for url (http://localhost:9000/test-bucket/test_delta_table/_delta_log/_last_checkpoint)

I attempted to write a 20-row pandas dataframe via the Polars write_delta function as well, and got the exact same error:

Generic S3 error: Error after 0 retries in 71.583µs, max_retries:10, retry_timeout:180s, source:builder error for url (http://localhost:9000/test-bucket/test_delta_table/_delta_log/_last_checkpoint)

What you expected to happen: I expected to be able to write tables out to Minio via S3. I have tested that I can write to Minio just fine with boto3. I'm happy to do more footwork chasing this down, turning up logging, or reproducing it deeper in the stack if someone can point me in the right direction!

How to reproduce it:

import boto3
import random
import string
import pyarrow as pa
from deltalake import write_deltalake, DeltaTable

# Configuration
endpoint_url = 'http://localhost:9000'
access_key = 'minioadmin'
secret_key = 'minioadmin'
bucket_name = 'test-bucket'
table_name = 'test_delta_table'
num_rows = 10

# Generate random string function
def generate_random_string(length=5):
    return ''.join(random.choices(string.ascii_lowercase, k=length))

# Generate data
keys = [generate_random_string() for _ in range(num_rows)]
values = [generate_random_string() for _ in range(num_rows)]

# Create PyArrow table
table = pa.table([keys, values], names=['key', 'value'])

table_path = f"s3://{bucket_name}/{table_name}"

print(f"Writing Delta table to: {table_path}")

storage_options = {
    "AWS_ACCESS_KEY_ID": access_key,
    "AWS_SECRET_ACCESS_KEY": secret_key,
    "AWS_ENDPOINT_URL": endpoint_url,
    "AWS_REGION": "us-east-1",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true"
}

try:
    # Check if MinIO is accessible
    s3 = boto3.client('s3', endpoint_url=endpoint_url)
    s3.list_buckets()
    print("Successfully connected to MinIO")

    # Check if the bucket exists
    buckets = s3.list_buckets()['Buckets']
    if not any(bucket['Name'] == bucket_name for bucket in buckets):
        print(f"Bucket {bucket_name} does not exist. Creating it...")
        s3.create_bucket(Bucket=bucket_name)

    # Write to S3
    write_deltalake(
        table_path,
        table,
        mode="overwrite",
        storage_options=storage_options
    )
    print(f"Successfully wrote Delta table to {table_path}")

    # Read and print the table metadata
    dt = DeltaTable(table_path, storage_options=storage_options)
    print(f"Table metadata:\n{dt.metadata()}")
    print(f"Table schema:\n{dt.schema().json()}")
    print(f"Table version: {dt.version()}")

except Exception as e:
    print(f"Error writing Delta table: {e}")

More details: docker-compose.yml:

version: '3.8'

services:
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/data
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

volumes:
  minio_data:
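
As a quick sanity check, here is a hypothetical snippet (reusing the liveness endpoint from the compose healthcheck above) that confirms the container is reachable over plain HTTP before involving boto3 or deltalake:

import urllib.request

# Same liveness endpoint the docker-compose healthcheck uses.
with urllib.request.urlopen("http://localhost:9000/minio/health/live") as resp:
    print(resp.status)  # 200 means MinIO is up and serving HTTP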

ion-elgreco commented 2 hours ago

@rwhaling and this worked in 0.19.x?

rwhaling commented 2 hours ago

@ion-elgreco No idea, doing this for the first time. I can try with 0.19.

rtyler commented 1 hour ago

Thank you for the reproduction case! With a fresh environment I am consistently getting "Unable to locate credentials". The problem is coming from boto3.

My guess is that you may have environment variables set that boto3 is picking up, which are different from what is being passed as storage_options into deltalake.
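
One quick way to check is a hypothetical diagnostic snippet that prints whatever ambient AWS_* variables are set before calling boto3 or write_deltalake:

import os

# boto3 falls back to ambient AWS_* variables (and ~/.aws/credentials),
# which may differ from the values passed explicitly to deltalake via
# storage_options.
aws_env = {k: v for k, v in os.environ.items() if k.startswith("AWS_")}
print(aws_env)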

rwhaling commented 1 hour ago

Thank you! I seem to get the same thing on 0.19.2 as well. Let me check out those environment vars. (Yes, I did have the AWS env vars set as well, apologies)

And just so I understand: is write_deltalake using boto3 internally? Is there a way for me to turn up the logging?

rtyler commented 1 hour ago

s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url)

That gets the repro case to the error message you describe.

rtyler commented 1 hour ago

@rwhaling don't worry about trying to reproduce this on older versions; I found the error :smile: It exists going back many versions!

This was a good Sunday morning brain exercise!

The problem here is that the stack is expecting TLS communication. Add AWS_ALLOW_HTTP as "true" to the storage_options and you'll be sorted!
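
For reference, here is a minimal sketch of the working write, reusing the values from the repro script above; the only new entry is AWS_ALLOW_HTTP:

import pyarrow as pa
from deltalake import write_deltalake

storage_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",
    "AWS_REGION": "us-east-1",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    # The fix: allow plain-HTTP endpoints such as the local MinIO container.
    "AWS_ALLOW_HTTP": "true",
}

# A tiny stand-in for the repro's PyArrow table.
table = pa.table({"key": ["a", "b"], "value": ["x", "y"]})

write_deltalake(
    "s3://test-bucket/test_delta_table",
    table,
    mode="overwrite",
    storage_options=storage_options,
)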

If you're feeling extra thankful, I would love a pull request to update any relevant documentation in the docs/ directory which would have helped you here :pray:

rwhaling commented 1 hour ago

Bingo, it works! I love writing doc PRs, would be happy to - and thank y'all for this great project!