fusedio / udfs

Public Fused UDFs. Build any scale workflows with the Fused Python SDK and Workbench webapp, and integrate them into your stack with the Fused Hosted API.
https://www.fused.io
MIT License

AWS credentials as environment variables not working as expected #191

Open · rabernat opened this issue 3 months ago

rabernat commented 3 months ago

I'm trying to load private data from S3 in a fused UDF, and I want to make sure I'm doing it the "right" way.

I'm trying to follow the instructions at https://docs.fused.io/basics/utilities/#environment-variables. In one UDF, I've got this:

import fused

env_vars = """
AWS_ACCESS_KEY_ID=AK...
AWS_SECRET_ACCESS_KEY=Gt...
"""

# Path to your .env file on the shared cache disk
env_file_path = '/mnt/cache/.env'

@fused.udf
def udf(bbox=None, n=10):
    # Write the environment variables to the .env file
    with open(env_file_path, 'w') as file:
        file.write(env_vars)

In the second UDF, I've got this:

import fused

@fused.udf
def udf():
    import os

    import boto3
    from dotenv import load_dotenv

    # Load environment variables from the shared .env file
    env_file_path = '/mnt/cache/.env'
    load_dotenv(env_file_path, override=True)

    # these are being set correctly
    assert os.environ['AWS_ACCESS_KEY_ID'] == 'AK...'
    assert os.environ['AWS_SECRET_ACCESS_KEY'] == 'Gt...'

    # doesn't work: the s3 credentials are not detected from the environment, and the call fails with
    # botocore.exceptions.ClientError: An error occurred (InvalidToken) when calling the GetObject operation: The provided token is malformed or otherwise invalid.
    # s3 = boto3.client('s3')

    # does work if I explicitly pass the credentials
    s3 = boto3.client(
        's3',
        aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
        aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
    )

    bucket = "arraylake-earthmover-production"
    key = "6462e90c27af040cabc066e8/chunks/0081af97634c03fc1c3fcd16b1f3c196558c15c096674f5a0052bf25479d0e8b.00000000000000000000000000000000"
    obj = s3.get_object(Bucket=bucket, Key=key)
    print(obj)

In most normal Python environments, boto3 automatically picks up credentials from these environment variables without them being passed explicitly (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#environment-variables). In the Fused UDF, however, this does not work: if I don't pass the credentials explicitly, I get the "The provided token is malformed or otherwise invalid" error.
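
For reference, here's a minimal sketch of the behavior I'd expect outside Fused (the credential values are the same placeholders as above, not real keys):

import os

import boto3

# With only these two variables set, boto3's default credential chain
# should pick them up without any explicit arguments:
os.environ['AWS_ACCESS_KEY_ID'] = 'AK...'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'Gt...'

s3 = boto3.client('s3')  # no credentials passed explicitly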

This is obviously not a huge problem; the workaround of explicitly passing the credentials is easy enough. But I thought I would open this issue to better understand what is going on here.

isaacbrodsky commented 3 months ago

I think this is due to our default credentials somehow conflicting with credentials loaded through dotenv. Thanks for reporting that a workaround was needed!
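
One way to check this (a sketch, assuming the runtime injects a session token alongside its default credentials):

import os

# If AWS_SESSION_TOKEN is already set by the runtime, boto3 will combine it
# with the key/secret loaded from the .env file, which would explain the
# InvalidToken error on GetObject.
for var in ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_SESSION_TOKEN'):
    print(var, '=', 'set' if var in os.environ else 'unset')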

rabernat commented 2 months ago

What are the "default credentials"? Are you talking about the AWS credentials that are already associated with the environment?

FWIW, I experienced basically the same problem with our Arraylake token environment variable, which couldn't possibly be part of your default credentials.

pgzmnk commented 2 months ago

That's correct. Fused environments have a set of credentials associated with them by default. It would indeed make sense to use different variable names to avoid conflicts.
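
As a sketch of that approach (the MY_ prefix is just an example name chosen to avoid clashing with the defaults):

import os

import boto3
from dotenv import load_dotenv

# The .env file contains e.g. MY_AWS_ACCESS_KEY_ID=... and
# MY_AWS_SECRET_ACCESS_KEY=... instead of the standard names.
load_dotenv('/mnt/cache/.env', override=True)

# Pass the renamed variables explicitly so they never collide with the
# credentials Fused sets by default:
s3 = boto3.client(
    's3',
    aws_access_key_id=os.environ['MY_AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['MY_AWS_SECRET_ACCESS_KEY'],
)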

If you share a reproducible example of how you intended to use the Arraylake token, we can take a look to ensure there's a path forward for all users.