flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[BUG] Circular Dependency issue in Flyte - pyathena #4855

Open jayanthshimoga opened 5 months ago

jayanthshimoga commented 5 months ago

Describe the bug

Circular Dependency issue in Flyte.

We get the error below when using PandasCursor to fetch an AWS Athena query result in a Flyte workflow. It occurs only with PyAthena >3.0.10. The same Python code works fine in other applications but fails inside Flyte.

The Python code works fine inside the Docker image, but when the same image is registered to the Flyte cluster, the code breaks.

__init__() got an unexpected keyword argument 'connection'

Working:

Not working:

Expected behavior

We run an Athena query by connecting to AWS and want the result as a pandas DataFrame. We can see the query being executed in Athena; the only issue is reading the result from S3.

Expected behaviour: read the data from S3 and convert it to a pandas DataFrame. This issue occurs only inside the Flyte cluster.
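
Since the same image behaves differently once it is registered to the cluster, one way to compare the two environments is to print the resolved package versions from both places. This is a minimal sketch using only the standard library; the helper name `report_versions` and the package list are assumptions chosen for illustration and are not taken from the reproduction repo.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package: version or None} for the given distribution names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Hypothetical list: the packages that seem relevant to this report.
    relevant = ["pyathena", "pandas", "fsspec", "s3fs", "flytekit"]
    for name, version in report_versions(relevant).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this once in the plain container and once inside a Flyte task would show whether the two actually resolve the same versions.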

Additional context to reproduce

Context: We run an Athena query by connecting to AWS and want the result as a pandas DataFrame. We can see the query being executed in Athena; the only issue is reading the result from S3. I separately tried accessing the S3 bucket and running AWS commands via Python subprocess. I do have access and am able to fetch the S3 objects. My assumption is that the connection is getting lost in Flyte when PyAthena tries to fetch the result.

Flow:

  1. It calls the execute function in PyAthena: https://github.com/laughingman7743/PyAthena/blob/2d88c3e6172295880d011a5f151bc273b9d577ad/pyathena/pandas/cursor.py#L121C9-L121C16
  2. It tries to get the Athena pandas result set: https://github.com/laughingman7743/PyAthena/blob/2d88c3e6172295880d011a5f151bc273b9d577ad/pyathena/pandas/cursor.py#L160
  3. AthenaPandasResultSet has an as_pandas function, which internally calls the read_csv function, and the error occurs there: https://github.com/laughingman7743/PyAthena/blob/2d88c3e6172295880d011a5f151bc273b9d577ad/pyathena/pandas/result_set.py#L295

I don't know why Flyte is unable to establish a connection to S3 when the call comes from PyAthena.
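
The error text says that some constructor received a `connection` keyword it does not accept. pandas reads `s3://` URLs through fsspec, so one thing that may be worth checking (an assumption, not something confirmed in this thread) is which filesystem class fsspec resolves for the `s3` protocol in each environment. The helper name `describe_s3_filesystem` below is hypothetical; `fsspec.get_filesystem_class` is the fsspec registry lookup.

```python
def describe_s3_filesystem():
    """Return the dotted name of the class fsspec uses for 's3', if any.

    Returns None when fsspec is not installed and "unresolved" when no
    implementation for the protocol can be imported.
    """
    try:
        import fsspec
    except ImportError:
        return None
    try:
        cls = fsspec.get_filesystem_class("s3")
    except (ImportError, ValueError):
        # No usable s3 implementation is available in this environment.
        return "unresolved"
    return f"{cls.__module__}.{cls.__qualname__}"


if __name__ == "__main__":
    print("fsspec s3 implementation:", describe_s3_filesystem())
```

If the value printed inside a Flyte task differs from the one printed in the standalone container, that difference would be a reasonable place to start looking.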

Screenshots

All the screenshots and sample code are available at https://github.com/jayanthshimoga/flytefit. Please use your own AWS access key, secret key, and S3 path.

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

dosubot[bot] commented 5 months ago

🤖

Hey there, @jayanthshimoga! Great to see you back. Hope you're doing well!

Based on the information you've provided, it seems like the issue might be related to the differences in the environment between the Docker image and the Flyte cluster. Here are a few potential causes:

  1. Network Configuration: The Docker image and the Flyte cluster might have different network configurations. This could cause the Docker image to be unable to connect to the Flyte cluster if the network settings are not properly configured.

  2. Environment Variables: The Docker image and the Flyte cluster might have different environment variables set. This could cause the Docker image to be unable to connect to the Flyte cluster if the necessary environment variables are not set correctly.

  3. Software Versions: The Docker image and the Flyte cluster might be running different versions of software. This could cause the Docker image to be unable to connect to the Flyte cluster if the software versions are not compatible.

  4. Access Control: The Docker image and the Flyte cluster might have different access control settings. This could cause the Docker image to be unable to connect to the Flyte cluster if the Docker image does not have the necessary permissions.

In the provided context, the 'connection' argument error could be due to incorrect configuration of the storage connection in the 'flyte-admin-config' ConfigMap. The connection details for the 'minio' storage are provided under the 'storage' key in the ConfigMap. If these details are not correctly set, it could lead to a 'connection' argument error.

Here is the relevant part of the configuration:

storage:
  type: minio
  connection:
    access-key: minio
    auth-type: accesskey
    secret-key: miniostorage
    disable-ssl: true
    endpoint: "http://minio:9000"
    region: my-region

Ensure that the 'access-key', 'secret-key', 'endpoint', and 'region' are correctly set according to your Minio storage configuration.

As for the changes in PyAthena version >3.0.10, I wasn't able to find specific changes that could cause the 'connection' argument error. It might be helpful to check the PyAthena's release notes or changelog for the versions after 3.0.10 to see if there were any changes that could potentially cause this issue.

If the issue persists, could you please provide more details about the error message? Specifically, the full traceback would be very helpful in diagnosing the problem. Also, could you please confirm if the same error occurs with other versions of PyAthena or only with versions greater than 3.0.10?

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

jayanthshimoga commented 4 months ago

Hi team, any update on this? It would be very helpful.

pingsutw commented 4 months ago

@jayanthshimoga are you able to share a minimal repro?

jayanthshimoga commented 4 months ago

Hi @pingsutw, please find the repo link here: https://github.com/jayanthshimoga/flytefit. It's very small, quick to set up, and makes it easy to recreate the issue.

kumare3 commented 1 month ago

cc @pingsutw / @eapolinario has anyone looked into this?