DougTrajano / mlflow-server

MLflow Tracking Server with basic auth deployed in AWS App Runner.
https://gallery.ecr.aws/t9j8s4z8/mlflow
Apache License 2.0

Problem with MLFlow server reading S3 bucket #168

Closed by aduverger 2 years ago

aduverger commented 2 years ago

Hello Douglas,

First of all, thank you for this repo, I find the idea fantastic :) I just tried the full workflow on AWS and I have an issue with logging artifacts.

When I log an artifact with mlflow.log_artifact('path_to_some_pkl', '/models'), the pickle files are all saved in the S3 bucket that is linked to the MLflow server, but I can't access them from the MLflow UI.

I've been looking into the App Runner logs, and it seems that the MLflow server can't reach the S3 bucket:

urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPSConnection object at 0x7ff865bc0970>, 'Connection to <S3_BUCKET_NAME>.s3.amazonaws.com timed out. (connect timeout=60)')

I'm not very familiar with App Runner and VPCs. Do you have any idea why this happens? I changed the S3 bucket to public access (just to see if it changed anything), but the same thing happens. So it seems to be linked to how App Runner reaches S3 over the network rather than to bucket permissions.

Thanks a lot for your help !

DougTrajano commented 2 years ago

Hi @aduverger

Thank you! I'm very happy with your comment. :)

I figured out this issue a few weeks ago. It happens because the VPC needs a "VPC Endpoint" registered for Amazon S3.

How did you run terraform apply? Did you provide your own VPC ID, or did you let it create a new VPC?

If you already have a VPC, please add a VPC Endpoint to Amazon S3, see the code here:

https://github.com/DougTrajano/mlflow-server/blob/f5b876d97e65e3ff0e926f5186c09f3b435d06df/terraform/network.tf#L52-L63

You can also check whether the VPC Endpoint exists and, if it doesn't, create it manually.
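If it helps, the check above can be sketched in Python. This is only a rough sketch under a few assumptions: that the endpoint is a gateway-type endpoint for S3, that boto3 is installed and has AWS credentials, and that the VPC ID and region shown are hypothetical placeholders to replace with your own. The function names are made up for illustration and are not part of this repo.

```python
def s3_gateway_endpoints(endpoints, region):
    """Filter endpoint descriptions down to usable S3 gateway endpoints.

    `endpoints` is the "VpcEndpoints" list returned by the EC2
    DescribeVpcEndpoints API. An endpoint counts as usable here when it
    targets S3 in `region`, is of type Gateway (assumption), is available,
    and is associated with at least one route table.
    """
    service_name = f"com.amazonaws.{region}.s3"
    return [
        ep
        for ep in endpoints
        if ep.get("ServiceName") == service_name
        and ep.get("VpcEndpointType", "").lower() == "gateway"
        and ep.get("State", "").lower() == "available"
        and ep.get("RouteTableIds")
    ]


def fetch_vpc_endpoints(vpc_id, region):
    """Fetch the first page of VPC endpoint descriptions for one VPC."""
    # Imported lazily so the pure filter above works without boto3.
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return response["VpcEndpoints"]


if __name__ == "__main__":
    # Hypothetical values: replace with your own VPC ID and region.
    vpc_id = "vpc-0123456789abcdef0"
    region = "us-east-1"

    usable = s3_gateway_endpoints(fetch_vpc_endpoints(vpc_id, region), region)
    if usable:
        for ep in usable:
            print(ep["VpcEndpointId"], "route tables:", ep["RouteTableIds"])
    else:
        print("No available S3 gateway endpoint with a route table in", vpc_id)
```

An endpoint that exists but has no route table associated would be filtered out here, which is one of the things worth checking when the endpoint shows up in the console but S3 connections still time out.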

aduverger commented 2 years ago

Hi @DougTrajano ,

Thank you for the quick reply !

I left the variables vpc_id and vpc_security_group_ids at their defaults, so terraform apply created a new VPC. I checked on AWS and the VPC Endpoint has also been created and is linked to the VPC. The route table and the subnet are there too. So it seems the whole family is here :/

DougTrajano commented 2 years ago

Sorry for the late answer.

So, could you please delete the infrastructure, update your clone of the repo, and create it again?

Please save the terraform apply log to help me understand how the resources were created.

portega-inbrain commented 1 year ago

@aduverger, were you able to fix this? I'm having the same issue with the default values.