Open grossamit opened 3 years ago
@grossamit What are you running when you see this error? Where is the error coming from? Do you see this error all the time or just intermittently?
@tomfaulhaber It happens intermittently. Please note that I'm specifying VPCs and subnet lists for my run using VpcConfig and NetworkConfig. I get these errors a lot. The weird thing is that if you wait approximately 40 minutes it recovers, until it happens again. Full error: Failed (ClientError: Failed to download data. ListObjectsV2 failed for s3://aws-emr-resources-406095609952-us-east-1/dataAccess/amit.gross@logz.io/papermill_input/searchVariables_MP_extractMaping_prod.ipynb-2021-12-31-09-26-33.ipynb, nextToken:[null]: Unable to execute request to S3)
I'm also running the notebook with parameters and an instance type.
@grossamit My guess is that this is an issue with the way you're routing connections from your SageMaker Processing node to your VPC. One thing to check is whether your subnet definitions are right, whether your security groups have fixed IPs, and whether anything else could break depending on which IP address the SageMaker Processing instance is given.
@tomfaulhaber Thanks for your reply! I believe that if that were the case, it would fail every time. Currently I have ~30% success :-( I do not have fixed IPs. I'll play a little with the subnets.
@grossamit I would expect exactly this behavior if, for example, you had a VPC with multiple subnets but only enabled the S3 endpoint for a single subnet.
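One way to test that hypothesis is to compare the subnets passed in VpcConfig against the route tables attached to the S3 gateway endpoint. The sketch below is only an illustration, not something from this thread: the function name `subnets_missing_s3_endpoint` is hypothetical, and it assumes its inputs are the `RouteTables` and `VpcEndpoints` lists returned by the EC2 `describe_route_tables` and `describe_vpc_endpoints` calls for a single VPC. It only looks at gateway endpoints, so a subnet it flags may still reach S3 through a NAT gateway or an interface endpoint.

```python
def subnets_missing_s3_endpoint(subnet_ids, route_tables, vpc_endpoints):
    """Return the subnets whose route table has no S3 gateway endpoint attached.

    Hypothetical helper: route_tables and vpc_endpoints are assumed to be the
    "RouteTables" and "VpcEndpoints" lists from the EC2 describe_* responses,
    all belonging to the same VPC.
    """
    # Route tables that some S3 gateway endpoint is attached to.
    covered = set()
    for endpoint in vpc_endpoints:
        is_gateway = endpoint.get("VpcEndpointType") == "Gateway"
        is_s3 = endpoint.get("ServiceName", "").endswith(".s3")
        if is_gateway and is_s3:
            covered.update(endpoint.get("RouteTableIds", []))

    # Subnets with an explicit association use that route table;
    # every other subnet falls back to the VPC's main route table.
    explicit = {}
    main_table = None
    for table in route_tables:
        for assoc in table.get("Associations", []):
            if assoc.get("Main"):
                main_table = table["RouteTableId"]
            if "SubnetId" in assoc:
                explicit[assoc["SubnetId"]] = table["RouteTableId"]

    return [s for s in subnet_ids if explicit.get(s, main_table) not in covered]
```

If the returned list is non-empty, jobs placed in those subnets would be the ones expected to fail, which would match a partial success rate.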
Playing with this today, we realized there's an interaction between processing jobs and VPCs that's working differently than I understood. I think we can come up with a workaround.
Hi @tomfaulhaber, could you share your findings? We face the same issue with a Processing job. We run the job in private subnets with this NetworkConfig:
"NetworkConfig": {
"EnableInterContainerTrafficEncryption": false,
"EnableNetworkIsolation": false/true, # (we tried both)
"VpcConfig": {
"SecurityGroupIds": [
"sg-xxx"
],
"Subnets": [
"subnet-xxx",
"subnet-xxx",
"subnet-xxx"
]
}
}
But it can't access the bucket with the input data:
sagemaker.exceptions.UnexpectedStatusException: Error for Processing job my-processing-job: Failed.
Reason: ClientError: Failed to download data.
ListObjectsV2 failed for s3://my-bucket/input-data/, nextToken:[null]:
Unable to execute request to S3
Hi @tomfaulhaber , Any progress with this? It really makes the solution unreliable. Anything I can help?
Any updates on this? Exact same issue
For me,
I need to run my SageMaker processing job within a VPC and within a subnet. I'm specifying the subnet and VPC like this:
--extra '{ "NetworkConfig": { "EnableInterContainerTrafficEncryption": false, "EnableNetworkIsolation": false, "VpcConfig": { "SecurityGroupIds": [ "sg-xxxxxx" ], "Subnets": [ "subnet-xxxxxx" ] } } }'
However, I get an s3.listobject failure as soon as I use it. I also need to operate within a VPC/subnet with an IP range to connect to another service.
So I think you need to create a VPC endpoint. For some reason processing jobs don't have access to AWS internal services despite being inside your VPC/subnet and having an ARN and role. You need to create a VPC endpoint, which is kind of like a pipe that gives SageMaker processing jobs direct access to specific internal services.
Would probably be a good thing to add to the script, hah.
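For anyone who wants to script that step, here is a minimal sketch of creating an S3 gateway endpoint with boto3. The helper names are hypothetical and the IDs in the usage note are placeholders; the only AWS call used is the EC2 client's `create_vpc_endpoint`. It assumes a gateway endpoint (rather than an interface endpoint) is what you want, and that you pass the route tables of every subnet the job may run in.

```python
def s3_gateway_endpoint_request(vpc_id, region, route_table_ids):
    """Build the keyword arguments for ec2.create_vpc_endpoint (hypothetical helper)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Regional S3 service name, e.g. com.amazonaws.us-east-1.s3
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Attach the endpoint to the route table of every subnet the job uses.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, region, route_table_ids):
    """Create the endpoint; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the request builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    request = s3_gateway_endpoint_request(vpc_id, region, route_table_ids)
    return ec2.create_vpc_endpoint(**request)
```

A call would look like `create_s3_gateway_endpoint("vpc-xxx", "us-east-1", ["rtb-xxx"])`, with the placeholders replaced by your own IDs.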
Also experiencing this. Any updates?
I ended up switching back to no VPC after a few tries and realized that my IAM role's policy was slightly off. I only had the bucket ARN with the wildcard suffix after it, when I also needed to add the bare bucket ARN with no suffix after it, as follows:
{
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::bucket", <----- WAS MISSING THIS
"arn:aws:s3:::bucket/**"
]
},
I'll update it if I get it working with the VPC
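The reason both entries matter is that s3:ListBucket (which ListObjectsV2 needs) is checked against the bucket ARN itself, while s3:GetObject and s3:PutObject are checked against object ARNs under the bucket. A rough way to lint a statement for this is sketched below; `missing_s3_resources` is a hypothetical helper, and it only does literal string matching, so broader wildcards such as `*` or `arn:aws:s3:::*` are not recognized as covering the bucket.

```python
def missing_s3_resources(statement, bucket):
    """Return the Resource ARNs a policy statement lacks for one bucket.

    Hypothetical helper using plain string comparison only.
    """
    resources = statement.get("Resource", [])
    if isinstance(resources, str):  # IAM allows a single string here
        resources = [resources]

    bucket_arn = f"arn:aws:s3:::{bucket}"
    missing = []
    # Bucket-level actions such as s3:ListBucket need the bare bucket ARN.
    if bucket_arn not in resources:
        missing.append(bucket_arn)
    # Object-level actions need an ARN that points below the bucket.
    if not any(r.startswith(bucket_arn + "/") for r in resources):
        missing.append(bucket_arn + "/*")
    return missing
```

Run against the statement as it was before the fix, this reports the bare bucket ARN as the missing entry.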
ClientError: Failed to download data. ListObjectsV2 failed for s3://.... nextToken:[null]: Unable to execute request to S3
The thing is that sometimes it succeeds and sometimes not. I've also added code to wait 10 seconds after the notebook upload and verify with ListObjectsV2 that the file exists.
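A wait-and-verify step like the one described might look roughly like this. It is a sketch with a hypothetical function name, not the commenter's actual code; `s3_client` is assumed to be a boto3 S3 client (or anything with a compatible `list_objects_v2` method). Note that the check runs wherever the submitting script runs, so it can confirm the upload landed but cannot detect a routing problem inside the job's VPC.

```python
import time


def wait_for_object(s3_client, bucket, key, attempts=5, delay_seconds=10.0):
    """Poll ListObjectsV2 until `key` shows up; return True if it was found."""
    for attempt in range(attempts):
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key)
        # "Contents" is absent when nothing matches the prefix.
        if any(obj["Key"] == key for obj in response.get("Contents", [])):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```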
I got the same error, but then I removed all the network config from my processing job, and it works!
Any update here? I created an S3 VPC endpoint but it's still giving me that error. I'm using training jobs in an isolated subnet.
@gabriel-loka I had the same problem. I solved it by allowing port 443 in the security group used for the connection.
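S3 requests go over HTTPS, so a security group that blocks TCP port 443 would produce this kind of failure even with an endpoint in place. The sketch below checks a security group's outbound rules for port 443. It assumes the rule in question is an egress rule (the comment above does not say which direction was changed), that `security_group` is one entry from the `SecurityGroups` list returned by EC2 `describe_security_groups`, and it ignores which destinations each rule allows. The function name is hypothetical.

```python
def allows_https_egress(security_group):
    """Return True if any outbound rule permits TCP traffic on port 443."""
    for rule in security_group.get("IpPermissionsEgress", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "tcp":
            # Defaults are chosen so a rule without ports never matches.
            low = rule.get("FromPort", 0)
            high = rule.get("ToPort", -1)
            if low <= 443 <= high:
                return True
    return False
```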
Facing this problem too. Any update?
I had the same error. In my case I preferred not to have a NAT GW, thus I used the public access option when I configured the domain in Sagemaker. Following a suggestion to create a VPC Endpoint for S3 solved this problem for me.
Thanks @papierGaylard ;)