aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECR] [request]: support custom domains, or alternate URIs for repositories #299

Open philippmoehler0440 opened 5 years ago

philippmoehler0440 commented 5 years ago

Tell us about your request Currently a repository URI looks like this: <account_id>.dkr.ecr.<region>.amazonaws.com/<repository>. The account ID and region are moving parts, which has negative effects in the scenarios described below. It would be helpful to be able to define an alternate URI for ECR repositories.

Which service(s) is this request for? ECR (and maybe other container services?)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Our team provides a docker image for ~12 other teams that acts as a build tool for frontend resources within their pipelines. We identified different disaster recovery scenarios where the current ECR URI is a disadvantage:

(1) Unavailability of ECR within the specified region

(2) Disaster recovery for the ECR account

An alternate repository URI could serve as a fixed interface for consumers: changes to the account ID or region behind it would no longer affect them.

Are you currently working around this issue? One way we have seen to "solve" this is to run nginx as a reverse proxy in front of ECR, but that is an effort we would rather not take on.

Additional context This topic from January 2016 also describes some pain points with this.


julienbonastre commented 2 years ago

Can you share the following?

  1. Commands you're using for docker build and tagging of your images; are you tagging with both the ECR URL and your custom URL?

docker build -t myecr.org.com/imagepath:tag .

  2. Commands you're using for docker login: are you using the ECR URL or your custom URL?

aws ecr get-login-password --region <my-region> | docker login -u AWS --password-stdin myecr.org.com

  3. Commands you're using for docker push: are you pushing to the ECR URL or your custom URL?

docker push myecr.org.com/imagepath:tag

  4. Commands you're using for docker pull: are you pulling from the ECR URL or your custom URL?

docker pull myecr.org.com/imagepath:tag

Also I know you described your NGINX configuration above, but can you give us a little more info as to how all of this is setup? Do you have one or more NGINX nodes running with public IPs, listening for TLS on 443, modifying the Host header, and passing to the upstream? Is that all there is to it?

I am using a private ECR registry in a hub/central account. A Fargate service runs an nginx task which listens on 443 and forwards HTTPS to the target AWS ECR FQDN, rewriting the Host header as necessary. This Fargate service sits behind an ALB with an ACM certificate for my custom ECR FQDN.

Crux of the nginx.conf that does the heavy lifting:

server { 
  listen 443 ssl http2; 
  listen [::]:443 ssl http2; 
  ssl_certificate /etc/ssl/certs/nginx-selfsigned.crt; 
  ssl_certificate_key /etc/ssl/private/nginx-selfsigned.key; 
  ssl_dhparam /etc/ssl/certs/dhparam.pem;
  chunked_transfer_encoding on; 
  client_max_body_size 0; 
  server_name _;

  location / {
    proxy_pass https://<aws acct id>.dkr.ecr.ap-southeast-2.amazonaws.com;
    proxy_set_header Host "<aws acct id>.dkr.ecr.ap-southeast-2.amazonaws.com"; 
    proxy_set_header X-Real-IP $remote_addr; 
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; 
    proxy_set_header X-Forwarded-Proto "https"; 
    proxy_read_timeout 900; 
  }
}

That's it.

Note: I couldn't use an HTTP 301 redirect rule on an ALB listener for this. It doesn't behave like a masquerading proxy: it just hands a 301 back to the docker client instead of transparently passing the request through to the upstream. That difference is why I'm using nginx.

EDIT: If I'm able to get all of this working with CloudFront, ACM, and Lambda, I will (:pray:) publish my code as a Terraform module to the Terraform module registry so that others can do this without any hassle.

Nice idea. I'm considering sanitising the pattern I've used and sharing it. It is a Go CDK setup which deploys the hub infra (ACM, ALB, Fargate cluster with an nginx proxy service) and builds out all the desired team repos with custom KMS keys per team, repo lifecycle and access policies, driven from a configurable set of JSON configs (one per business team) and executed with a GitHub Action for live updates.

mhornbacher commented 2 years ago

@julienbonastre is this truly it? Every endpoint returns a 503 when I attempt to spin this configuration up in a docker container...

julienbonastre commented 2 years ago

@mhornbacher, 💯 ... here is the verbatim code used in our full production, multi-account centralised ECR solution with a custom FQDN

docker-entrypoint.sh

#!/usr/bin/env sh
set -eu

envsubst '${TARGET_ECR_FQDN}' < /etc/nginx/conf.d/default.conf.template > /etc/nginx/nginx.conf

exec "$@"

Dockerfile

ARG BASE_NGINX_IMAGE=nginx:latest

FROM ${BASE_NGINX_IMAGE}

ARG TARGET_ECR
ARG ECR_FQDN

ENV ECR_FQDN=${ECR_FQDN}
ENV TARGET_ECR_FQDN=${TARGET_ECR}

RUN mkdir -p /etc/ssl/private
RUN chmod 700 /etc/ssl/private

RUN openssl req -x509 -nodes -days 365                 \
    -newkey rsa:2048                                    \
    -keyout /etc/ssl/private/nginx-selfsigned.key       \
    -out /etc/ssl/certs/nginx-selfsigned.crt            \
    -subj "/C=AU/ST=NA/L=NA/O=OrganisationName/CN=${ECR_FQDN}"    

RUN openssl dhparam -out /etc/ssl/certs/dhparam.pem 2048

COPY nginx-template.conf /etc/nginx/conf.d/default.conf.template
COPY docker-entrypoint.sh /
ENTRYPOINT ["/docker-entrypoint.sh"]

CMD ["nginx", "-g", "daemon off;"]

EXPOSE 80 443

nginx-template.conf

user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

# Load dynamic modules. See /usr/share/doc/nginx/README.dynamic.
include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 1024;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 4096;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;

    # Load modular configuration files from the /etc/nginx/conf.d directory.
    # See http://nginx.org/en/docs/ngx_core_module.html#include
    # for more information.
    include /etc/nginx/conf.d/*.conf;

# Settings for a TLS enabled server.
#
    server {
        listen       443 ssl http2;
        listen       [::]:443 ssl http2;
        ssl_certificate /etc/ssl/certs/nginx-selfsigned.crt;
        ssl_certificate_key /etc/ssl/private/nginx-selfsigned.key;
        ssl_dhparam /etc/ssl/certs/dhparam.pem;
#        ssl_session_cache shared:SSL:1m;
#        ssl_session_timeout  10m;
        chunked_transfer_encoding on;
        client_max_body_size 0;
        server_name     _;

        ########################################################################
        # from https://cipherli.st/                                            #
        # and https://raymii.org/s/tutorials/Strong_SSL_Security_On_nginx.html #
        ########################################################################

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_prefer_server_ciphers on;
        ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH";
        ssl_ecdh_curve secp384r1;
        ssl_session_cache shared:SSL:10m;
        ssl_session_tickets off;
        ssl_stapling on;
        ssl_stapling_verify on;
        resolver 8.8.8.8 8.8.4.4 valid=300s;
        resolver_timeout 5s;
        # Disable preloading HSTS for now.  You can use the commented out header line that includes
        # the "preload" directive if you understand the implications.
        #add_header Strict-Transport-Security "max-age=63072000; includeSubdomains; preload";
        add_header Strict-Transport-Security "max-age=63072000; includeSubdomains";
        add_header X-Frame-Options DENY;
        add_header X-Content-Type-Options nosniff;

        ##################################
        # END https://cipherli.st/ BLOCK #
        ##################################

        location / {
                proxy_pass              https://${TARGET_ECR_FQDN};
                proxy_set_header        Host                "${TARGET_ECR_FQDN}";
                proxy_set_header        X-Real-IP           $remote_addr;
                proxy_set_header        X-Forwarded-For     $proxy_add_x_forwarded_for;
                proxy_set_header        X-Forwarded-Proto   "https";
                proxy_read_timeout      900;
        }
    }
}

And here is an excerpt from the Go CDK app that builds out the ECS Fargate cluster and references the above-mentioned Docker image:

func buildEcrProxyApp(
    stack awscdk.Stack,
    vpc awsec2.IVpc,
    ecrFqdn *string,
    stackProps awscdk.StackProps,
    proxySG awsec2.SecurityGroup,
    privateSubnets awsec2.SubnetSelection,
    proxyTargetGrp awselasticloadbalancingv2.ApplicationTargetGroup,
    taskExecRole awsiam.Role,
) {
    fargateCluster := awsecs.NewCluster(
        stack,
        jsii.String("EcrProxyCluster"),
        &awsecs.ClusterProps{
            ClusterName: jsii.String(
                fmt.Sprintf("%v", *stackProps.StackName)),
            //ContainerInsights:              nil,
            EnableFargateCapacityProviders: jsii.Bool(true),
            Vpc:                            vpc,
        })

    targetEcrFqdn := jsii.String(fmt.Sprintf("%v.dkr.ecr.%v.amazonaws.com",
        *stackProps.Env.Account,
        *stackProps.Env.Region))

    proxyImage := awsecrassets.NewDockerImageAsset(
        stack,
        jsii.String("EcrProxyDockerImage"),
        &awsecrassets.DockerImageAssetProps{
            BuildArgs: &map[string]*string{
                "ECR_FQDN":   ecrFqdn,
                "TARGET_ECR": targetEcrFqdn,
            },
            Directory: jsii.String("../../ecr-proxy"),
        })

    proxyTaskDef := awsecs.NewFargateTaskDefinition(
        stack,
        jsii.String("EcrProxyTaskDefinition"),
        &awsecs.FargateTaskDefinitionProps{
            ExecutionRole:  taskExecRole,
            Family:         jsii.String("ecr-proxy"),
            Cpu:            jsii.Number(256),
            MemoryLimitMiB: jsii.Number(512),
        })

    awsecs.NewContainerDefinition(
        stack,
        jsii.String("EcrProxyContainerDefinition"),
        &awsecs.ContainerDefinitionProps{
            Image:         awsecs.ContainerImage_FromDockerImageAsset(proxyImage),
            ContainerName: jsii.String("nginx-task"),
            Logging: awsecs.NewAwsLogDriver(
                &awsecs.AwsLogDriverProps{
                    StreamPrefix: jsii.String("ecs-fargate"),
                    LogGroup: awslogs.NewLogGroup(
                        stack,
                        jsii.String("EcrProxyLogGroup"),
                        &awslogs.LogGroupProps{
                            //EncryptionKey: nil,
                            LogGroupName:  jsii.String("/ecs-fargate/ecr-proxy"),
                            RemovalPolicy: awscdk.RemovalPolicy_DESTROY,
                            Retention:     awslogs.RetentionDays_SIX_MONTHS,
                        }),
                    //LogRetention:     "",
                    Mode: awsecs.AwsLogDriverMode_NON_BLOCKING,
                    //MultilinePattern: nil,
                }),
            //MemoryLimitMiB:                nil,
            PortMappings: &[]*awsecs.PortMapping{
                {
                    ContainerPort: jsii.Number(443),
                    Protocol:      awsecs.Protocol_TCP,
                },
            },
            TaskDefinition: proxyTaskDef,
        })

    proxyService := awsecs.NewFargateService(
        stack,
        jsii.String("EcrProxyService"),
        &awsecs.FargateServiceProps{
            Cluster:              fargateCluster,
            DesiredCount:         jsii.Number(2),
            EnableECSManagedTags: jsii.Bool(true),
            //HealthCheckGracePeriod:     awscdk.Duration_Seconds(jsii.Number(0)),
            MaxHealthyPercent: jsii.Number(200),
            MinHealthyPercent: jsii.Number(100),
            PropagateTags:     awsecs.PropagatedTagSource_TASK_DEFINITION,
            ServiceName:       jsii.String("ecr-proxy"),
            TaskDefinition:    proxyTaskDef,
            AssignPublicIp:    jsii.Bool(false),
            PlatformVersion:   awsecs.FargatePlatformVersion_LATEST,
            SecurityGroups:    &[]awsec2.ISecurityGroup{proxySG},
            VpcSubnets:        &privateSubnets,
        })

    proxyService.AttachToApplicationTargetGroup(proxyTargetGrp)
}

I really hope this helps.

Where are you receiving the 503 from? An ALB, or the docker container itself? If it's the container, it could just be an nginx config issue...

WhyNotHugo commented 2 years ago

Note: I couldn't use an HTTP 301 redirect rule on an ALB listener for this. It doesn't behave like a masquerading proxy: it just hands a 301 back to the docker client instead of transparently passing the request through to the upstream. That difference is why I'm using nginx.

301 is a redirection. That is, it returns a message to the client saying "please use this alternative URL instead". What you want is to forward traffic, which is what nginx is doing.

You should be able to use ALB or even CloudFront instead of nginx. CF would likely be cheaper too.

julienbonastre commented 2 years ago

You should be able to use ALB or even CloudFront instead of nginx. CF would likely be cheaper too.

I hear you; in theory this sounds great, and I'd be very open to other options if I could see any evidence or method demonstrating how these would do this. 😜

When you mention ALB, what exact method are you proposing?
The only documented options are forwarding to one or more target groups, a fixed response, or a redirect response (301/302). Is there something I'm missing?

As for CF, the notion sounds good, but I haven't yet seen or heard of a way to present a CloudFront distribution only to a private network environment; all the documentation I see points to it being a public-facing ingress solution.

Again, please let me know if I've grossly overlooked something here.

Thanks @WhyNotHugo, whilst I'm quite satisfied with our current solution I'm always keen to find a better, cheaper or smarter way to refactor, for sure 👌🏼🙏🤗🚀

naftulikay commented 2 years ago

While the NGINX approach seems to be working great for everyone, I am still unable to get the CloudFront Lambda@Edge solution working after investing many, many late nights and weekends.

The other extremely helpful posters above detailed what they needed to do, and in reality, it's fairly simple: rewrite the Host header before sending the request upstream to the private ECR registry, and set typical proxy headers as you would normally do.

I have CloudFront with ECR specified as an HTTPS origin, and a Lambda@Edge function for the origin-request and origin-response events. During a request, the following events occur in CloudFront for Lambda@Edge:

  1. viewer-request: the actual request from the viewer
  2. origin-request: the request to the origin from the CDN, this is where we need to modify the headers
  3. origin-response: the response obtained from the origin
  4. viewer-response: the response as it is sent to the viewer
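
For illustration, the origin-request rewrite itself is tiny. Here is roughly the shape of it in TypeScript for the Node.js runtime (a sketch only, not the actual code from my module; types are from @types/aws-lambda):

import type { CloudFrontRequestEvent, CloudFrontRequest } from "aws-lambda";

export const handler = async (event: CloudFrontRequestEvent): Promise<CloudFrontRequest> => {
  const request = event.Records[0].cf.request;
  // CloudFront attaches the configured origin to the event, so the upstream
  // hostname does not need to be hard-coded per account/region.
  const upstreamHost = request.origin?.custom?.domainName;
  if (upstreamHost) {
    // ECR rejects requests whose Host header does not match its own FQDN,
    // so replace the viewer-supplied Host before CloudFront contacts the origin.
    request.headers["host"] = [{ key: "Host", value: upstreamHost }];
  }
  // Returning the (possibly modified) request tells CloudFront to continue to the origin.
  return request;
};

Returning the request object (rather than a response) is what lets the request continue upstream.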

I have installed Lambda@Edge functions at the origin request level to modify Host, and on the origin response level to see what the hell is happening. It has been a nightmare to debug this. I had to write a custom CLI tool in Rust to monitor multiple log streams in multiple AWS regions, because Lambda@Edge can run in any region so you have to tail all of them. I've even enabled debug logging on my local Docker daemon and found nothing interesting. If I could MITM my local Docker daemon and CloudFront, I'd then be able to see the actual HTTP response to try to discover what is really going on.

Login and pull work, so clearly authorization is working, but push fails with unauthorized: authentication required. I'm convinced at this point that CloudFront is doing something wrong, so I'm probably going to have to do something even more absurd and put an NGINX node between CloudFront and ECR to see what the hell is going wrong.

I think it's an issue with the CloudFront origin request policy, the cache policy, and perhaps the cache behavior. Hopefully I'll regain some fortitude soon to try untangling this nonsense again. If I can get it working, I'll publish my Terraform so that anyone will be able to use this without running any servers. I might publish my Terraform regardless just so that others can audit it. I wish I had pro support on my account, but it's a personal account and I can't pay for that.

SebastianUA commented 2 years ago

Developed a workaround to pulling images from AWS ECR through AWS Route53 CNAME: https://medium.com/@solo.metalisebastian/pulling-pushing-out-any-aws-ecr-images-from-to-aws-ecr-through-aws-route53-cname-7c92307f9c25

naftulikay commented 2 years ago

Developed a workaround to pulling images from AWS ECR through AWS Route53 CNAME: https://medium.com/@solo.metalisebastian/pulling-pushing-out-any-aws-ecr-images-from-to-aws-ecr-through-aws-route53-cname-7c92307f9c25

So you're using a self-signed cert or just ignoring cert validity for the domain, right?

SebastianUA commented 2 years ago

@naftulikay I skip certificate-validation warnings in the Python script 😀

naftulikay commented 2 years ago

@naftulikay I skip certificate-validation warnings in the Python script 😀

Yes, that's what I was confirming.

Ultimately, the best solution would be one from AWS, similar to what they do in CloudFront: allow multiple domain names to be attached and a custom ACM certificate. That would solve all of this. If they did that, there would be no more hacks on our part, no need for Lambda@Edge, no need for hosting NGINX, and only billing from the Route 53, ACM, and ECR services themselves. Here's to remaining hopeful for an AWS-native solution for this.

Meanwhile I will try to get testing again soon. I'll probably set up Lambda@Edge functions for all request phases, set up NGINX between the edge and the origin as well, see if there is a way to get ECR logs, and write a tailer in Rust to track the full request lifecycle from CloudWatch.

Let me publish my Terraform as a public repo so that others can tinker as well.

SebastianUA commented 2 years ago


I'm going to publish a Helm chart with an Nginx proxy to resolve this issue soon. I would like to play around, test several solutions, and pick the optimal one.

thiagolsfortunato commented 2 years ago

Any news?

naftulikay commented 2 years ago

Okay everybody, I have published my Terraform/Lambda@Edge function at naftulikay/terraform-aws-private-ecr-domain. I would appreciate any and all help from the community toward arriving at something that works. docker login and docker pull are working, docker push fails for some reason, and I can't find the failed request in the CloudWatch logs.

If anyone can see and report any issues in my CloudFront configuration in Terraform, I would really appreciate it :pray:

I'd love to see us arrive at a serverless solution for using any custom domain name with your private ECR registry, and we're really close, just need some CloudFront expertise and some debugging to figure out why docker push isn't working.

drop-rahul commented 2 years ago

I am a bit late to this conversation, but my nginx pattern is failing on push too. I am using EKS to run the nginx pods behind a NodePort service, exposed by an ALB ingress and fronted by a Route 53 entry in our org's private zone. The push just hangs in a waiting state; it retries but nothing happens. I am using the following settings in my nginx config:

location / {
  proxy_pass https://AWS-ACCOUNT-ID.dkr.ecr.AWS-REGION.amazonaws.com;
  proxy_set_header Host "AWS-ACCOUNT-ID.dkr.ecr.AWS-REGION.amazonaws.com";
  proxy_set_header Authorization "Basic HARDCODED-FOR-POC";
  proxy_set_header X-Real-IP $remote_addr;
  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  proxy_set_header X-Forwarded-Proto "https";
  proxy_read_timeout 900;
}

The pod ends up on a node which has the right set of IAM policies.

Any pointers on what I am missing? @naftulikay @julienbonastre (and anyone else in the community who got this working)

naftulikay commented 2 years ago

So @WhyNotHugo filed an issue against my repository; I had forgotten to include the Lambda JavaScript code, but it is now present in master, so feel free to poke around and see if there are any glaring problems with my code. The Lambda does function as intended, so I suspect the issue is in the CloudFront configuration, as everything works except docker push.

@drop-rahul I don't know why your setup is not working, but mine does not attempt to do any authorization directly; it simply passes things through. The login token you are including will expire every 12 hours or so, so it's probably best to just pass it along without trying to set it in NGINX, but do whatever you see fit.

The basic idea is that all you really need to do is update the Host header to match the upstream registry: ${AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com. My Lambda does things a bit more intelligently: CloudFront passes the target origin as part of the event data and I use the hostname from there, so my Lambda should work regardless of which account you use it in.

drop-rahul commented 2 years ago

Adding /v2/ makes push work from a local machine doing docker push localhost:8080/reponame:tag, but the same settings do not work when the nginx pod is inside k8s behind an ALB. After looking at the logs of both the k8s nginx pod and the docker container running on localhost, I see that the docker push from localhost fires a lot of POST, PUT and PATCH requests, whereas the one going through the nginx pod inside k8s only fires PUT and POST. Not sure if this is the cause of push not working from an nginx pod behind a NodePort service using the AWS ALB ingress.

naftulikay commented 2 years ago

@drop-rahul if you figure out the exact HTTP requests that occur when you docker login, docker pull, and docker push, that would greatly help my debugging to know what those paths are and what requests/responses look like.

drop-rahul commented 2 years ago

@naftulikay Here are some logs from a failed attempt to push from the pod -

10.6.54.121 - - [15/Dec/2021:02:49:00 +0000] "HEAD /v2/org/repo/blobs/sha256:548ea99daea2fc939b874ea5a0b3f78b79eddfea63e14f84e6614c3ab39aa2b8 HTTP/1.1" 404 0 "-" "docker/20.10.8 go/go1.16.6 git-commit/75249d8 kernel/5.10.47-linuxkit os/linux arch/arm64 UpstreamClient(Docker-Client/20.10.8 \x5C(darwin\x5C))" "45.678.123.45"

10.6.54.121 - - [15/Dec/2021:02:49:00 +0000] "HEAD /v2/org/repo/blobs/sha256:cbf31660838a34a4e8608604ce4607d0ab97fbfb624a32575c939a27b61ddfd0 HTTP/1.1" 404 0 "-" "docker/20.10.8 go/go1.16.6 git-commit/75249d8 kernel/5.10.47-linuxkit os/linux arch/arm64 UpstreamClient(Docker-Client/20.10.8 \x5C(darwin\x5C))" "45.678.123.45"

10.6.54.121 - - [15/Dec/2021:02:49:00 +0000] "HEAD /v2/org/repo/blobs/sha256:55e25883642300e513b84ed15f2a98f4057c8199707147c17844478c925feb5c HTTP/1.1" 404 0 "-" "docker/20.10.8 go/go1.16.6 git-commit/75249d8 kernel/5.10.47-linuxkit os/linux arch/arm64 UpstreamClient(Docker-Client/20.10.8 \x5C(darwin\x5C))" "45.678.123.45"

10.6.54.121 - - [15/Dec/2021:02:49:01 +0000] "POST /v2/org/repo/blobs/uploads/ HTTP/1.1" 202 5 "-" "docker/20.10.8 go/go1.16.6 git-commit/75249d8 kernel/5.10.47-linuxkit os/linux arch/arm64 UpstreamClient(Docker-Client/20.10.8 \x5C(darwin\x5C))" "45.678.123.45"

/org/repo is the name of my ECR repository, 10.6.54.121 is the LB IP, and 45.678.123.45 is my IP from where I was trying to push. All redacted, of course.

drop-rahul commented 2 years ago

Update: pulling images using the custom domain works from inside a k8s cluster, but push fails. If I mimic the same setup on my local machine using a container, push works too. There is one difference visible in the logs: the local nginx container logs a warning that the request body has been buffered at /var/../.., and that's when it starts to fire the PATCH requests and completes the push.

My EKS setup: nginx running as a deployment, fronted by a NodePort service, fronted by an AWS ingress creating an internal load balancer, with an entry in my private zone on Route 53.

My local setup: nginx running locally, pointing to the Route 53 record, which is backed by an internet-facing LB talking to the same NodePort service in EKS. The LB is internet-facing due to some networking complications on VPN.

WhyNotHugo commented 2 years ago


@drop-rahul if you figure out the exact HTTP requests that occur when you docker login, docker pull, and docker push, that would greatly help my debugging to know what those paths are and what requests/responses look like.


I recommend using mitm-proxy for this kind of debugging.

Configure docker to use it as a proxy and you should see all details for all requests [and responses].

naftulikay commented 2 years ago

@WhyNotHugo yes, my plan is to set up mitmproxy on my local machine and to do so in the module as well so I can record exactly what is going on, when I get a chance. Fingers crossed that I can do this soon.

drop-rahul commented 2 years ago

Got it working finally, using nginx with openresty/openresty:1.15.8.3-2-alpine. Here is the working conf:

user nginx;
worker_processes 1;

events {
  worker_connections 1024;
}

http {
  include mime.types;
  default_type application/octet-stream;

  keepalive_timeout 65;
  sendfile on;

  proxy_cache_path /cache/cache levels=1:2 keys_zone=cache:16m inactive=1y max_size=CACHE_MAX_SIZE use_temp_path=off;
  resolver RESOLVER valid=30s;

  # This is necessary for us to be able to disable request buffering in all cases
  proxy_http_version 1.1;

  # will run before forking out nginx worker processes
  init_by_lua_block { require "cjson" }

  # https://docs.docker.com/registry/recipes/nginx/#setting-things-up
  map $upstream_http_docker_distribution_api_version $docker_distribution_api_version {
    '' 'registry/2.0';
  }

  access_log /dev/stdout;
  error_log /dev/stderr;

  server {
    listen 5000 default_server;

    # Cache

add_header X-Cache-Status   $upstream_cache_status;
proxy_temp_path /cache/temp 1 2;
proxy_ignore_headers        Cache-Control;

# disable any limits to avoid HTTP 413 for large image uploads
client_max_body_size 0;

# required to avoid HTTP 411: see Issue #1486 (https://github.com/moby/moby/issues/1486)
chunked_transfer_encoding on;

# increases timeouts to avoid HTTP 504
proxy_connect_timeout  3s;
proxy_read_timeout     300s;
proxy_send_timeout     300s;
send_timeout           300s;

# disable proxy request buffering
proxy_request_buffering off;

add_header 'Docker-Distribution-Api-Version' $docker_distribution_api_version always;
add_header "Access-Control-Allow-Origin" "*";

# health check
location /healthz {
        return 200;
}

location / {
  set $url        UPSTREAM;
  proxy_pass      $url;
  proxy_redirect  $url SCHEME://$host:PORT;

  # Add AWS ECR authentication headers
  proxy_set_header  X-Real-IP          $remote_addr;
  proxy_set_header  X-Forwarded-For    $remote_addr;
  proxy_set_header  X-Forwarded-User   "Basic $http_authorization";
  proxy_set_header  Authorization      "Basic $http_authorization";
  proxy_set_header  X-Forwarded-Proto  $scheme;

}

# Content addressable files like blobs.
# https://docs.docker.com/registry/spec/api/#blob
location ~ ^/v2/.*/blobs/[a-z0-9]+:[a-f0-9]+$ {
  set $url        UPSTREAM;
  proxy_pass      $url;
  proxy_redirect  $url SCHEME://$host:PORT;

  # Add AWS ECR authentication headers
  proxy_set_header  X-Real-IP          $remote_addr;
  proxy_set_header  X-Forwarded-For    $remote_addr;
  proxy_set_header  X-Forwarded-User   "Basic $http_authorization";
  proxy_set_header  Authorization      "Basic $http_authorization";
  proxy_set_header  X-Forwarded-Proto  $scheme;

  # When accessing image blobs using HTTP GET AWS ECR redirects with
  # s3 buckets uri to download the image. This needs to handled by
  # nginx rather then docker client for caching.
  proxy_intercept_errors    on;
  error_page 301 302 307 =  @handle_redirect;

}

# No authentication headers needed as ECR returns s3 uri with details in
# query params. Also the params should be part of cache key for nginx to
# issue HIT for same image blob.
location @handle_redirect {
  set                    $saved_redirect_location '$upstream_http_location';
  proxy_pass             $saved_redirect_location;
  proxy_cache            cache;
  proxy_cache_key        CACHE_KEY;
  proxy_cache_valid      200  1y;
  proxy_cache_use_stale  error timeout invalid_header updating
                         http_500 http_502 http_503 http_504;
  proxy_cache_lock       on;
}

location ~ ^/v2/.*/.*/tags/list+$ {
  # get paginated list of tags
  content_by_lua_block {
    local location, tags, cjson = ngx.var.uri, {}, require "cjson"
    while true do
      local res = ngx.location.capture("/get_tags",
          { args = { req_uri = location } }
      )
      if res.status == ngx.HTTP_NOT_FOUND and table.getn(tags) == 0 then
         ngx.status = ngx.HTTP_NOT_FOUND
         ngx.print(res.body)
         ngx.exit(0)
      end
      local data = cjson.decode(res.body)
      for _,v in ipairs(data['tags']) do
        table.insert(tags, v)
      end
      if res.header["Link"] ~= nil then
        location = res.header["Link"]:match("/v2[^>]+")
      else
        ngx.print(cjson.encode{name = data['name'], tags = tags })
        ngx.exit(ngx.HTTP_OK)
      end
    end
  }
}

# Helper location for getting tags from upstream repository
# used for getting paginated tags.
location /get_tags {
  internal;
  set_unescape_uri      $req_uri $arg_req_uri;
  proxy_pass            UPSTREAM$req_uri;

  # Add AWS ECR authentication headers
  proxy_set_header  X-Real-IP          $remote_addr;
  proxy_set_header  X-Forwarded-For    $remote_addr;
  proxy_set_header  X-Forwarded-User   "Basic $http_authorization";
  proxy_set_header  Authorization      "Basic $http_authorization";
  proxy_set_header  X-Forwarded-Proto  $scheme;

}

  }
}

nomatterz commented 2 years ago

@julienbonastre

How did you manage to make things work using ALB only?

For now I've managed to make it work using nginx (just nginx installed on a public EC2 instance). But if I use an ALB with a redirect rule I get Error: error logging into "ecr.example.com": invalid username/password while trying to log in.

@julienbonastre if you already have an ALB set up, you should be able to edit its listener rule to have the default action redirect to the ecr address, without needing the additional nginx box

Um, ok.. @jmchuster , Yes.. I can.. WTAH... I definitely recall trying this originally and obviously correcting the passed Host header to the target ECR FQDN and for some reason it didn't seem to be happy...

However I just attempted it again, and yes, it is working fine for auth/pull/push....

This is clearly a much better approach and less infrastructure required! I'm confused now as to why this didn't work for me initially or what pushed me down the direction of using nginx to do the Host header rewrite......... :scratches head:

Anyway. Awesome! I will refactor now and make this even cleaner!

julienbonastre commented 2 years ago


Correct @nomatterz, ALB alone will not perform the necessary Host header rewrite, which is what returned me to the nginx proxy solution...

It works beautifully 😎🚀🙌

nomatterz commented 2 years ago

Thank you for the prompt reply @julienbonastre! So there is no way to do this with an ALB only... I thought you had managed to do it somehow with the ALB as the single proxy.

sumanthkumarc commented 2 years ago

+1 this would definitely help while migrating from custom registries to ECR.

NandGates commented 1 year ago

Okay everybody, I have published my Terraform/Lambda@Edge function at naftulikay/terraform-aws-private-ecr-domain. I would appreciate any and all help from the community toward arriving at something that works. docker login and docker pull are working, docker push fails for some reason, and I can't find the failed request in the CloudWatch logs.

If anyone can see and report any issues in my CloudFront configuration in Terraform, I would really appreciate it 🙏

I'd love to see us arrive at a serverless solution for using any custom domain name with your private ECR registry, and we're really close, just need some CloudFront expertise and some debugging to figure out why docker push isn't working.

@naftulikay I just used your module today (with some very minor, insignificant updates) and it worked perfectly - including pushing! You should try again; no idea what AWS may have changed, but I can confirm my team and I can push using the friendly URL now!!

naftulikay commented 1 year ago

@NandGates oh my goodness, I'll have to try it out now! Thank you for reporting it working!

n1ngu commented 1 year ago

Worth reading https://httptoolkit.com/blog/docker-image-registry-facade/

TL;DR, redirecting (307) from your.domain.tld to any registry seems to solve docker pull usage without the need to reverse-proxy the registry.

Via https://github.com/docker/hub-feedback/issues/2314#issuecomment-1473876653, thanks @pimterry for sharing.

amancevice commented 1 year ago

I just tested & published a terraform module that seems to do the trick using the 307-redirect method above.

I also included a bash script wrapper that can be used as a docker credential helper to authenticate with ECR without needing to do a docker login.
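
For anyone unfamiliar with it, the credential-helper protocol Docker uses is tiny: the helper is an executable named docker-credential-<something>, Docker invokes it with get and writes the registry hostname to stdin, and the helper answers with JSON containing a username and secret. A hypothetical equivalent of my bash wrapper, sketched in TypeScript (the real script in the module differs; the AWS_REGION handling here is an assumption):

// Hypothetical docker-credential-ecr-custom: answers Docker's "get" requests
// for a custom ECR domain with a freshly minted ECR token.
import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";

if (process.argv[2] !== "get") {
  // "store" and "erase" are no-ops for a token-based registry like ECR.
  process.exit(0);
}

// Docker writes the registry hostname it wants credentials for to stdin.
const serverUrl = readFileSync(0, "utf8").trim();
const region = process.env.AWS_REGION ?? "us-east-1"; // assumption: region known out of band

// ECR logins always use the literal username "AWS" plus a short-lived token.
const token = execFileSync("aws", ["ecr", "get-login-password", "--region", region], {
  encoding: "utf8",
}).trim();

process.stdout.write(JSON.stringify({ ServerURL: serverUrl, Username: "AWS", Secret: token }));

With something like that on your PATH and "credHelpers": { "your.domain.tld": "ecr-custom" } in ~/.docker/config.json, docker push/pull against the custom domain never needs an explicit docker login.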

docker push seems to work!

edlevin6612 commented 1 year ago

Thank you @amancevice! I successfully factored your module into my Terraform and I am able to use a custom ECR domain from multiple regions via DNS latency records.

Now the other issue I have is: how could Kubernetes/EKS authenticate using a custom domain? Based on how k8s authentication to ECR is handled, there is domain parsing that looks at various parts of the hostname (e.g. the region), which makes sense because you need the region to get the token.

So I would have to treat ECR as a third-party private registry, but credentials expire every 12 hours, so they would need to be kept refreshed, and this becomes more trouble than it's worth. I guess I will +1 the FR for native ECR custom domain support with our AWS TAM. :(

wosiu commented 1 year ago

@amancevice do you think it could work with a static page served from S3, instead of a Lambda, for executing those redirects? And just configure this static page in Route 53?

amancevice commented 1 year ago

@wosiu I don't think you can return custom response codes from S3 like that so I don't think that would work. What's wrong with Lambda out of curiosity?

naftulikay commented 11 months ago

@NandGates I finally got around to testing and updating my module, which has gone through a big refactor and is now at version 0.4.4. I still can't push. What changes did you make to get things working? I'd love to update the module so that everyone can use it. I could incorporate the @amancevice solution, which is to build an API Gateway REST API which redirects every endpoint with a 307, but I'd like to see if it's possible within CloudFront.

devunt commented 9 months ago

Finally got it working with the following nginx.conf.

      events {}
      http {
        server {
          listen 80 default;
          chunked_transfer_encoding on; 
          client_max_body_size 0; 

          location / {
            proxy_set_header Host <account_id>.dkr.ecr.<region>.amazonaws.com;
            proxy_set_header X-Forwarded-Host $host;
            proxy_pass https://<account_id>.dkr.ecr.<region>.amazonaws.com;
          }
        }
      }

Push and pull from the custom domain work like a charm.

naftulikay commented 9 months ago

FYI everyone, I have published a Rust Docker image for Lambda for both amd64 and arm64 that does the rewrite as quickly as possible 😄 Additionally, I've done the same thing with a CloudFlare worker, again in Rust. Hope this helps people!

Docker Image for Lambda

The GitHub repository is naftulikay/lambda-ecr-rewrite and the Docker images are available from GitHub's image registry.

These are published as separate images because, as of right now, Lambda breaks if you try to use a multi-platform Docker image. This was true at least as of a month or two ago when I tried it.

Set the ECR_REGISTRY_HOST environment variable on the Lambda function to the hostname of your ECR registry, e.g. 123456789012.dkr.ecr.us-east-1.amazonaws.com.

These images are meant to be used as a catch-all API Gateway integration for the host you'd like to alias. Essentially, just set up an API Gateway and send all requests to a Lambda function running this Docker image. It simply returns a 307 redirect for all paths, so the client ends up talking to ECR directly instead of proxying the data through API Gateway. This ensures you don't pay for data transfer twice.
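
If you would rather not pull the prebuilt image, the behaviour it implements is roughly the following (a TypeScript sketch assuming an API Gateway HTTP API with the v2 payload format; the published Rust image may differ in details):

import type { APIGatewayProxyEventV2, APIGatewayProxyStructuredResultV2 } from "aws-lambda";

// e.g. 123456789012.dkr.ecr.us-east-1.amazonaws.com
const ECR_REGISTRY_HOST = process.env.ECR_REGISTRY_HOST ?? "";

export const handler = async (
  event: APIGatewayProxyEventV2,
): Promise<APIGatewayProxyStructuredResultV2> => {
  // Reassemble the original path and query string, then bounce the client
  // straight to the real registry; 307 preserves the method and body.
  const query = event.rawQueryString ? `?${event.rawQueryString}` : "";
  return {
    statusCode: 307,
    headers: { Location: `https://${ECR_REGISTRY_HOST}${event.rawPath}${query}` },
  };
};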

CloudFlare Worker

Additionally, if you're using CloudFlare, I wrote a simple Rust CloudFlare worker which does the same thing. Sharing this isn't easy, but here is the actual source code:

use std::sync::OnceLock;
use worker::*;

const DEFAULT_ACCOUNT_ID: &str = "<REPLACE_WITH_ACCOUNT_ID>";
const DEFAULT_REGISTRY_REGION: &str = "<REPLACE_WITH_ECR_REGION>";

const ECR_REGISTRY_HOSTNAME_ENV_VAR: &str = "ECR_REGISTRY_HOSTNAME";

static ECR_REGISTRY_HOSTNAME: OnceLock<String> = OnceLock::new();

#[event(fetch)]
async fn main(req: Request, env: Env, _ctx: Context) -> Result<Response> {
    Response::redirect_with_status(
        create_redirect_url(
            &req.url().expect("unable to parse incoming request url"),
            env.var(ECR_REGISTRY_HOSTNAME_ENV_VAR).ok(),
        ),
        307,
    )
}
fn create_redirect_url(url: &Url, var: Option<Var>) -> Url {
    let mut u = url.clone();
    u.set_scheme("https").expect("unable to set url scheme");
    u.set_host(Some(get_ecr_registry_hostname(var)))
        .expect("unable to set url host");
    u
}

fn get_ecr_registry_hostname(var: Option<Var>) -> &'static str {
    ECR_REGISTRY_HOSTNAME.get_or_init(|| match var.map(|v| v.to_string()) {
        Some(v) => v,
        None => format!("{DEFAULT_ACCOUNT_ID}.dkr.ecr.{DEFAULT_REGISTRY_REGION}.amazonaws.com"),
    })
}

In your deployment, set the ECR_REGISTRY_HOSTNAME environment variable on the worker to the FQDN of your ECR registry, e.g. 123456789012.dkr.ecr.us-east-1.amazonaws.com.

Similarly to the Lambda, this just returns a 307 redirect on all paths.

relistan commented 8 months ago

This is still a non-starter for ECR for me. Because of how Docker images are tagged and used, we have to have a fixed domain we use for our images, not one that will change. The ergonomics of the ECR hostnames are also appalling. Please let us use Route53-hosted custom domains.

julienbonastre commented 8 months ago

Finally got it working with the following nginx.conf. [...]

Push and pull from the custom domain work like a charm.

Yep.. right on @devunt .. we've had this working like a dream for a little while now 😁 #nginxFTW

https://github.com/aws/containers-roadmap/issues/299#issuecomment-1005667775

jessefarinacci commented 8 months ago

we've had this working like a dream for a little while now 😁 #nginxFTW

Hi julienbonastre, thank you for that code block. Can you demonstrate the docker login type logic with the vanity DNS name though? Are you leveraging docker-credential-ecr-login? I expect it may be unnecessary with this long lived server process solution.

I have a serverless solution (Cloudfront + ACM -> API Gateway -> VTI req/res munging to redirect) that works nicely and cost efficiently once logged in, but can't seem to handle that login piece.. (fallback to login to account-region-dkr hostname) :)

amancevice commented 8 months ago

@jessefarinacci check out my terraform module here. The linked section in the README talks through how I use the docker credential plugin feature. You can review the shell script in the bin dir as well. You shouldn't have to use my backend solution for the docker credential helper side of things to help with local authentication.

julienbonastre commented 8 months ago

we've had this working like a dream for a little while now 😁 #nginxFTW

Hi julienbonastre, thank you for that code block. Can you demonstrate the docker login type logic with the vanity DNS name though? Are you leveraging docker-credential-ecr-login? I expect it may be unnecessary with this long lived server process solution.

I have a serverless solution (Cloudfront + ACM -> API Gateway -> VTI req/res munging to redirect) that works nicely and cost efficiently once logged in, but can't seem to handle that login piece.. (fallback to login to account-region-dkr hostname) :)

Yes, of course @jessefarinacci : https://github.com/aws/containers-roadmap/issues/299#issuecomment-947148191

jampy commented 4 months ago

Worth reading https://httptoolkit.com/blog/docker-image-registry-facade/

TL;DR, redirecting (307) from your.domain.tld to any registry seems to solve docker pull usage without the need to reverse-proxy the registry.

Does push (and all other commands) work with this solution?

benjimin commented 3 months ago

So the current options seem to be:

  1. run an nginx (or similar) reverse proxy that rewrites the Host header, whether on EC2, on Fargate behind an ALB, or inside the cluster;
  2. put CloudFront in front of ECR with a Lambda@Edge function performing the same Host rewrite; or
  3. return 307 redirects to the real registry hostname from something cheap such as API Gateway plus Lambda or a CloudFlare Worker.

Also, most of these are still complicated by the need for an ECR-specific helper process to assume an IAM role and generate short-lived docker credentials for the client to use. (AWS provides a credential helper application, but it doesn't directly support these use-cases.)

webertrlz commented 2 months ago

+11111111111111111111111111