SUSE / Portus

Authorization service and frontend for Docker registry (v2)
http://port.us.org/
Apache License 2.0

Portus Background Process gets 404 on Image Tag, but doesn't delete it from Portus #2281

Closed diranged closed 4 years ago

diranged commented 4 years ago

Description

We recently upgraded to Portus 2.4.3 with Docker Registry 2.7.1. We have ~100k old image tags we need to purge, and we've discovered that, due to how the Registry handles DELETE calls, it's wildly inefficient to delete the images that way.
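
For reference, the per-tag path we're abandoning looks roughly like this against the Registry v2 API (a sketch: hostnames and names are placeholders, authentication headers are omitted, and it assumes the registry's delete feature is enabled):

    REPO=XXX/XXX
    TAG=phab-30677-92977

    # 1) Resolve the tag to its manifest digest (Docker-Content-Digest header):
    DIGEST=$(curl -sI \
      -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
      "https://hub.XXX/v2/${REPO}/manifests/${TAG}" \
      | awk 'tolower($1) == "docker-content-digest:" {print $2}' | tr -d '\r')

    # 2) Delete the manifest by digest; the registry answers 202 Accepted, and
    #    the actual blobs are only reclaimed by a later garbage-collect run:
    curl -X DELETE "https://hub.XXX/v2/${REPO}/manifests/${DIGEST}"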

Instead, we want to follow "method 2" in https://medium.com/better-programming/cleanup-your-docker-registry-ef0527673e3a. So we ran a test.

We manually deleted ~5000 old image tags matching the name phab-XXXX-XXXXX from our S3 bucket and then let the Portus background process try to do its thing. Unfortunately, it seems the process discovers that each tag is gone, but it makes no calls back to Portus to delete them from the DB:

Steps to reproduce

  1. Delete the tag/manifest from S3
    aws s3 rm s3://XXX/prod_v2/docker/registry/v2/repositories/${REPO}/${NAME}/_manifests/revisions/sha256/${DIGEST}/link
    aws s3 rm --recursive s3://XXX/prod_v2/docker/registry/v2/${REPO}/_manifests/tags/${TAG_NAME}/
  2. Wait
  3. In our logs, every hour we see the following set of log messages:
    date,Syslog Program,message
    2020-02-03T16:44:32.561Z,,tag: phab-30677-92977
    2020-02-03T16:44:32.561Z,,Error on request: get https://hub.XXX/v2/XXX/XXX/manifests/phab-30677-92977
    2020-02-03T16:44:32.561Z,docker-portus-background,tag: phab-30677-92977
    2020-02-03T16:44:32.561Z,docker-portus-background,Error on request: get https://hub.XXX/v2/XXX/XXX/manifests/phab-30677-92977
    2020-02-03T16:44:32.561Z,docker-portus-background,[catalog] Could not fetch manifest for 'XXX/XXX' with tag 'phab-30677-92977':
    2020-02-03T16:44:32.560Z,,[catalog] Could not fetch manifest for 'XXX/XXX' with tag 'phab-30677-92977':
    2020-02-03T16:44:32.560Z,,"52.21.41.87 - - [03/Feb/2020:16:44:32 +0000] ""GET /v2/XXX/XXX/manifests/phab-30677-92977 HTTP/1.1"" 404 198 ""-"" ""Ruby"""
    2020-02-03T16:44:32.560Z,,"172.18.0.5 - - [03/Feb/2020:16:44:32 +0000] ""GET /v2/XXX/XXX/manifests/phab-30677-92977 HTTP/1.0"" 404 198 """" ""Ruby"""
    2020-02-03T16:44:32.560Z,docker-nginx,"52.21.41.87 - - [03/Feb/2020:16:44:32 +0000] ""GET /v2/XXX/XXX/manifests/phab-30677-92977 HTTP/1.1"" 404 198 ""-"" ""Ruby"""
    2020-02-03T16:44:32.559Z,docker-registry,"172.18.0.5 - - [03/Feb/2020:16:44:32 +0000] ""GET /v2/XXX/XXX/manifests/phab-30677-92977 HTTP/1.0"" 404 198 """" ""Ruby"""
    2020-02-03T16:44:32.541Z,,"s3aws.GetContent(""/docker/registry/v2/repositories/XXX/XXX/_manifests/tags/phab-30677-92977/current/link"")"
    2020-02-03T16:44:32.541Z,docker-registry,"s3aws.GetContent(""/docker/registry/v2/repositories/XXX/XXX/_manifests/tags/phab-30677-92977/current/link"")"
  • Expected behavior: upon getting a 404 for a specific image/tag combo, I would have expected the portus-background process to write to the DB (or issue a web call to the Portus endpoint) to remove the tag from the database. To be clear, we have set the sync strategy to update-delete.
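
As a sanity check, the 404'd tag can be confirmed to still have a row in the Portus database. A hypothetical query; the db container name and the tags table/column names are assumptions about Portus's schema:

    # Hypothetical check that the 404'd tag still has a row in Portus's DB
    # (the "db" container name and table/column names are assumptions):
    docker exec -it db mysql -u portus -p portus \
      -e "SELECT id, name, repository_id FROM tags WHERE name = 'phab-30677-92977';"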

Deployment information

Deployment method: We're using the Docker Compose example, but managing it with Puppet. Overall, it's portus, portus-background, registry, and redis containers on a single host, with an nginx container acting as a reverse proxy.

Configuration:


Here is the portus:info output:

[Mailer config] Host:     hub.XXX
[Mailer config] Protocol: https://
  User Exists (0.3ms)  SELECT  1 AS one FROM `users` WHERE `users`.`username` = 'portus' LIMIT 1
  User Load (0.3ms)  SELECT  `users`.* FROM `users` WHERE `users`.`username` = 'portus' LIMIT 1
   (1.1ms)  BEGIN
  SQL (0.4ms)  UPDATE `users` SET `encrypted_password` = 'XXXXX', `updated_at` = '2020-02-03 17:17:04' WHERE `users`.`id` = 1
   (2.1ms)  COMMIT
  User Exists (0.3ms)  SELECT  1 AS one FROM `users` WHERE `users`.`username` = 'portus' LIMIT 1
   (0.1ms)  SELECT COUNT(*) FROM `registries`
Evaluated configuration:
---
email:
  from: portus@example.com
  name: Portus
  reply_to: ''
  smtp:
    enabled: false
    address: smtp.example.com
    port: 587
    domain: example.com
    ssl_tls: ''
    enable_starttls_auto: false
    openssl_verify_mode: none
    ca_path: ''
    ca_file: ''
    user_name: ''
    password: "****"
    authentication: login
gravatar:
  enabled: true
delete:
  enabled: true
  contributors: false
  garbage_collector:
    enabled: false
    older_than: 30
    tag: ''
ldap:
  enabled: true
  hostname: XXX
  port: 636
  timeout: 5
  encryption:
    method: simple_tls
    options:
      ca_file: ''
      ssl_version: TLSv1_2
  base: dc=XXX
  filter: "(memberOf=cn=XXX)"
  uid: uid
  authentication:
    enabled: true
    bind_dn: cn=puppet,dc=XXX
    password: "****"
  guess_email:
    enabled: true
    attr: mail
oauth:
  local_login:
    enabled: true
  google_oauth2:
    enabled: false
    id: ''
    secret: ''
    domain: ''
    options:
      hd: ''
  open_id:
    enabled: false
    identifier: ''
    domain: ''
  github:
    enabled: false
    client_id: ''
    client_secret: ''
    organization: ''
    team: ''
    domain: ''
  gitlab:
    enabled: false
    application_id: ''
    secret: ''
    group: ''
    domain: ''
    server: ''
  bitbucket:
    enabled: false
    key: ''
    secret: ''
    domain: ''
    options:
      team: ''
first_user_admin:
  enabled: true
signup:
  enabled: false
check_ssl_usage:
  enabled: true
registry:
  jwt_expiration_time:
    value: 15
  catalog_page:
    value: 100
  timeout:
    value: 2
  read_timeout:
    value: 120
machine_fqdn:
  value: hub.XXX
display_name:
  enabled: false
user_permission:
  change_visibility:
    enabled: true
  create_team:
    enabled: false
  manage_team:
    enabled: true
  create_namespace:
    enabled: false
  manage_namespace:
    enabled: true
  create_webhook:
    enabled: true
  manage_webhook:
    enabled: true
  push_images:
    policy: allow-teams
security:
  clair:
    server: ''
    health_port: 6061
    timeout: 900
  zypper:
    server: ''
  dummy:
    server: ''
anonymous_browsing:
  enabled: false
background:
  registry:
    enabled: true
  sync:
    enabled: true
    strategy: update-delete
pagination:
  limit: 10
  per_page: 10
  before_after: 2

Portus version: 2.4.3@5a616c0ef860567df5700708256f42505cdb9952

Here is the env:

# docker exec portus-background env 
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=72aed73ad2fc
CCONFIG_PREFIX=PORTUS
PORTUS_DB_HOST=XXX
PORTUS_DB_PASSWORD=XXX
PORTUS_DB_USERNAME=portus
PORTUS_DB_DATABASE=portus
PORTUS_DB_POOL=10
PORTUS_PASSWORD=XXX
PORTUS_SECRET_KEY_BASE=XXX
PORTUS_KEY_PATH=/certificates/portus.key
PORTUS_PUMA_TLS_CERT=/certificates/portus.crt
PORTUS_PUMA_TLS_KEY=/certificates/portus.key
PORTUS_MACHINE_FQDN_VALUE=hub.XXX
PORTUS_LDAP_AUTHENTICATION_BIND_DN=XXX
PORTUS_LDAP_AUTHENTICATION_PASSWORD=XXX
PORTUS_LDAP_BASE=XXX
PORTUS_LDAP_HOSTNAME=XXX
PORTUS_LDAP_PORT=636
PORTUS_LDAP_AUTHENTICATION_ENABLED=true
PORTUS_LDAP_ENABLED=true
PORTUS_LDAP_FILTER=(memberOf=XXX)
PORTUS_LDAP_GUESS_EMAIL_ATTR=mail
PORTUS_LDAP_GUESS_EMAIL_ENABLED=true
PORTUS_LDAP_ENCRYPTION_METHOD=simple_tls
PORTUS_LDAP_UID=uid
RAILS_SERVE_STATIC_FILES=false
PORTUS_ANONYMOUS_BROWSING_ENABLED=false
PORTUS_DELETE_ENABLED=true
PORTUS_GRAVATAR_ENABLED=true
PORTUS_SIGNUP_ENABLED=false
PORTUS_SMTP_ENABLED=false
PORTUS_USER_PERMISSION_CREATE_NAMESPACE_ENABLED=false
PORTUS_USER_PERMISSION_CREATE_TEAM_ENABLED=false
PORTUS_BACKGROUND_SYNC_STRATEGY=update-delete
PORTUS_LOG_LEVEL=debug
PORTUS_BACKGROUND=true
HOME=/root
Jean-Baptiste-Lasselle commented 4 years ago

You're in luck, @diranged. Read this, buddy ;) : https://github.com/SUSE/Portus/issues/2241#issuecomment-586653023

Set PORTUS_DELETE_ENABLED=true, and don't forget about issue templates :see_no_evil:

diranged commented 4 years ago

@Jean-Baptiste-Lasselle I am confused - we have PORTUS_DELETE_ENABLED=true in the environment for both of our portus containers..

Jean-Baptiste-Lasselle commented 4 years ago

@Jean-Baptiste-Lasselle I am confused - we have PORTUS_DELETE_ENABLED=true in the environment for both of our portus containers..

@diranged true, my mistake. I thought it didn't appear in the output of docker exec portus-background env; I just saw it.

But I have an idea.

I believe the explanation might be the following, and I get that from this issue:

# That's my docker-compose.yml.
# Do the same in your config file for both the background and foreground Portus services.
# [...]
  background: 
    environment:
    # Sync config
      - PORTUS_BACKGROUND_REGISTRY_ENABLED=true
      - PORTUS_BACKGROUND_SYNC_ENABLED=true
      - PORTUS_BACKGROUND_SYNC_STRATEGY=update-delete
      # What if you set it to `update` only? I'm curious; run the test and feed back what happens...
      # - PORTUS_BACKGROUND_SYNC_STRATEGY=update
diranged commented 4 years ago

While I am testing a few things right now, I'll note that the background_sync_enabled setting is enabled by default; you can see that in the portus:info output.

Jean-Baptiste-Lasselle commented 4 years ago


Hi @diranged, thank you so much for your feedback about that too. I was indeed discussing, on another issue we crossed roads on, whether the SYNC feature is activated by default or not. So it is, and what I said in that issue is even more true; I'll go back to that issue and pass this along to our colleague.

Indeed, I checked http://port.us.org/docs/Configuring-Portus.html#background-process :

  • The sync task can be disabled as well

It can be disabled, so it is enabled by default. Just to say: the documentation keeps leaving us dangling like that; we are reduced to guesswork...

Jean-Baptiste-Lasselle commented 4 years ago

aws s3 rm s3://XXX/prod_v2/docker/registry/v2/repositories/${REPO}/${NAME}/_manifests/revisions/sha256/${DIGEST}/link
aws s3 rm --recursive s3://XXX/prod_v2/docker/registry/v2/${REPO}/_manifests/tags/${TAG_NAME}/

Hi @diranged, today I have one new, precise question, to take one more step forward here:

I'm asking because that's exactly what the SYNC does in the background: it calls the Catalog API v2 and compares the results to the Portus database, then it operates the sync based on that comparison.

Also asking, because:

2020-02-03T16:44:32.561Z,docker-portus-background,[catalog] Could not fetch manifest for 'XXX/XXX' with tag 'phab-30677-92977':
2020-02-03T16:44:32.560Z,,[catalog] Could not fetch manifest for 'XXX/XXX' with tag 'phab-30677-92977':
2020-02-03T16:44:32.560Z,,"52.21.41.87 - - [03/Feb/2020:16:44:32 +0000] ""GET /v2/XXX/XXX/manifests/phab-30677-92977 HTTP/1.1"" 404 198 ""-"" ""Ruby"""
2020-02-03T16:44:32.560Z,,"172.18.0.5 - - [03/Feb/2020:16:44:32 +0000] ""GET /v2/XXX/XXX/manifests/phab-30677-92977 HTTP/1.0"" 404 198 """" ""Ruby"""

Just in case the test is not too heavy to run; I'll do it at some point anyway. Maybe you already ran it.

Nevertheless, your case is extremely important in my eyes: 5000 images.

And, to just copy/paste:

export YOUR_REGISTRY_NETNAME=registry.youprivateinfra.io
export PORTUS_TOKEN=deTcV_afbsqsds566d65565654dfseef5463a
# List all repositories (images) :
curl -i -H "Authorization: Bearer $PORTUS_TOKEN" https://${YOUR_REGISTRY_NETNAME}/v2/_catalog

# List all tags for a repository:
export SOME_REPOSITORY=ubuntu
curl -X GET -H "Authorization: Bearer $PORTUS_TOKEN" https://${YOUR_REGISTRY_NETNAME}/v2/${SOME_REPOSITORY}/tags/list
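
And to go one step further, you can replay by hand the per-tag manifest fetch that the sync performs (the call that returns 404 in your logs); the tag name below is just an example taken from your logs:

# Replay the manifest fetch the sync performs for each tag; a 404 status here
# is exactly what the background process logs for a deleted tag:
export SOME_TAG=phab-30677-92977
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $PORTUS_TOKEN" \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  "https://${YOUR_REGISTRY_NETNAME}/v2/${SOME_REPOSITORY}/manifests/${SOME_TAG}"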
diranged commented 4 years ago

So ultimately we had to go another route: we had ~160k image tags we needed to purge, and it just did not seem possible to do it in an automated and clean way. We created a new bucket, migrated the images we wanted into the new bucket, and then reconfigured Portus. Once our underlying bucket was small again (80TB -> 1TB, ~160k tags -> ~1k tags), the automatic background sync process worked fine.

It's my guess that the background sync took so long, and was all done in a single DB transaction, that it ultimately failed and just kept re-running over and over. I have no evidence of this, but when I look at the code it strikes me as silly that most of the background jobs run entirely in a single transaction. That doesn't scale well on larger installations.
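
To illustrate the point: if the cleanup were chunked into independent transactions, one failure would only roll back a small slice instead of the whole run. A purely hypothetical, untested sketch (the container name and the Tag model usage are assumptions):

    # Hypothetical batched cleanup via the Rails runner: each batch commits on
    # its own, so one failure no longer rolls back the entire ~160k-tag sync
    # (the "portus" container name and Tag model usage are assumptions):
    docker exec portus bundle exec rails runner '
      Tag.where("name LIKE ?", "phab-%").in_batches(of: 500) do |batch|
        Tag.transaction { batch.each(&:destroy) }
      end
    '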

Jean-Baptiste-Lasselle commented 4 years ago


Oh my gosh, thank you so much for this feedback, @diranged, it's awesome!

I totally understand about the transaction, and now I have so many ideas to improve the situation with Portus!

And yeah, the solutions you thought of make so much sense. Plus, I mean, either you're hacking my personal computer or we are almost mind-synced... I wrote this yesterday:

https://github.com/SUSE/Portus/issues/2241#issuecomment-592317456

Just yesterday, yeah; check out the date-times.

I can't put it into words yet, but there's something going on in this field...

Jean-Baptiste-Lasselle commented 4 years ago

@diranged And yeah, so your 404 was the transaction failing... wasn't it?

Jean-Baptiste-Lasselle commented 4 years ago

Hi @diranged, I have some news; I wanted to share with you what I shared with @robgiovanardi.

Jean-Baptiste-Lasselle commented 4 years ago

Successfully repaired opensuse/portus:2.5 OCI container image

Hi @diranged, this time it's done: I have a solution you can use. All the details are at https://github.com/SUSE/Portus/issues/2241#issuecomment-593370789

:)

I will read any feedback from you with interest

Jean-Baptiste-Lasselle commented 4 years ago

Hi @diranged, I have some news here that I absolutely want to share with you:

The HyperText Transfer Protocol (HTTP) 202 Accepted response status code indicates that the request has been received but not yet acted upon.

jbl@pegasusio:~/.buildfromsrc.portus$ docker push oci-registry.pegasusio.io/pokus/ipfs-testground:0.0.2
The push refers to repository [oci-registry.pegasusio.io/pokus/ipfs-testground]
42477113dc05: Layer already exists 
30b16eb7ac5e: Layer already exists 
ce8168f12337: Layer already exists 
0.0.2: digest: sha256:f18be685992e4e4a30bcea7d26a75e23e71fe0b38290af15cadf55b77ae724eb size: 953
jbl@pegasusio:~/.buildfromsrc.portus$ docker exec -it compose_registry_1 bash
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown
jbl@pegasusio:~/.buildfromsrc.portus$ docker exec -it compose_registry_1 sh
/ # bin/registry garbage-collect /etc/docker/registry/config.yml
pokus/ipfs-testground
pokus/rocket.chat

0 blobs marked, 12 blobs and 0 manifests eligible for deletion
blob eligible for deletion: sha256:dc65f448a2e2f2ea557e69ed9ac65aa8ac0e16f1bce68f90de910b4d5a2f1ba1
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/dc/dc65f448a2e2f2ea557e69ed9ac65aa8ac0e16f1bce68f90de910b4d5a2f1ba1  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:f18be685992e4e4a30bcea7d26a75e23e71fe0b38290af15cadf55b77ae724eb
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/f1/f18be685992e4e4a30bcea7d26a75e23e71fe0b38290af15cadf55b77ae724eb  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:3a705fad094c9483069d4036c01a0c49381cf5a855d675bb840061fbed10b71c
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/3a/3a705fad094c9483069d4036c01a0c49381cf5a855d675bb840061fbed10b71c  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:3c7454a13eb92a684220862a298cd4d970be862a4bf8671856e03e2fd3b21efe
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/3c/3c7454a13eb92a684220862a298cd4d970be862a4bf8671856e03e2fd3b21efe  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:4d412742cdb2a7107b586098588ce29082d0d5357bd43244a513cfd6f81cf67e
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/4d/4d412742cdb2a7107b586098588ce29082d0d5357bd43244a513cfd6f81cf67e  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:6c6d376e3536f6ab4064c5485bd2f563b4d090cfbb6e48ba7349373fbe3479c1
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/6c/6c6d376e3536f6ab4064c5485bd2f563b4d090cfbb6e48ba7349373fbe3479c1  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:a6a4baaaf9d474e54882cd067bfa96e894d6e037a9d0e9efe3f4ef7c83c773e5
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/a6/a6a4baaaf9d474e54882cd067bfa96e894d6e037a9d0e9efe3f4ef7c83c773e5  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:a8ae8a1bb8caea4adfc9e40f1873b018b04b858d6467be8ee7f9edf902ed02bc
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/a8/a8ae8a1bb8caea4adfc9e40f1873b018b04b858d6467be8ee7f9edf902ed02bc  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:0fdc1692cfe97373ceb9550adfd87071e4763c3e8221fa0784a891dbcb650b6e
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/0f/0fdc1692cfe97373ceb9550adfd87071e4763c3e8221fa0784a891dbcb650b6e  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:491c5bd642d1e04229bf41718ca391d820535c78544418d25b90f9ba836e5a3b
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/49/491c5bd642d1e04229bf41718ca391d820535c78544418d25b90f9ba836e5a3b  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:4d43c4a94fa978c4e4cd255436c2f47ede74aed95c0ecc5dd31d16727f91d09c
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/4d/4d43c4a94fa978c4e4cd255436c2f47ede74aed95c0ecc5dd31d16727f91d09c  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
blob eligible for deletion: sha256:66f50e765badb4dad6abd38d8a98e677090c59883ed086547d1a718d4a832e0a
INFO[0000] Deleting blob: /docker/registry/v2/blobs/sha256/66/66f50e765badb4dad6abd38d8a98e677090c59883ed086547d1a718d4a832e0a  go.version=go1.11.2 instance.id=5672e9f1-5147-4ee3-818e-9adb652161f3
/ # exit
jbl@pegasusio:~/.buildfromsrc.portus$ docker push oci-registry.pegasusio.io/pokus/ipfs-testground:0.0.2
The push refers to repository [oci-registry.pegasusio.io/pokus/ipfs-testground]
42477113dc05: Pushed 
30b16eb7ac5e: Pushed 
ce8168f12337: Pushed 
0.0.2: digest: sha256:f18be685992e4e4a30bcea7d26a75e23e71fe0b38290af15cadf55b77ae724eb size: 953

Last but not least: this will not solve the issue you had with the Portus background process, "because the delete list is so big that the whole process times out"... Unless... Unless...

Unless we can scale out the background process.

And this is why I am going, sooner than planned, to test making Portus scale out in Kubernetes. I prepared this for that purpose, if you want to try it (early access, still messy documentation, yet usable with a little patience): https://github.com/pegasus-io/a-k8s-demo/releases/tag/0.0.4

So basically, the test I carried out yesterday shows that even with Portus working 100 percent as designed, we still need to run the Docker Registry garbage collector.

This garbage collection is what you used in what you called "method 2", and:

This is extremely interesting.

I want to thank you again for pointing out the missing commit from master related to keep_latest on @robgiovanardi's issue; it is one of the things that gave me the courage to crack this all open.

Jean-Baptiste-Lasselle commented 4 years ago

So merry Easter

stale[bot] commented 4 years ago

Thanks for all your contributions! This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.