NERSC / shifter

Shifter - Linux Containers for HPC

Pulling images fails when image is bigger #242

Open woonghu opened 5 years ago

woonghu commented 5 years ago

Hi. I want to download some Docker images and convert them to Shifter images. Images smaller than about 4 GB pull successfully, but the pull fails when the image is larger than about 4 GB.

The client prints "PULLING" messages indefinitely and the pull never finishes, like this:

Message: {
  "ENTRY": "MISSING",
  "ENV": "MISSING",
  "WORKDIR": "MISSING",
  "groupACL": [],
  "id": "MISSING",
  "itype": "docker",
  "last_pull": 1549005472.860614,
  "status": "PULLING",
  "status_message": "Extracting Layers",
  "system": "mycluster",
  "tag": [],
  "userACL": []
}
2019-02-01T07:58:59 Pulling Image: docker:imagename:1.0.0-12, status: PULLING

Here are the access and error logs from the pull:

==> error.log <==
[2019-02-01 07:59:37 +0000] [51665] [DEBUG] Closing connection.
[2019-02-01 07:59:37 +0000] [51691] [DEBUG] POST /api/pull/mycluster/docker/si-swhong%3A1.0.0-12/
[2019-02-01 07:59:37 +0000] [51691] [DEBUG] pull system=mycluster imgtype=docker tag=si-swhong:1.0.0-12
[2019-02-01 07:59:37 +0000] [51691] [DEBUG] {'tag': u'si-swhong:1.0.0-12', 'itype': u'docker', 'system': u'mycluster'}
[2019-02-01 07:59:37 +0000] [51691] [DEBUG] {'magic': 'imagemngrmagic', 'uid': 0, 'system': u'mycluster', 'tokens': {u'soe-db1:5000': u'u:p', u'default': u'u:p'}, 'gid': 0, 'user': 'root', 'group': 'root'}
[2019-02-01 07:59:37 +0000] [51691] [DEBUG] Pull called Test Mode=0
[2019-02-01 07:59:37 +0000] [51691] [DEBUG] {u'status': u'PULLING', u'ostcount': u'0', u'itype': u'docker', u'format': u'squashfs', u'last_heartbeat': 1549005477.959195, u'os': u'linux', u'groupACL': [], u'system': u'mycluster', u'private': None, u'status_message': u'Extracting Layers', u'pulltag': u'si-swhong:1.0.0-12', u'replication': u'1', u'tag': [], u'userACL': [], u'location': u'', u'last_pull': 1549005472.860614, u'remotetype': u'dockerv2', u'_id': ObjectId('5c53f2a0227509ba7a533871'), u'arch': u'amd64'}

...

[2019-02-01 08:19:08 +0000] [51691] [DEBUG] Closing connection.
[2019-02-01 08:19:09 +0000] [51659] [CRITICAL] WORKER TIMEOUT (pid:51666)
[2019-02-01 08:19:09 +0000] [51666] [WARNING] 1
[2019-02-01 08:19:09 +0000] [51666] [ERROR] ERROR: dopull failed system=mycluster tag=si-swhong:1.0.0-12
[2019-02-01 08:19:09 +0000] [51666] [INFO] Worker exiting (pid: 51666)
[2019-02-01 08:19:09 +0000] [51685] [WARNING] Operation failed for 5c5400b6227509c9d2e26974
[2019-02-01 08:19:09 +0000] [51685] [INFO] Shutting down Status Thread
[2019-02-01 08:19:09 +0000] [51691] [DEBUG] POST /api/pull/mycluster/docker/si-swhong%3A1.0.0-12/
[2019-02-01 08:19:09 +0000] [51691] [DEBUG] pull system=mycluster imgtype=docker tag=si-swhong:1.0.0-12
[2019-02-01 08:19:09 +0000] [51691] [DEBUG] {'tag': u'si-swhong:1.0.0-12', 'itype': u'docker', 'system': u'mycluster'}
[2019-02-01 08:19:09 +0000] [51691] [DEBUG] {'magic': 'imagemngrmagic', 'uid': 0, 'system': u'mycluster', 'tokens': {u'soe-db1:5000': u'u:p', u'default': u'u:p'}, 'gid': 0, 'user': 'root', 'group': 'root'}
[2019-02-01 08:19:09 +0000] [51691] [DEBUG] Pull called Test Mode=0
[2019-02-01 08:19:09 +0000] [51691] [DEBUG] {u'status': u'FAILURE', u'ostcount': u'0', u'itype': u'docker', u'format': u'squashfs', u'last_heartbeat': 1549009149.245521, u'os': u'linux', u'groupACL': [], u'system': u'mycluster', u'private': None, u'status_message': u'FAILURE', u'pulltag': u'si-swhong:1.0.0-12', u'replication': u'1', u'tag': [], u'userACL': [], u'location': u'', u'last_pull': 1549009078.383216, u'remotetype': u'dockerv2', u'_id': ObjectId('5c5400b6227509c9d2e26974'), u'arch': u'amd64'}

==> access.log <==
127.0.0.1 - - [01/Feb/2019:07:59:37 +0000] "POST /api/pull/mycluster/docker/si-swhong%3A1.0.0-12/ HTTP/1.1" 200 290 "-" "-"

I tried upgrading gunicorn (to 19.9) and installing gevent (1.3.6), but neither helped.

ExecStart=/usr/bin/gunicorn \
    -b 0.0.0.0:6000 --backlog 2048 \
    --log-level=debug \
    --access-logfile=/var/log/shifter_imagegw/access.log \
    --log-file=/var/log/shifter_imagegw/error.log \
    --timeout 60 \
    --workers 4 \
    --threads 4 \
    --worker-class=gevent \
    shifter_imagegw.api:app

How can I resolve this problem?

scanon commented 5 years ago

Which version are you running? Specifically, is this the version that still uses celery? There is a timeout parameter that needs to be boosted.

woonghu commented 5 years ago

I’m using 18.03. Do you mean the ‘PullUpdateTimeout’ parameter? I set it to 3600, but it does not help.

scanon commented 5 years ago

I had to remind myself how I had fixed this before. The issue is with the gunicorn timeout. I just submitted a PR for this. But you can take a look at it and see what has to be changed.

https://github.com/NERSC/shifter/pull/243

You just need to modify the service script to add the -t 3600 option. That gives it an hour which should be enough even for very large images.
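For reference, the ExecStart line posted earlier in this thread already passes --timeout 60, so in practice that is the value to raise. The same settings can also be written as a gunicorn config file, which is plain Python. The sketch below is only an illustration of that alternative and is not what the PR does; the file name gunicorn.conf.py is an assumption, and every value other than the timeout is copied from the ExecStart line above.

```python
# gunicorn.conf.py (hypothetical file name; gunicorn config files are plain Python).
# All values mirror the ExecStart line posted in this thread, except `timeout`.

bind = "0.0.0.0:6000"
backlog = 2048
workers = 4
threads = 4
worker_class = "gevent"
loglevel = "debug"
accesslog = "/var/log/shifter_imagegw/access.log"
errorlog = "/var/log/shifter_imagegw/error.log"

# Workers silent for longer than this many seconds are killed and restarted,
# which is what the "WORKER TIMEOUT" line in error.log reports.
# Raised from 60 seconds to one hour.
timeout = 3600
```

With a file like this, the service would start gunicorn as: gunicorn -c gunicorn.conf.py shifter_imagegw.api:app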

woonghu commented 5 years ago

Thank you for the response.

I have another question.

Why does pulling and converting an image take so much longer than it does with Docker?

Is there a way to improve that?

Happy new year:)

scanon commented 5 years ago

I meant to reply to this sooner.

Shifter has to do the expansion and squash on each fresh pull. It does cache the layers, but it has to re-unzip each layer to build the squash image. I have noticed that the unzip for some layers can be very slow but have never been able to get to the bottom of it. I think it has something to do with the zip Python library we use and how we are using it.
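One way to narrow down a slow layer is to time the decompression on its own, without writing anything to disk. The sketch below is a diagnostic idea, not Shifter's code: the function name time_gunzip and the chunk size are made up for illustration, and it assumes the cached layer is a gzip-compressed file. If a layer decompresses quickly here but is slow during a pull, the time is more likely going into writing to the expand directory.

```python
import gzip
import time


def time_gunzip(layer_path, chunk_size=1 << 20):
    """Decompress a gzip file in memory and return (bytes_read, seconds).

    Hypothetical helper: nothing is written to disk, so the result
    reflects decompression cost only, not file-system speed.
    """
    start = time.perf_counter()
    total = 0
    with gzip.open(layer_path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Running it over each cached layer and sorting by the elapsed time shows which layers dominate.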

I would recommend using a fast file system for the temporary space where the API/worker runs. If it is a large memory node (> ~32 GB), you can even use /dev/shm for the expand directory. This can help to some degree.

Let me know if you need the exact parameter to adjust.
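Before pointing the expand directory at /dev/shm, it is worth checking that it has room, since everything written there lives in RAM. A minimal sketch, assuming a made-up helper name has_room and an arbitrary safety factor of 2x the compressed image size (the expanded layers are larger than the download, but the exact ratio depends on the image):

```python
import shutil


def has_room(directory, needed_bytes, headroom=2.0):
    """Return True if `directory` has at least headroom * needed_bytes free.

    Hypothetical helper; `headroom` is an assumed safety factor, not a
    value taken from Shifter.
    """
    usage = shutil.disk_usage(directory)
    return usage.free >= needed_bytes * headroom
```

For a 4 GB image, has_room("/dev/shm", 4 * 1024**3) tells you whether roughly 8 GB is free there.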

woonghu commented 5 years ago

Thank you Canon.

Your answer is very helpful for me.

Would you give me the exact parameter to adjust?

scanon commented 5 years ago

Look at this line in the example config file. You just need to change it to /dev/shm or some other location that is on fast local storage or RAM.

https://github.com/NERSC/shifter/blob/master/imagegw/imagemanager.json.example#L12
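If you prefer to script the change, the sketch below rewrites one key in a JSON config file. It assumes the setting on the linked line is a top-level key named ExpandDirectory; the thread does not name the key, so check the linked line and pass a different name if needed. The function name set_expand_directory is made up for illustration. Note that the file is rewritten by json.dump, so the original indentation and key layout are not preserved.

```python
import json


def set_expand_directory(config_path, new_dir, key="ExpandDirectory"):
    """Set `key` to `new_dir` in a JSON config file and return the new config.

    The default key name is an assumption; verify it against your
    imagemanager.json before use.
    """
    with open(config_path) as handle:
        config = json.load(handle)
    if key not in config:
        # Refuse to silently add a key the gateway may not read.
        raise KeyError("%s not found in %s" % (key, config_path))
    config[key] = new_dir
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
        handle.write("\n")
    return config
```

For example, set_expand_directory("imagemanager.json", "/dev/shm/") followed by a restart of the image gateway service.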