Closed willejs closed 6 years ago
@dsheets @samoht Ideas?
Could you direct us to something about why btrfs is necessary for this use case? We currently support aufs and overlay2 and we would like the user experience of Docker for Mac to abstract the graph driver decision entirely. Any information about why btrfs is necessary would be really helpful for us. Thanks!
Hi @dsheets, I'm using Concourse CI, which creates a scratch disk using btrfs when starting a worker container; if it can't, it falls back to tmpfs, which is slow. However, given btrfs is in the mainline kernel, and hailed as the future filesystem, would it not make sense to add it to the underlying image? Even a way to bake a custom image with a custom kernel would suffice... What are your thoughts?
I'm curious about why Concourse CI requires btrfs and falls back to tmpfs. Do you know (or could you find out) why Concourse has these particular requirements and fallback chain?
Not using Concourse here, but adding another reason to the discussion: aufs does not support disk size quota, right? (as of https://github.com/docker/docker/blob/7e29b33546098816d5dbc1fc429e868f02b69e44/docs/reference/commandline/run.md#set-storage-driver-options-per-container) - btrfs does.
@dsheets It's hazy. Basically, when you run a Concourse worker in Docker, it runs tasks in Docker containers in the worker, so it does Docker in Docker. When it does this, it creates btrfs filesystems inside the Docker container. I think it mounts them to the task containers, but I'm not sure. Anyway, if it can't, it falls back to VFS and creates a tmpfs filesystem, which is slow.
Anyway, that aside, why wouldn't you want to support BTRFS? Are there any plans to make this all open source?
Related Concourse issue https://github.com/concourse/concourse/issues/896
+1
I had to downgrade from 1.13.0 back to 1.12.x due to breaking change with btrfs affecting concourse.
@willejs Concourse Ci does not use docker in docker, in fact it does not use docker compose to orchestrate its internal containers. It uses cloudfoundry/garden. I'm planning to use docker in docker to run multiple containers in one concourse task (which can only use one garden container) to be able to do integration tests using various dependencies like selenium, mysql, nginx each in its own container using docker-compose like in my dev/stage/prod environments.
+1 I also had to downgrade to docker-toolbox because of this issue. I don't know if docker/for-mac should support btrfs or concourse should use other storage drivers.
@dsheets Concourse uses btrfs because it nests trivially, allowing our docker-image resource to simply spin up Docker, have it use its btrfs driver, and fetch images with the Docker CLI. If we were to use aufs or overlay the resource would have to use a loopback device to make a local system image, as neither of those nest. This is costly as there can be many docker-image resources, and loopback devices are a global system resource, that can outlive their container if we're not careful.
@dsheets sorry to ping on this one, but it's forcing me to stay on 1.12.x and that's quickly going to become a pain point as pressure to upgrade increases
is there an official stance on why btrfs support is gone? I'm surprised considering how glowingly positive the official docker article and release notes are for it
@doubledgedboard As far as I know, Docker for Mac has never supported btrfs. What is the breaking change from 1.12.x to 1.13.x?
We haven't enabled btrfs because it slows boot by an unacceptably long time for a feature that is typically unused.
@dsheets so what is the change that caused https://github.com/concourse/concourse/issues/896 to break from 1.12.x to 1.13.x?
The concourse team is saying it's a btrfs issue with docker for mac.
@eedwardsdisco I don't know what change caused the regression. Could you please post a step-by-step reproduction with any required configuration files here so we can investigate or bisect the issue? We are not familiar with Concourse so a sequence of steps to go from a fresh macOS install to either success (under 1.12.x) or failure (under 1.13.x) would greatly speed our work. Thanks!
@dsheets Try this docker-compose in docker/for-mac:
concourse-db:
  image: postgres:9.5
  environment:
    POSTGRES_DB: concourse
    POSTGRES_USER: concourse
    POSTGRES_PASSWORD: changeme
    PGDATA: /database
concourse-web:
  image: concourse/concourse
  links: [concourse-db]
  command: web
  ports: ["8080:8080"]
  volumes: ["./keys/web:/concourse-keys"]
  environment:
    CONCOURSE_BASIC_AUTH_USERNAME: concourse
    CONCOURSE_BASIC_AUTH_PASSWORD: changeme
    CONCOURSE_EXTERNAL_URL: http://ci.example.app:8080
    CONCOURSE_POSTGRES_DATA_SOURCE: |-
      postgres://concourse:changeme@concourse-db:5432/concourse?sslmode=disable
concourse-worker:
  image: concourse/concourse
  privileged: true
  links: [concourse-web]
  command: worker
  volumes: ["./keys/worker:/concourse-keys"]
  environment:
    CONCOURSE_TSA_HOST: concourse-web
Use this docker-compose script to bring up concourse in docker for mac and then try to run this (or any) simple pipeline.
groups:
- name: develop
  jobs:
  - navi
resources:
- name: every-1m
  type: time
  source: {interval: 1m}
jobs:
- name: navi
  plan:
  - get: every-1m
    trigger: true
  - task: annoy
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: ubuntu}
      run:
        path: echo
        args: ["Hey! Listen!"]
@dsheets you will also need the fly-cli to login and register the pipeline https://concourse.ci/fly-cli.html
@berisberis Ok, I run docker-compose up with the compose file and get
concourse-web_1 | failed to load authorized keys: open : no such file or directory
concoursebug_concourse-web_1 exited with code 1
I'm not sure what to do with your second file. Where do I save it and with what file name? Do I need to install software on the host? Which software exactly (version)? How do I run the pipeline?
@dsheets you also need the folders ./keys/web and ./keys/worker in the same path as the docker-compose file.
@dsheets also... for the second file you can name it whatever you want .yml that is the name you will use when registering the pipeline with the fly-cli
When you have the fly-cli run this to login:
fly -t concourse login -c http://ci.example.app:8080
Use the username and password from the docker-compose file.
then use this to register the pipeline:
fly sp -t concourse -c ~/path/to/your/pipeline.yml -p MyPipeline
@berisberis I have created keys/web and I still get the error above. I downloaded the fly CLI binary 2.7.0 from https://concourse.ci/downloads.html but I'm not sure if I need the web container running before testing the system. I don't understand which steps must be done in order to observe the failure and what state is present after they are done. It would be very helpful to have a list of exactly the steps needed to reproduce the issue, preferably with as few steps as possible. Additionally, knowing the easiest way to reset the system (other than deleting everything related) would be helpful but isn't necessary. We don't know how to use Concourse or what its state model looks like and we unfortunately don't have time to learn how to use Concourse competently and then guess whether we are seeing the same failure you are seeing.
@dsheets
Hey David,
Here are some explicit steps (from http://concourse.ci/docker-repository.html)
Create docker-compose.yml (uses latest concourse binary)
concourse-db:
  image: postgres:9.5
  environment:
    POSTGRES_DB: concourse
    POSTGRES_USER: concourse
    POSTGRES_PASSWORD: changeme
    PGDATA: /database
concourse-web:
  image: concourse/concourse
  links: [concourse-db]
  command: web
  ports: ["8080:8080"]
  volumes: ["./keys/web:/concourse-keys"]
  environment:
    CONCOURSE_BASIC_AUTH_USERNAME: concourse
    CONCOURSE_BASIC_AUTH_PASSWORD: changeme
    CONCOURSE_EXTERNAL_URL: "${CONCOURSE_EXTERNAL_URL}"
    CONCOURSE_POSTGRES_DATA_SOURCE: |-
      postgres://concourse:changeme@concourse-db:5432/concourse?sslmode=disable
concourse-worker:
  image: concourse/concourse
  privileged: true
  links: [concourse-web]
  command: worker
  volumes: ["./keys/worker:/concourse-keys"]
  environment:
    CONCOURSE_TSA_HOST: concourse-web
create keys
mkdir -p keys/web keys/worker
ssh-keygen -t rsa -f ./keys/web/tsa_host_key -N ''
ssh-keygen -t rsa -f ./keys/web/session_signing_key -N ''
ssh-keygen -t rsa -f ./keys/worker/worker_key -N ''
cp ./keys/worker/worker_key.pub ./keys/web/authorized_worker_keys
cp ./keys/web/tsa_host_key.pub ./keys/worker
create a host entry in /private/etc/hosts pointing 'concourse' to your current local interface IP (not loopback!)
192.168.1.10 concourse
export env var mapping external host to your custom local dns
export CONCOURSE_EXTERNAL_URL=http://concourse:8080
start the concourse stack (web\worker\coordinator)
docker-compose up
browse to the url and download the fly cli from the link on the page
http://concourse:8080
use the fly cli to create your login target (yes the password literally is 'changeme' as per above)
fly login --target=main --concourse-url=http://concourse:8080 --username=concourse --password=changeme --team-name=main
create navi-pipeline.yml pipeline file
resources:
- name: every-1m
  type: time
  source: {interval: 1m}
jobs:
- name: navi
  plan:
  - get: every-1m
    trigger: true
  - task: annoy
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: ubuntu}
      run:
        path: echo
        args: ["Hey! Listen!"]
upload pipeline to concourse
fly -t main set-pipeline -p hello-world -c navi-pipeline.yml
observe automatic (every minute) invocation of pipeline at the url
http://concourse:8080
destroy the stack (e.g. to then switch underlying docker versions...)
docker-compose down
repeat as needed
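Before starting the stack, it may help to confirm that the key files created in the steps above are all in place, since a missing key is a likely cause of the "failed to load authorized keys" error reported earlier in the thread. The following is a minimal sketch; the function name check_concourse_keys is hypothetical, and the file list is taken only from the ssh-keygen and cp commands above.

```shell
# Hypothetical preflight helper (not part of Concourse or Docker).
# Checks that every key file produced by the steps above exists under
# the given base directory (default: ./keys).
check_concourse_keys() {
    base="${1:-./keys}"
    missing=0
    for f in web/tsa_host_key web/session_signing_key web/authorized_worker_keys \
             worker/worker_key worker/tsa_host_key.pub; do
        if [ ! -f "$base/$f" ]; then
            # Report each missing file on stderr and keep checking the rest.
            echo "missing: $base/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 when all files exist, 1 otherwise.
    return "$missing"
}
```

For example, `check_concourse_keys ./keys && docker-compose up` would only bring up the stack when all five files are present.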
@eedwardsdisco Thanks! I did all of that under 1.13.1 and 1.12.6 and, as far as I could tell, the behavior was the same. The pipeline shows "pending" pulsing, "starting" pulsing, and eventually "failed" highlighted in the web UI. The logs of the worker show:
concourse-worker_1 | {"timestamp":"1488378340.510931969","source":"worker","message":"worker.baggageclaim.fs.run-command.failed","log_level":2,"data":{"args":["bash","-e","-x","-c","\n\t\tif [ ! -e $IMAGE_PATH ] || [ \"$(stat --printf=\"%s\" $IMAGE_PATH)\" != \"$SIZE_IN_BYTES\" ]; then\n\t\t\ttouch $IMAGE_PATH\n\t\t\ttruncate -s ${SIZE_IN_BYTES} $IMAGE_PATH\n\t\tfi\n\n\t\tlo=\"$(losetup -j $IMAGE_PATH | cut -d':' -f1)\"\n\t\tif [ -z \"$lo\" ]; then\n\t\t\tlo=\"$(losetup -f --show $IMAGE_PATH)\"\n\t\tfi\n\n\t\tif ! file $IMAGE_PATH | grep BTRFS; then\n\t\t\t/worker-state/2.7.0/linux/btrfs/mkfs.btrfs --nodiscard $IMAGE_PATH\n\t\tfi\n\n\t\tmkdir -p $MOUNT_PATH\n\n\t\tif ! mountpoint -q $MOUNT_PATH; then\n\t\t\tmount -t btrfs $lo $MOUNT_PATH\n\t\tfi\n\t"],"command":"/bin/bash","env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","MOUNT_PATH=/worker-state/volumes","IMAGE_PATH=/worker-state/volumes.img","SIZE_IN_BYTES=63381999616"],"error":"exit status 32","session":"2.2.1","stderr":"+ '[' '!' -e /worker-state/volumes.img ']'\n++ stat --printf=%s /worker-state/volumes.img\n+ '[' 63381999616 '!=' 63381999616 ']'\n++ losetup -j /worker-state/volumes.img\n++ cut -d: -f1\n+ lo=\n+ '[' -z '' ']'\n++ losetup -f --show /worker-state/volumes.img\n+ lo=/dev/loop1\n+ file /worker-state/volumes.img\n+ grep BTRFS\nbash: line 11: file: command not found\n+ /worker-state/2.7.0/linux/btrfs/mkfs.btrfs --nodiscard /worker-state/volumes.img\n+ mkdir -p /worker-state/volumes\n+ mountpoint -q /worker-state/volumes\n+ mount -t btrfs /dev/loop1 /worker-state/volumes\nmount: unknown filesystem type 'btrfs'\n","stdout":"btrfs-progs v4.4\nSee http://btrfs.wiki.kernel.org for more information.\n\nLabel: (null)\nUUID: b2f2ac93-4ccd-4f47-9216-f3c06ff4223e\nNode size: 16384\nSector size: 4096\nFilesystem size: 59.03GiB\nBlock group profiles:\n Data: single 8.00MiB\n Metadata: DUP 1.01GiB\n System: DUP 12.00MiB\nSSD detected: no\nIncompat features: extref, skinny-metadata\nNumber of devices: 1\nDevices:\n ID SIZE PATH\n 1 59.03GiB /worker-state/volumes.img\n\n"}}
which contains the error:
mount: unknown filesystem type 'btrfs'
from
mount -t btrfs /dev/loop1 /worker-state/volumes
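Because the relevant message is buried inside a long JSON log line, a small filter can make it easier to spot. This is only a sketch: the function name missing_fs_types is hypothetical, and it assumes the message keeps the exact wording "unknown filesystem type '...'" shown above.

```shell
# Hypothetical helper: reads log lines on stdin and prints each distinct
# filesystem type that mount reported as unknown, one per line.
missing_fs_types() {
    # grep exits non-zero when nothing matches; "|| true" keeps the
    # pipeline usable under "set -e" / "set -o pipefail".
    { grep -o "unknown filesystem type '[^']*'" || true; } |
        sed "s/.*'\([^']*\)'/\1/" |
        sort -u
}
```

One possible use is `docker-compose logs concourse-worker | missing_fs_types`, which would print btrfs for the log line above.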
I don't see a difference between running your reproduction on 1.12.6 and 1.13.1, so either there has been no regression in Docker for Mac or this test case does not show it. I see
{"timestamp":"1488379673.105854750","source":"worker","message":"worker.baggageclaim.falling-back-on-naive-driver","log_level":2,"data":{"error":"exit status 32","session":"2"}}
in the logs but no further mention of the driver. Later, I see
{"timestamp":"1488379673.116163015","source":"worker","message":"worker.beacon.restarting","log_level":2,"data":{"error":"failed to dial: failed to connect to TSA: dial tcp 172.17.0.3:2222: getsockopt: connection refused","session":"3"}}
{"timestamp":"1488379673.116110563","source":"baggageclaim","message":"baggageclaim.listening","log_level":1,"data":{"addr":"127.0.0.1:7788"}}
which sounds potentially fatal. Is this error expected?
@dsheets
Hrm strange. I'm using 1.12.3 and it worked. I'm going to try and see if I can get a copy of 1.12.6 and see if it breaks on that. (and I'm on OSX Sierra 10.12.3)
Pulling ubuntu@sha256:dd7808d8792c9841d0b460122f1acf0a2dd1f56404f8d1e56298048885e45535...
sha256:dd7808d8792c9841d0b460122f1acf0a2dd1f56404f8d1e56298048885e45535: Pulling from library/ubuntu
d54efb8db41d: Pulling fs layer
f8b845f45a87: Pulling fs layer
e8db7bf7c39f: Pulling fs layer
9654c40e9079: Pulling fs layer
6d9ef359eaaa: Pulling fs layer
9654c40e9079: Waiting
6d9ef359eaaa: Waiting
f8b845f45a87: Verifying Checksum
f8b845f45a87: Download complete
e8db7bf7c39f: Download complete
9654c40e9079: Verifying Checksum
9654c40e9079: Download complete
6d9ef359eaaa: Verifying Checksum
6d9ef359eaaa: Download complete
d54efb8db41d: Verifying Checksum
d54efb8db41d: Download complete
d54efb8db41d: Pull complete
f8b845f45a87: Pull complete
e8db7bf7c39f: Pull complete
9654c40e9079: Pull complete
6d9ef359eaaa: Pull complete
Digest: sha256:dd7808d8792c9841d0b460122f1acf0a2dd1f56404f8d1e56298048885e45535
Status: Downloaded newer image for ubuntu@sha256:dd7808d8792c9841d0b460122f1acf0a2dd1f56404f8d1e56298048885e45535
Successfully pulled ubuntu@sha256:dd7808d8792c9841d0b460122f1acf0a2dd1f56404f8d1e56298048885e45535.
Hey! Listen!
@dsheets
Very strange. While you're seeing failure on both versions, I'm now seeing success.
I tried the repro I gave you on 1.12.3 (what I was running) and it worked. I then upgraded to 1.12.6 (the version you were running) and it worked, and finally upgraded to latest (17.03.0-ce-mac1) and it still worked.
Rebooted, cleared my local volumes, ran again, still worked.
I'm stumped but I can't complain. I'll wait for others to see what they report when running latest concourse + docker.
I get a similar error running Windows 10 w/ Docker 17.03.1-ce, build c6d412e and using latest instructions and builds from http://concourse.ci/docker-repository.html.
docker: Error response from daemon: error creating aufs mount to /var/lib/docker/aufs/mnt/9df06285b4b1b55e5c87c8f9b74274b6404cc2fd5e259db0127f236946322fed-init: invalid argument.
See 'docker run --help'.
btrfs is enabled in the LinuxKit based Docker for Mac. However it is only available as a module as compiling it into the kernel slows down the boot process considerably.
You may have to modprobe btrfs from a sufficiently privileged container first.
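A minimal sketch of that check-then-load step follows. The function names fs_registered and ensure_btrfs are hypothetical, and the sketch assumes that modprobe and the VM's kernel modules are reachable from the privileged container, which this thread does not confirm.

```shell
# Hypothetical helper: succeeds if the named filesystem is listed in a
# /proc/filesystems-style file (second argument, default /proc/filesystems).
fs_registered() {
    fs_name="$1"
    fs_list="${2:-/proc/filesystems}"
    # Each line is "[nodev]<TAB>name", so the name is always the last field.
    awk -v fs="$fs_name" '$NF == fs { found = 1 } END { exit !found }' "$fs_list"
}

# Hypothetical helper: load the btrfs module only when the kernel does not
# already list the filesystem. Needs root in a sufficiently privileged container.
ensure_btrfs() {
    if fs_registered btrfs; then
        echo "btrfs already available"
    elif modprobe btrfs; then
        echo "btrfs module loaded"
    else
        echo "could not load btrfs module" >&2
        return 1
    fi
}
```

Calling ensure_btrfs before the Concourse worker starts would avoid loading the module twice, and fs_registered on its own is a read-only way to see whether the mount in the log above could succeed.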
@rn Thanks!
Expected behavior
When selecting the btrfs docker driver, it works
Actual behavior
It does not work as btrfs is not enabled in the kernel.
Information
Steps to reproduce the behavior
Possible fix
Compile the kernel with btrfs support enabled in the underlying VM for mac.