JetBrains / teamcity-docker-agent

TeamCity agent docker image sources
https://hub.docker.com/r/jetbrains/teamcity-agent/
Apache License 2.0

Docker support fails even with privileged flag #16

Closed braunsonm closed 4 years ago

braunsonm commented 6 years ago

The following teamcity agent:

    teamcity-agent-stuff:
        image: jetbrains/teamcity-agent:latest
        restart: always
        environment:
            - "SERVER_URL=http://something:8111"
            - "DOCKER_IN_DOCKER=start"
            - "AGENT_NAME=stuff"
        volumes:
            - teamcity-agent-stuff:/data/teamcity_agent/conf
        networks:
          - teamcity-network
        # Adds docker in docker support
        privileged: true

Still fails when trying a Docker Build step:

Warning: failed to get default registry endpoint from daemon (Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?). Using system default: https://index.docker.io/v1/
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

braunsonm commented 6 years ago

The agent itself states it has Docker Build support, however the agent cannot find the docker daemon socket to run any docker build commands.

braunsonm commented 6 years ago

I found that this can be fixed by entering the container and running service docker start. It turns out the containers don't automatically start the daemon on their own for some reason.

Note the following:

root@e607d9db05de:/services# cat run-docker.sh
#!/bin/bash

if [ "$DOCKER_IN_DOCKER" = "start" ] ; then
 service docker start
 echo "Docker daemon started"
fi
root@e607d9db05de:/services# echo $DOCKER_IN_DOCKER
start

As you can see, my DOCKER_IN_DOCKER variable is set, however the script did not run on startup.
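
The manual workaround described above can also be run from the host, without opening a shell in the container. A minimal sketch, assuming the image ships the same service docker init script shown above; the container name in the usage line is only an example and should be replaced with the name from docker ps:

```shell
#!/bin/bash
# Sketch: start the inner Docker daemon from the host if it is not running.
# Assumption: the container name passed in is an example; substitute your own.
ensure_inner_docker() {
    local container="$1"
    # "service docker status" prints "Docker is running" when the daemon is up
    # and "Docker is not running" otherwise, so match the positive phrase only.
    if docker exec "$container" service docker status | grep -q "is running"; then
        echo "inner daemon already running in $container"
    else
        docker exec "$container" service docker start
        echo "inner daemon started in $container"
    fi
}

# Usage: ensure_inner_docker teamcity-agent-stuff
```

This only papers over the startup problem; it does not explain why the script did not run on startup.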

JamesMcMahon commented 6 years ago

Hey,

The DOCKER_IN_DOCKER setting works for me. They could have updated the image or it could be the way I am formatting it.

This is what I have:

teamcity-agent:
  image: jetbrains/teamcity-agent:latest
  environment:
    - SERVER_URL=http://server:8111
    - AGENT_NAME=regular_agent
    - DOCKER_IN_DOCKER=start
  privileged: true

So I no longer see the failed to get default registry endpoint from daemon error, but I do now see this error:

docker: Error response from daemon: error creating aufs mount to /var/lib/docker/aufs/mnt/3d9374e474dc71e700bc70591563f5d79409088b4a9ea75b7a338fbc1b2c24f8-init: invalid argument. See 'docker run --help'.

Anyone from the Jetbrains team have any idea what might be going on?

VladRassokhin commented 6 years ago

@JamesMcMahon I haven't tested it myself, but it seems it's impossible to use aufs-in-aufs. AFAIR we don't pass the outer docker socket into the agent; instead we run another docker service inside the agent. That approach imposes some limitations but provides security.

JamesMcMahon commented 6 years ago

Thanks for the answer @VladRassokhin, does that mean that the DOCKER_IN_DOCKER option is broken? If so, it may be good to remove the option altogether to avoid people putting in cycles to get it to work.

I actually did get a little further down the road than my post above. I was being a dummy and didn't have a volume mount, which is what was causing that error. Once I added the volume mount I got a completely different error, which I sadly don't have on me.

braunsonm commented 6 years ago

@JamesMcMahon The option possibly doesn't work with aufs. It works with other storage drivers.

JamesMcMahon commented 6 years ago

Ah ok, I've never dived into Docker storage drivers. What is the correct option to use to make this work? Or is it basically, works on everything else except aufs?

braunsonm commented 6 years ago

I use overlay2 and it works fine other than the issue I mentioned above about failing to auto start. You can enter the container and start it yourself and then it's fine.
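
To see which storage driver is actually in use, docker info can be queried directly. A minimal sketch, assuming the docker CLI can reach a daemon; run it on the host and again inside the agent container to compare the outer and inner daemons:

```shell
#!/bin/bash
# Sketch: report the storage driver of whichever Docker daemon the CLI reaches.
check_storage_driver() {
    local driver
    # '{{.Driver}}' is the storage driver field of "docker info".
    driver="$(docker info --format '{{.Driver}}')" || return 1
    if [ "$driver" = "aufs" ]; then
        echo "storage driver is aufs: nested Docker may fail, per the reports above"
    else
        echo "storage driver is $driver"
    fi
}

# Usage: check_storage_driver
```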

drawm commented 6 years ago

Same problem here. The docker service seems to stop after ~20 sec, even if I start it manually. Using overlay2 (also tested with aufs). Docker version 18.03.0-ce, build 0520e24302

# From inside the agent's container
root@f2691c25f7f7:/# service docker status
 * Docker is not running
root@f2691c25f7f7:/# service docker start
 * Starting Docker: docker [ OK ]
root@f2691c25f7f7:/# service docker status
 * Docker is running
root@f2691c25f7f7:/# time docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

real    0m23.089s
user    0m0.010s
sys 0m0.011s
root@f2691c25f7f7:/# service docker status
 * Docker is not running

# docker-compose.yml
version: '3.1'

networks:
  ext:

volumes:
  docker_volumes:

services:

  agent:
    image: jetbrains/teamcity-agent:2017.2.3
    environment:
      - DOCKER_IN_DOCKER=start
      - SERVER_URL=server.address
    volumes:
      - docker_volumes:/var/lib/docker
    privileged: true
    networks:
      - ext

Any idea on how to fix this?

drawm commented 6 years ago

Same result with the less secure approach...

version: '3.1'

networks:
  ext:

services:
  agent:
    image: jetbrains/teamcity-agent
    environment:
      - DOCKER_IN_DOCKER=start
      - SERVER_URL=server.address
    volumes:
      - volumes/temp=/opt/buildagent/temp
      - volumes/work=/opt/buildagent/work
    networks:
      - ext

Logs when running service docker start

time="2018-04-04T20:34:15.007189666Z" level=info msg="libcontainerd: started new docker-containerd process" pid=3002
time="2018-04-04T20:34:15Z" level=info msg="starting containerd" module=containerd revision=89623f28b87a6004d4b785663257362d1658a729 version=v1.0.0 
time="2018-04-04T20:34:15Z" level=info msg="setting subreaper..." module=containerd 
time="2018-04-04T20:34:15Z" level=info msg="changing OOM score to -500" module=containerd 
containerd: write /proc/3002/oom_score_adj: permission denied
time="2018-04-04T20:34:15.022913374Z" level=error msg="containerd did not exit successfully" error="exit status 1" module=libcontainerd
Failed to connect to containerd: failed to dial "/var/run/docker/containerd/docker-containerd.sock": dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout

/var/run/docker/containerd/docker-containerd.sock does not exist

root@758c10b37338:/# ls -lah /var/run/docker/containerd/
total 20K
drwx------ 3 root root 4.0K Apr  4 20:36 .
drwx------ 3 root root 4.0K Apr  4 20:18 ..
-rw------- 1 root root  559 Apr  4 20:36 containerd.toml
drwx--x--x 2 root root 4.0K Apr  4 20:18 daemon
-rw-r----- 1 root root    4 Apr  4 20:36 docker-containerd.pid # <--- disappears when the docker service stops

Is the path to docker-containerd.sock invalid? Can we change it with a config option?
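
One thing worth checking here: the compose file in this comment has no privileged: true line, and a "permission denied" on oom_score_adj is the kind of failure an unprivileged container could produce. That link is an assumption, not something confirmed in this thread. A minimal sketch for checking, from the host, how a running container was started (the container name in the usage line is hypothetical):

```shell
#!/bin/bash
# Sketch: succeed only if the given container was started in privileged mode.
# Assumption: the container name is an example; take the real one from "docker ps".
is_privileged() {
    local container="$1"
    # HostConfig.Privileged is "true" or "false" in "docker inspect" output.
    [ "$(docker inspect --format '{{.HostConfig.Privileged}}' "$container")" = "true" ]
}

# Usage:
#   if is_privileged my-agent; then echo "privileged"; else echo "not privileged"; fi
```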

drawm commented 6 years ago

I switched back to the secure method

docker-compose down --volumes
cp ../docker-compose.secure.yml ./docker-compose.yml
docker-compose up

Strangely, the containerd.sock is present and the docker service is running.

root@9829cc5ff57f:/# ls -lah /var/run/docker/containerd/
total 20K
drwx------ 3 root root 4.0K Apr  4 20:40 .
drwx------ 6 root root 4.0K Apr  4 20:40 ..
-rw------- 1 root root  559 Apr  4 20:40 containerd.toml
drwx--x--x 3 root root 4.0K Apr  4 20:40 daemon
srw-rw---- 1 root root    0 Apr  4 20:40 docker-containerd-debug.sock
-rw-r----- 1 root root    2 Apr  4 20:40 docker-containerd.pid
srw-rw---- 1 root root    0 Apr  4 20:40 docker-containerd.sock
root@9829cc5ff57f:/# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
root@9829cc5ff57f:/# service docker status
 * Docker is running

But when I tried to run a build configuration against that agent, the build simply froze... The build log simply prints:

[20:42:54]The build is removed from the queue to be prepared for the start

It's been 10+ minutes since the build failed; the agent has been removed from TeamCity, but the build is still waiting for it to reply.

This is not what I expect from JetBrains products... I wonder how I can justify TeamCity's cost to my boss when I have to spend hours debugging it.

drawm commented 6 years ago

Possible solution that worked for me. The documentation on https://hub.docker.com/r/jetbrains/teamcity-agent/ does not tell you why you need all those volumes. It turns out they aren't meant to be shared between your agents; each instance should have its own volumes. I simply removed the volumes, as we do not care about the warm-up time, and everything started working.
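
For anyone who wants to keep the volumes instead of removing them, the point about each instance having its own can be sketched as follows. This is only an illustration: the agent-N container names and the volume names are hypothetical, while the image, environment variables, and mount points are the ones used elsewhere in this thread. The function prints the commands rather than running them:

```shell
#!/bin/bash
# Sketch: print one "docker run" command per agent, each with its own volumes.
# Assumptions: the agent-N naming scheme and the volume names are hypothetical.
agent_run_command() {
    local n="$1" server_url="$2"
    printf 'docker run -d --name agent-%s --privileged' "$n"
    printf ' -e SERVER_URL=%s -e DOCKER_IN_DOCKER=start' "$server_url"
    # Per-agent named volumes, so no two agents share config or Docker state.
    printf ' -v agent-%s-conf:/data/teamcity_agent/conf' "$n"
    printf ' -v agent-%s-docker:/var/lib/docker' "$n"
    printf ' jetbrains/teamcity-agent\n'
}

for n in 1 2 3; do
    agent_run_command "$n" "http://server:8111"
done
```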

Based on this issue https://youtrack.jetbrains.com/issue/TW-53769

Final docker-compose

version: '3.1'

networks:
  ext:

services:

  agent:
    image: jetbrains/teamcity-agent:2017.2.3
    environment:
      - DOCKER_IN_DOCKER=start
      - SERVER_URL=server.address
    privileged: true
    networks:
      - ext

braunsonm commented 6 years ago

@drawm That is clearly shown in the documentation for using this container. It is also common sense for agents to not share their configuration with other agents.

Your replies have no relation to the issue I reported.

VladRassokhin commented 6 years ago

@ChaosCA Well, I've tried it myself and cannot reproduce the issue, either on a local installation (using docker-compose) or on GCP (using Kubernetes). Here's my docker-compose.yaml content:

version : '2'

services:
  server:
    image: 'jetbrains/teamcity-server:2017.2.3'
    ports:
      - 8111:8111
    environment:
      - TEAMCITY_SERVER_MEM_OPTS="-Xmx750m"
    volumes:
      - ./data:/data/teamcity_server/datadir
      - ./logs:/opt/teamcity/logs
  agent:
    image: 'jetbrains/teamcity-agent:2017.2.3'
    environment:
      - SERVER_URL=http://server:8111
      - AGENT_NAME=full-agent
      - DOCKER_IN_DOCKER=start
    privileged: true

Could you please share the docker logs from the agent instance (/var/log/docker.log)?

drawm commented 6 years ago

@ChaosCA The doc only states that the volumes should be present, not why they are needed or what to expect if you mishandle them. Running multiple agents with the provided example results in the same error you have, which is why I am reporting my findings here.

Anyway, good luck with your issue...

evgenisokolov commented 5 years ago

Hi! I have the same problem with the latest version of teamcity-docker-agent. Docker fails on startup, even when I try to start it with 'service docker start'. My deployment.yml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: teamcity-agents-deployment
  labels:
    app: teamcity-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: teamcity-agent
  template:
    metadata:
      labels:
        app: teamcity-agent
    spec:
      containers:

evgenisokolov commented 5 years ago

After some investigation I fixed it by adding:

securityContext:
  capabilities:
    add:

to the deployment configuration.

hariseldon78 commented 4 years ago

Hi, I'm not sure if it's the same problem. The /services/run-docker.sh script was running fine at container boot, but when I restarted my host server the docker service would start and after a bit it would error out and exit. I solved it by inserting a 'sleep 20' before starting the docker service, and that made it work.

#!/bin/bash
if [ "$DOCKER_IN_DOCKER" = "start" ] ; then
 rm /var/run/docker.pid 2>/dev/null
 sleep 20
 service docker start
 echo "Docker daemon started"
fi
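
A fixed sleep depends on guessing how long the host needs. An alternative that might be more robust is to start the daemon only when it is reported as down and then probe it, repeating a few times. This is a sketch under assumptions, not the image's actual script: the attempt count and delay are arbitrary defaults, and docker info is used purely as a reachability probe.

```shell
#!/bin/bash
# Sketch: (re)start the inner daemon when it is down, then probe until reachable.
# Assumptions: 6 attempts and a 20 s delay are arbitrary defaults.
start_docker_with_retry() {
    local attempts="${1:-6}" delay="${2:-20}" i=1
    while [ "$i" -le "$attempts" ]; do
        # Only start when the init script says the daemon is down.
        if ! service docker status | grep -q "is running"; then
            rm -f /var/run/docker.pid   # clear a stale pid file, as in the script above
            service docker start
        fi
        sleep "$delay"
        if docker info >/dev/null 2>&1; then
            echo "Docker daemon reachable after attempt $i"
            return 0
        fi
        echo "Docker daemon not reachable yet ($i/$attempts)" >&2
        i=$((i + 1))
    done
    return 1
}

# Usage, inside the agent container:
#   if [ "$DOCKER_IN_DOCKER" = "start" ]; then start_docker_with_retry; fi
```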

kir commented 4 years ago

Hello @hariseldon78, could you please clarify the part "when i restarted my host server the docker service would start and after a bit it would error out and exit"? Could you describe your setup a bit more? I'm not sure I understand what we can do about it.

Thanks,

hariseldon78 commented 4 years ago

Hello @kir . My environment is this: an aws ec2 server, on it i am running docker. 2 docker containers: teamcity-server-instance and teamcity-agent-instance. The start command is like this:

docker run --name teamcity-server-instance  \
    -v $HOME/teamcity/data:/data/teamcity_server/datadir \
    -v $HOME/teamcity/logs:/opt/teamcity/logs  \
    -p 8111:8111 \
    --restart=always \
    jetbrains/teamcity-server

docker run --name teamcity-agent-instance \
    -e SERVER_URL="##########:8111"  \
    -v $HOME/teamcity/teamcity-agent:/data/teamcity_agent/conf  \
    -v docker_volumes:/var/lib/docker \
    --privileged -e DOCKER_IN_DOCKER=start \
    --restart=always \
    ###########/####-teamcity-agent

(The agent image is in my custom repository because I added some packages to it, like nodejs and stuff.)

Before inserting the sleep 20 that I mentioned, I created the two containers with the commands you see. Then I tried stopping and restarting them with docker stop and docker start, and everything was working fine. But when I tried restarting the aws ec2 instance, I noticed that the 'docker in docker' feature was not working.

So I ran docker exec -it teamcity-agent-instance bash to get into the agent container, and /etc/init.d/docker status reported the service as turned off. With /etc/init.d/docker start it would start and work fine after that. So I tried logging into the container as soon as possible after the aws instance rebooted and saw that it was on, and then after some seconds it turned off, with an error message that I don't remember (I could dig into the log if really needed).

Then I looked here to understand how the docker-in-docker service was started and found the start script. I understood that it was somehow bound to the docker service running on the ec2 server, so it was probably a boot order problem. I added the sleep and boom! Now it works every time.

kir commented 4 years ago

Hello @hariseldon78 ,

Thanks for the details. I'd really appreciate it if you could provide the error from the log, as it may clarify why docker stopped in the container. We could add the sleep, but it is better to understand why it is needed at all. Also, how did you configure the TeamCity containers to auto-start on the ec2 instance?

Thanks again,

hariseldon78 commented 4 years ago

OK, here is the error (I just removed the sleep and it happened on the first reboot):

docker exec -it teamcity-agent-instance bash
root@11d706dfe193:/# /etc/init.d/docker status
 * Docker is running
root@11d706dfe193:/# /etc/init.d/docker status
 * Docker is running
root@11d706dfe193:/# /etc/init.d/docker status
 * Docker is running
root@11d706dfe193:/# /etc/init.d/docker status
 * Docker is running
root@11d706dfe193:/# /etc/init.d/docker status
 * Docker is not running
root@11d706dfe193:/# date
Fri Oct 11 14:41:41 UTC 2019

(i repeated the docker status command about every 2-3 seconds)
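
The same polling can be scripted so that each status line carries a timestamp, which makes it easier to line up with the daemon log. A minimal sketch, to be run inside the agent container; the default sample count and interval are arbitrary assumptions:

```shell
#!/bin/bash
# Sketch: print a timestamped "service docker status" line at a fixed interval.
# Assumptions: 30 samples and a 2 s interval are arbitrary defaults.
watch_inner_docker() {
    local count="${1:-30}" interval="${2:-2}" i=0
    while [ "$i" -lt "$count" ]; do
        # UTC time first, then whatever the init script reports.
        printf '%s %s\n' "$(date -u +%H:%M:%S)" "$(service docker status 2>&1)"
        sleep "$interval"
        i=$((i + 1))
    done
}

# Usage: watch_inner_docker 40 3
```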

The following is the /var/docker.log since the reboot.

time="2019-10-11T14:40:33.161042011Z" level=info msg="libcontainerd: docker-containerd is still running" pid=240
time="2019-10-11T14:40:33.162402970Z" level=info msg="parsed scheme: \"unix\"" module=grpc
time="2019-10-11T14:40:33.162423878Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
time="2019-10-11T14:40:33.164780453Z" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///var/run/docker/containerd/docker-containerd.sock 0  <nil>}]" module=grpc
time="2019-10-11T14:40:33.164807427Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
time="2019-10-11T14:40:33.164869425Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42035a390, CONNECTING" module=grpc
time="2019-10-11T14:40:53.165142121Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/docker-containerd.sock 0  <nil>}. Err :connection error: desc = \"transport: error while dialing: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\". Reconnecting..." module=grpc
time="2019-10-11T14:40:53.165210502Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42035a390, TRANSIENT_FAILURE" module=grpc
time="2019-10-11T14:40:53.165344758Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42035a390, CONNECTING" module=grpc
time="2019-10-11T14:41:13.165472232Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/docker-containerd.sock 0  <nil>}. Err :connection error: desc = \"transport: error while dialing: dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout\". Reconnecting..." module=grpc
time="2019-10-11T14:41:13.165556900Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42035a390, TRANSIENT_FAILURE" module=grpc
time="2019-10-11T14:41:13.165697424Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42035a390, CONNECTING" module=grpc
time="2019-10-11T14:41:33.165721211Z" level=warning msg="Failed to dial unix:///var/run/docker/containerd/docker-containerd.sock: grpc: the connection is closing; please retry." module=grpc
Failed to connect to containerd: failed to dial "/var/run/docker/containerd/docker-containerd.sock": context deadline exceeded
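
For longer logs, the two messages that matter here can be counted rather than read by eye. A minimal sketch that matches the exact strings appearing in the log above; the log path is passed in as an argument because it may differ between image versions:

```shell
#!/bin/bash
# Sketch: count containerd connection problems in a Docker daemon log.
# The patterns are fixed strings copied from the log lines in this thread.
summarize_docker_log() {
    local logfile="$1"
    echo "dial timeouts: $(grep -cF 'docker-containerd.sock: timeout' "$logfile")"
    echo "fatal containerd errors: $(grep -cF 'Failed to connect to containerd' "$logfile")"
}

# Usage: summarize_docker_log /var/log/docker.log
```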

"Also, how did you configure the TeamCity containers to auto-start on the ec2 instance?"

I used the '--restart always' option with docker run or docker update.

kir commented 4 years ago

Hi @hariseldon78 ,

Thanks a lot for the log. I believe the error is related to this issue: https://github.com/docker/for-linux/issues/517. We'll update the docker version used in our image, which will probably fix the problem with the docker daemon stopping.

I've filed this bug in our tracker as https://youtrack.jetbrains.com/issue/TW-62466

Best,

kir commented 4 years ago

This commit should fix the issue.