Sitecore / docker-images

Build docker images for Sitecore
MIT License

SQL container fails boot when run in Kubernetes #298

Open NandGates opened 4 years ago

NandGates commented 4 years ago

Mapping the provided docker-compose files to a Kubernetes manifest leads to the SQL container failing to start.

I am using the completely unmodified Docker containers generated using Build.ps1 and the only change of note has been mapping the environment variables into Kubernetes format. No other code or architectural changes have been made.

I have been triaging this for some time and have isolated it to the use of Invoke-SqlCmd, which I believe has a timing issue when a Deployment starts the container. Despite there being no password applied, the container returns:

Login failed for NT AUTHORITY/ANONYMOUS LOGON

sqlcmd works flawlessly at all times.

Could we please consider removing Invoke-SqlCmd from both Boot.ps1 scripts and replacing it with sqlcmd -Q, which is much more robust? We are already using this in the base mssql layer at windows\dependencies\mssql-developer-2017\Start.ps1 and it would be good to standardise.

I am literally running the below snippet in my Kubernetes manifest to replace Invoke-SqlCmd with sqlcmd which is working as expected.

command: ['powershell']
args: ['-c', '(Get-Content ./Boot.ps1) -replace "Invoke-SqlCmd -Query", "sqlcmd -Q" | Out-File Boot.ps1; C:/Boot.ps1 -InstallPath $env:INSTALL_PATH -DataPath $env:DATA_PATH']
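For readers wondering where that override sits, here is a minimal sketch of the same workaround placed inside a container spec. It assumes the container name and image placeholder from the manifest later in this thread. The folded block scalar (`>-`) is only a readability choice: it joins the lines with spaces, so PowerShell receives the same single-line command.

```yaml
# Sketch only: fragment of spec.template.spec, not a complete manifest.
containers:
  - name: xm-sql
    image: "[REGISTRY]/sitecore-xm-sqldev"   # placeholder, quoted so the YAML parses
    command: ['powershell']
    args:
      - '-c'
      # Rewrite Boot.ps1 to call sqlcmd instead of Invoke-SqlCmd, then run it.
      # Kubernetes only expands $(VAR) syntax, so $env:... reaches PowerShell untouched.
      - >-
        (Get-Content ./Boot.ps1) -replace "Invoke-SqlCmd -Query", "sqlcmd -Q" |
        Out-File Boot.ps1;
        C:/Boot.ps1 -InstallPath $env:INSTALL_PATH -DataPath $env:DATA_PATH
```

The substitution only covers calls written as `Invoke-SqlCmd -Query`. Any call that uses other parameters would be left unchanged.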

I am happy to supply a PR if you believe this change is useful and likely to be approved.

bplasmeijer commented 4 years ago

Can you share K8s manifest templates?

NandGates commented 4 years ago

Hi @bplasmeijer,

Thanks for the response, it is much appreciated. I'm aware you are busy!

I've included the Kubernetes deployment manifest below. For completeness the environment details are

I'm starting to believe this is a timing issue caused by the CMD being executed too early, so I am considering implementing a health probe. If this addresses the issue I will post here.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sitecore-sql-xm
  namespace: sitecore
  labels:
    app: sitecore
spec:
  selector:
    matchLabels:
      app: sitecore
      role: xm-sql
  template:
    metadata:
      labels:
        app: sitecore
        role: xm-sql
    spec:
      containers:
        - name: xm-sql
          image: [REGISTRY]/sitecore-xm-sqldev
          imagePullPolicy: Always
          env:
          - name: SA_PASSWORD
            value: "8Tombs-Given-Clock#-arming-Alva-debut-Spine-monica-Normal-Ted-About1-chard-Easily-granddad-5Context!"
          - name: ACCEPT_EULA
            value: "Y"
      nodeSelector:
        agentpool: win

NandGates commented 4 years ago

I can confirm this is due to a timing issue with container readiness. Unfortunately AKS currently does not support startupProbe declarations (https://github.com/Azure/AKS/issues/1550) but a readinessProbe as below has solved the issue.

Closing but also posting the solution below for any future readers.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sitecore-sql-xp
  namespace: sitecore
  labels:
    app: sitecore
spec:
  selector:
    matchLabels:
      app: sitecore
      role: xp-sql
  template:
    metadata:
      labels:
        app: sitecore
        role: xp-sql
    spec:
      containers:
        - name: xp-sql
          image: [REPOSITORY]/sitecore-xp-sqldev
          imagePullPolicy: Always
          env:
          - name: SA_PASSWORD
            value: "8Tombs-Given-Clock#-arming-Alva-debut-Spine-monica-Normal-Ted-About1-chard-Easily-granddad-5Context!"
          - name: ACCEPT_EULA
            value: "Y"
          readinessProbe:
            tcpSocket:
              port: 1433
            failureThreshold: 30
            periodSeconds: 10
      nodeSelector:
        agentpool: win

NandGates commented 4 years ago

Unfortunately this issue persists even with the readinessProbe.
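This outcome is consistent with how probes work. A readinessProbe only controls whether the pod receives Service traffic, and a startupProbe only holds back the other probes until it passes. Neither delays the moment the container's entrypoint runs. For clusters that do support startupProbe, an equivalent fragment might look like the sketch below. The values are simply copied from the readinessProbe above, and it should not be expected to change the Invoke-SqlCmd behaviour.

```yaml
# Sketch only: assumes a cluster version that supports startupProbe.
# Goes under the container entry, alongside env and readinessProbe.
startupProbe:
  tcpSocket:
    port: 1433            # default SQL Server port, as in the readinessProbe
  failureThreshold: 30    # up to 30 checks ...
  periodSeconds: 10       # ... 10 s apart, i.e. roughly 300 s to open the port
```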

As best I can tell, in Kubernetes the entrypoint is being invoked too early, and Invoke-SqlCmd somehow errors and then caches invalid credentials. Calling sqlcmd directly does not have this problem.

Interestingly the error returned from Invoke-SqlCmd is

Login failed for NT AUTHORITY/ANONYMOUS LOGON

But when I run whoami inside the container I get (as expected)

usermanager/container administrator

When I query the SQL identity (using sqlcmd -Q "SELECT SUSER_NAME()") I also get, as expected,

usermanager/container administrator

I honestly have no explanation for this, so I'm reverting to my original request which is for the standardisation of sqlcmd rather than Invoke-SqlCmd as this is what is used by base images in the Docker process. As mentioned I am happy to submit a PR if it is likely to be useful and approved.

As a final note, Microsoft seem to be standardising on the use of sqlcmd in their tooling and scripts on MSDN also, so this would align to vendor practice.

pbering commented 4 years ago

May I suggest that you use the Linux images for SQL and Solr instead? I know for sure that they work in Kubernetes, and they are also faster and use fewer resources.