Azure / azure-workload-identity

Azure AD Workload Identity uses Kubernetes primitives to associate managed identities for Azure resources and identities in Azure Active Directory (AAD) with pods.
https://azure.github.io/azure-workload-identity
MIT License
288 stars 86 forks source link

Sidecar doesn't work properly for Microsoft ODBC Driver for SQL Server #593

Open thomasfrederikhoeck opened 1 year ago

thomasfrederikhoeck commented 1 year ago

Describe the bug I have a container that connect to a Azure Database using a User Managed Identity. It previously worked when using AAD Pod identity. I have tried enabling the sidecar (as I think the driver uses ADAL) and I can see in the logs that it make some calls but it somehow doesn't work when it gets back to the container.

Steps To Reproduce

1: Create an appropriate image with ODBC driver and python:

FROM python:3.9.15-buster

RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
&& curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update && \
ACCEPT_EULA=Y apt-get install -y \
        msodbcsql17 \
        unixodbc-dev \
        mssql-tools 

ENV PATH="/opt/mssql-tools/bin:${PATH}"

RUN pip install sqlalchemy pyodbc
COPY script.py /script.py

where the script.py is:

from sqlalchemy import create_engine
import urllib
import os

server= os.getenv("SERVER")
uid = os.getenv("UID")
raw_params = (
"DRIVER={ODBC Driver 17 for SQL Server};"
f"SERVER={server};"
"Encrypt=yes;"
"TrustServerCertificate=no;"
"Connection Timeout=30;"
f"UID={uid};"
"Authentication=ActiveDirectoryMsi"
)
params = urllib.parse.quote_plus(raw_params)
connection_string = f"mssql+pyodbc:///?odbc_connect={params}"
engine = create_engine(connection_string)
engine.execute("select 1")

2: deploy to cluster with following yaml:

apiVersion: v1
kind: Pod
metadata:
  name: awli-test
  annotations:
    azure.workload.identity/inject-proxy-sidecar: "true"
spec:
  serviceAccountName: my-workload-identity
  containers:
  - args:
    - /bin/bash
    - -c
    - --
    - while true; do sleep 30; done;
    env:
      - name: SERVER
        value: URL_FOR_AZURE_SQL_SERVER
      - name: UID
        value: CLIENT_ID_FOR_USER_MANAGED_IDENITITY
    image: test-image:latest
    imagePullPolicy: IfNotPresent
    name: main
    resources:
      limits:
        cpu: "2"
        memory: 10G
      requests:
        cpu: 500m
        memory: 5G

Where URL_FOR_AZURE_SQL_SERVER, CLIENT_ID_FOR_USER_MANAGED_IDENITITY and test-image:latest are changed accordingly.

3: See logs before running script by running

~ kubectl logs  awli-test -c azwi-proxy
I1017 10:02:24.612091       1 proxy.go:96] proxy "msg"="starting the proxy server"  "port"=8000 "userAgent"="azure-workload-identity/proxy/v0.13.0 (linux/amd64) 46c5137/2022-08-31-23:38"
I1017 10:02:24.653617       1 proxy.go:167] proxy "msg"="received readyz request"  "method"="GET" "uri"="/readyz"
  1. Exec into container, run script and wait for failure:
    
    ~ kubectl exec --stdin --tty awli-test -- /bin/bash
    ~ python script.py
    Traceback (most recent call last):
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect
    return fn()
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 320, in connect
    return _ConnectionFairy._checkout(self)
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 884, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 486, in checkout
    rec = pool._do_get()
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 146, in _do_get
    self._dec_overflow()
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 266, in _create_connection
    return _ConnectionRecord(self)
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 381, in __init__
    self.__connect()
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 678, in __connect
    pool.logger.debug("Error on connect(): %s", e)
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 673, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect
    return dialect.connect(*cargs, **cparams)
    File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
    pyodbc.OperationalError: ('HYT00', '[HYT00] [unixODBC][Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired (0) (SQLDriverConnect)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "//script.py", line 19, in engine.execute("select 1") File "", line 2, in execute File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/deprecations.py", line 402, in warned return fn(*args, kwargs) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3256, in execute connection = self.connect(close_with_result=True) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3315, in connect return self._connection_cls(self, close_with_result=close_with_result) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in init else engine.raw_connection() File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection return self._wrap_pool_connect(self.pool.connect, _connection) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3364, in _wrap_pool_connect Connection._handle_dbapi_exception_noconnection( File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2198, in _handle_dbapi_exceptionnoconnection util.raise( File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_ raise exception File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect return fn() File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 320, in connect return _ConnectionFairy._checkout(self) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 884, in _checkout fairy = _ConnectionRecord.checkout(pool) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 486, in checkout rec = pool._do_get() File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 146, in _do_get self._decoverflow() File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in exit compat.raise( File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_ raise exception File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get return self._create_connection() File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 266, in _create_connection return ConnectionRecord(self) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 381, in init self.connect() File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 678, in connect pool.logger.debug("Error on connect(): %s", e) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in exit compat.raise( File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_ raise exception File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 673, in __connect self.dbapi_connection = connection = pool._invoke_creator(self) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect return dialect.connect(*cargs, *cparams) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect return self.dbapi.connect(cargs, cparams) sqlalchemy.exc.OperationalError: (pyodbc.OperationalError) ('HYT00', '[HYT00] [unixODBC][Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired (0) (SQLDriverConnect)')

5. Observe that in the logs from the azwi-proxy that calls where made to get a token. (I have changed the CLIENT_ID to something random)

kubectl logs -n prefect awli-test -c azwi-proxy I1017 10:02:24.612091 1 proxy.go:96] proxy "msg"="starting the proxy server" "port"=8000 "userAgent"="azure-workload-identity/proxy/v0.13.0 (linux/amd64) 46c5137/2022-08-31-23:38" I1017 10:02:24.653617 1 proxy.go:167] proxy "msg"="received readyz request" "method"="GET" "uri"="/readyz" I1017 10:03:14.931967 1 proxy.go:101] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f2" I1017 10:03:15.432570 1 proxy.go:101] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f" I1017 10:03:16.433120 1 proxy.go:101] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f" I1017 10:03:18.433681 1 proxy.go:101] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f" I1017 10:03:33.455898 1 proxy.go:101] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f" I1017 10:03:33.956334 1 proxy.go:101] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f" I1017 10:03:34.956857 1 proxy.go:101] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f" I1017 10:03:36.957431 1 proxy.go:101] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f"


**Expected behavior**
That the SQL query would have worked like it did for AAD Pod Identity

**Logs**
See above

**Environment**

kubectl version Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"windows/amd64"} Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.15", GitCommit:"1d79bc3bcccfba7466c44cc2055d6e7442e140ea", GitTreeState:"clean", BuildDate:"2022-09-21T21:45:46Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}


- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:

**Additional context**
aramase commented 1 year ago

https://github.com/Azure/azure-workload-identity/pull/594 should be fix for this. I'll test this scenario to confirm.

aramase commented 1 year ago

@thomasfrederikhoeck the issue should be fixed with #594. Would you be able to try out aramase/proxy:v0.14.0-alpha.0 for the proxy?

thomasfrederikhoeck commented 1 year ago

Thank you for the quick response and fix @aramase ! The fix still requires sidecar to be used, right? :-) I'm not 100% how to test it as we have deployed the workload identity with az aks update -g <rg_name>-n <cluster_name>--enable-oidc-issuer --enable-workload-identity ? Also, when this is eventually merged will our cluster automatically use the fix because we deployed this way?

aramase commented 1 year ago

@thomasfrederikhoeck The image I shared above is for the sidecar. You can create a yaml similar to this with your application instead of adding the annotation to test if it works for your workload.

Also, when this is eventually merged will our cluster automatically use the fix because we deployed this way?

The fix is merged and will be part of next OSS release. After OSS release, AKS will use the latest release for the add-on and rollout to all clusters.

thomasfrederikhoeck commented 1 year ago

@aramase I didn't fix it for me :-/ I still see the same behavior when using mcr.microsoft.com/oss/azure/workload-identity/proxy-init:v0.13.0 with aramase/proxy:v0.14.0-alpha.0 when I apply:

apiVersion: v1
kind: Pod
metadata:
  name: awli-test
spec:
  serviceAccountName: my-workload-identity
  initContainers:
  - name: init-networking
    image: mcr.microsoft.com/oss/azure/workload-identity/proxy-init:v0.13.0
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        drop:
        - ALL
      privileged: true
      runAsUser: 0
    env:
    - name: PROXY_PORT
      value: "8000"
  containers:
  - name: proxy
    image: aramase/proxy:v0.14.0-alpha.0
    ports:
    - containerPort: 8000
    resources:
      limits:
        cpu: "2"
        memory: 10G
      requests:
        cpu: 500m
        memory: 2G
  - name: main
    args:
    - /bin/bash
    - -c
    - --
    - while true; do sleep 30; done;
    env:
      - name: SERVER
        value: $SERVER
      - name: UID
        value: $CLIENT_ID
    image: my-image
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: "2"
        memory: 10G
      requests:
        cpu: 500m
        memory: 5G

I know it is not a networking issue as it worked for both AAD Pod identity and for Service Principal. Can you get it to work if you use the image that i provided?

aramase commented 1 year ago

I know it is not a networking issue as it worked for both AAD Pod identity and for Service Principal. Can you get it to work if you use the image that i provided?

@thomasfrederikhoeck for me the sample image doesn't work even without workload/pod identity. I just tried assigning the identity manually to the VMSS and deployed the pod which still fails with the same error.

thomasfrederikhoeck commented 1 year ago

@aramase weird! The above worked for me both in AAD Pod identity and it also currently works fine in Azure Functions with docker image and user-managed identity. Do you have a cluster with AAD pod identity where you can see if it works for you?

thomasfrederikhoeck commented 1 year ago

@aramase I just SSH'ed into one of our Azure Functions and ran the above python snip-it with success with an appropriate client_id and server address.

The driver version is:

~ odbcinst -q -d -n "ODBC Driver 17 for SQL Server"
[ODBC Driver 17 for SQL Server]
Description=Microsoft ODBC Driver 17 for SQL Server
Driver=/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.10.so.1.1
UsageCount=1

So the above code works with User-Managed Identity in Azure Functions. :-) I it worked with AAD Pod Identity.

thomasfrederikhoeck commented 1 year ago

@aramase do you have an env where you can replicate? Or still no luck?

aramase commented 1 year ago

@aramase do you have an env where you can replicate? Or still no luck?

Still no luck. I'll try again later this week.

thomasfrederikhoeck commented 1 year ago

Hi @aramase :-) Have you had more time to look at the issue?

aramase commented 1 year ago

Hi @aramase :-) Have you had more time to look at the issue?

@thomasfrederikhoeck unfortunately, not yet! I merged a few more changes for the proxy. Could you try our nightly build ghcr.io/azure/azure-workload-identity/proxy:latest from the main branch?

vladkun commented 1 year ago

Hi @aramase, just FYI, proxy implementation also break java mssql-jdbc drivers because they expect expire_in field to be string, but it is a number - https://github.com/Azure/azure-workload-identity/blob/d7b71919d06cd963a77b90f0ca4a1ab6d4b83025/pkg/proxy/proxy.go#L59

There is parsing logic based on it, see https://github.com/microsoft/mssql-jdbc/blob/e4e783a0eb13d945b651f400f869cd1edead8f17/src/main/java/com/microsoft/sqlserver/jdbc/SQLServerSecurityUtility.java#L427 This applies to any version before latest, 12 (there was some serious refactoring and they replaced usage of expires_in with expires_at, which is unchanged in proxy)

Could this be fixed? Otherwise it ruins pretty much everything.

thomasfrederikhoeck commented 10 months ago

@aramase I'm still seeing this error. I tried deploying a new AKS cluster using the Workload Identity in GA. I tried both newest SQL Server ODBC 17 Driver and SQL ODBC 18 Driver and run the following code, but it still doesn't connect. I have also confimred that it is not a firewall issues since it works with a Service Principal. I have also checked that the Workload Identity works for other Azure resources such as a Key Vault. The application looks like this:

import sqlalchemy
import urllib
import os

import pandas as pd

server = "MY_AZURE_SQL_SERVER"
database = "MY_AZURE_SQL_DATABASE"
driver = "ODBC Driver 18 for SQL Server"
client_id_msi = "MY ID"

connect_params = [
        f"DRIVER={{{driver}}}",
        f"SERVER={{{server}}}",
        f"DATABASE={{{database}}}",
        "Encrypt=yes",
        "TrustServerCertificate=no",
        "Connection Timeout=30",
        f"UID={{{client_id_msi}}}", 
        "Authentication=ActiveDirectoryMsi"
    ]

params = urllib.parse.quote_plus(";".join(connect_params))
connection_string = f"mssql+pyodbc:///?odbc_connect={params}"

engine = sqlalchemy.create_engine(
        connection_string,
        fast_executemany=True,
)

pd.read_sql("select 1", engine)

It looks like there is also people at Microsoft hitting this issue: https://techcommunity.microsoft.com/t5/azure-database-support-blog/lesson-learned-384-odbc-driver-not-supporting-aks-workload/ba-p/3858209

tianqi-z commented 8 months ago

Hi, any updates on this? I'm also having issues with ODBC driver for workload identity. I was able to connect using ADO/jdbc connections strings, but having below error when using odbc. [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Login failed for user 'token-identified principal'.