databricks / cli

Databricks CLI

databricks-cli: Authentication error using token for service principal #1208

Open sondrebouvet opened 8 months ago

sondrebouvet commented 8 months ago

Steps to reproduce the behavior

Using the composite GitHub action setup-cli (https://github.com/databricks/setup-cli), an existing pipeline fails. Installing databricks-cli at the latest version via curl yields the same issue, which indicates the problem is not the composite action but databricks-cli itself. The authentication method is a Databricks personal access token generated by a service principal. The token has been tested and is valid: it can be used successfully in GitHub Actions against the Databricks API 2.0 endpoints directly. The issue is restricted to the newest version of databricks-cli; the legacy pip-based CLI works fine, eliminating the possibility of an incorrectly configured workspace or token.

OS and CLI version

GitHub Actions ubuntu-latest. Databricks CLI: v0.212.4

Is this a regression?

Appears to be a regression, as the pip-based databricks-cli works fine.

Debug Logs

Error: unexpected error handling request: json: cannot unmarshal number into Go struct field APIErrorBody.error_code of type string. This is likely a bug in the Databricks SDK for Go or the underlying REST API. Please report this issue with the following debugging information to the SDK issue tracker at https://github.com/databricks/databricks-sdk-go/issues. Request log:

GET /api/2.0/dbfs/get-status?path=/FileStore/wheels/test_packages
> * Host: 
> * Accept: application/json
> * Authorization: REDACTED
> * User-Agent: cli/0.212.4 databricks-sdk-go/0.30.1 go/1.21.6 os/linux cmd/fs_cp auth/pat cicd/github
< HTTP/2.0 403 Forbidden
< * Access-Control-Allow-Headers: Authorization, X-Databricks-Azure-Workspace-Resource-Id, X-Databricks-Org-Id, Content-Type
< * Access-Control-Allow-Origin: *
< * Cache-Control: no-cache, no-store, must-revalidate
< * Content-Length: 45
< * Content-Type: application/json; charset=utf-8
< * Date: Wed, 14 Feb 2024 10:04:03 GMT
< * Expires: 0
< * Pragma: no-cache
< * Server: databricks
< * Vary: Accept-Encoding
< * X-Databricks-Reason-Phrase: user not found
< {
<   "error_code": 403,
<   "message": "user not found"
< }
pietern commented 8 months ago

Thanks for reporting.

In the debug log, did you redact the host field, or was it empty in the trace?

sondrebouvet commented 8 months ago

> Thanks for reporting.
>
> In the debug log, did you redact the host field, or was it empty in the trace?

It was empty, but we have confirmed that the hostname is correct and seemingly not related to the issue.

pietern commented 8 months ago

Can you share how you're invoking the CLI from the action? E.g. are you using a profile, or setting env vars, and if so, how?

sondrebouvet commented 8 months ago

> Can you share how you're invoking the CLI from the action? E.g. are you using a profile, or setting env vars, and if so, how?

Setting the environment in the step and invoking databricks fs cp to a DBFS location:

      - name: Deploy .whl to Databricks DBFS
        env: 
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
          PACKAGE_NAME: ${{ inputs.package_name }}
          PACKAGE_FOLDER: ${{ inputs.package_folder }}
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        run: |
          # Copy wheel package
          databricks fs cp "dist/${{ env.PACKAGE_NAME}}"  "dbfs:/FileStore/wheels/${{ env.PACKAGE_FOLDER }}${{ env.PACKAGE_NAME}}" --overwrite
pietern commented 8 months ago

If I understand correctly, you were previously using the legacy (Python) CLI in the same action, then replaced it with this one by using the setup-cli action and it stopped working?

Note that there are expected incompatibilities between the legacy CLI and this one. The cp command, however, should be compatible between these versions.

Can you confirm that other API calls fail as well? E.g. you could include a step where you run:

databricks current-user me

This prints out the user you're logged in as (who owns the token). If the cp command fails, I expect that to fail as well.

The action setup looks good.

sondrebouvet commented 8 months ago

Some context that I should have included in the original issue: we had been using the legacy Python CLI until about a month ago. We switched to the new CLI without changing the original copy command, as the syntax is identical. This worked fine until the latest release. It still works in our dev environment, for reasons not clear to us. As a sanity check, we have tested the following:

- Tested the token locally using the API and an older version of databricks-cli: works fine.
- Tested a different call with the new databricks-cli (databricks clusters list): did not work either.
- Tested GitHub Actions jobs using the API and the legacy CLI, with the same token and workspace: works fine.

sondrebouvet commented 8 months ago

Our temporary solution, as of now, has been to use the legacy Python CLI. As mentioned before, we have, however, used the new databricks-cli successfully. The error discussed in this issue first appeared when running a release in our production environment. Our initial thought was that this behavior was caused by a config mismatch between our dev and prod environments. However, this cannot be the case, as the legacy CLI as well as the API (using curl) still work fine in both environments.

pietern commented 8 months ago

Thanks for the additional information.

Can you confirm the last version of the new CLI that did work? I.e., did it work with v0.212.3 and start failing with v0.212.4? Are there other env vars at play, or perhaps a .databrickscfg when trying this locally?

If you have a concrete repro, as in, it works with version X but not with version Y, then we could bisect and look at what changed between those versions.

sondrebouvet commented 8 months ago

Sorry for taking a while to get back to you. I have now looked through many older versions of databricks-cli using the setup composite action. I have found that version v0.200.1 works (using reference 3f1981093bda661acaa5dccb3a191d3e146f6327 from the setup repo). I assume that versions between v0.200.1 and the latest tested (v0.212.4) might also work. I will look into it further.

sondrebouvet commented 8 months ago

I tested multiple versions using databricks clusters list and databricks fs cp <path> <dbfs_path>, both of which do not work in the latest version but do work in v0.200.1.

sondrebouvet commented 8 months ago

Further testing shows that the affecting change happened in a commit between versions v0.203.1 and v0.203.2.

pietern commented 8 months ago

Thanks for digging in. And to confirm, once you're on v0.203.2, you see the error you included in the issue summary?

sondrebouvet commented 8 months ago

Almost; I think there have been some changes to the actual traceback, but the error is the same:

    Run # Copy wheel package
    Error: Response from server (403 Forbidden) {"error_code":403,"message":"user not found"}: json: cannot unmarshal number into Go struct field APIErrorBody.error_code of type string
    Error: Process completed with exit code 1.

sondrebouvet commented 8 months ago

Any updates here @pietern ?


Abdul-Arfat-Mohammed commented 7 months ago

Even with the old cli, I'm facing this same issue with user PAT. This used to work earlier, nothing has been changed in the config.

Weirdly enough, the GitHub Action works fine in our dev env. It fails only in our prod env.

I even tested the prod token from my local machine, below are my findings:

  • databricks jobs list --all --version=2.1 → This works fine
  • databricks jobs reset --json-file $jsoncontent --job-id $jobidtoedit --version=2.1 → This fails, weirdly, throwing the error below:
Error: Authorization failed. Your token may be expired or lack the valid scope

I'm confused by this behaviour. How would listing jobs work with the same token?

I'm using this to install the cli:

- name: install-databricks-cli
  uses: microsoft/install-databricks-cli@v1.0.0

@sondrebouvet @pietern I would love to know your thoughts on this.

Update: I found the root cause using the --debug option; it concerns the user's permissions to use the service principal.

sondrebouvet commented 7 months ago

@Abdul-Arfat-Mohammed, if no version of the CLI works for you when authenticating with your production environment, I don't think it's related to this issue. Our issue seems related to a change that occurred in version v0.203.2.

Abdul-Arfat-Mohammed commented 7 months ago

> @Abdul-Arfat-Mohammed, if no version of the CLI works for you when authenticating with your production environment, I don't think it's related to this issue. Our issue seems related to a change that occurred in version v0.203.2.

@sondrebouvet Thanks for the confirmation. Yes, I agree.

Update: I found the root cause using the --debug option; it concerns the user's permissions to use the service principal.

I have updated in my previous comment.

grazianom-tuidi commented 6 months ago

Hello, I'm trying to perform a similar workflow, setting up a GitHub action like the one mentioned by @sondrebouvet. I've noticed that the authentication issue started when I updated my VS Code Databricks extension to the preview version in order to use Databricks Asset Bundles. My current flow requires uploading a wheel file to multiple workspaces, performing a databricks fs cp ... in several steps using a different pair of DATABRICKS_TOKEN and DATABRICKS_HOST each time. If I don't upload databricks.yml to Git, the action runs with no issues. On the other hand, if I upload it, the CLI seems to try to connect to the default target workspace indicated in databricks.yml.

Could it be that some environment variable or something similar is set by that file and evaluated with higher priority than DATABRICKS_HOST? Right now, the only fix seems to be a script that runs databricks configure --token for each workspace I need to use.

andrewnester commented 6 months ago

@grazianom-tuidi you can try running the databricks auth describe command and provide the output here. It shows which auth is used, where parameters are coming from, etc.

grazianom-tuidi commented 6 months ago

Sure, here's the output.

Unable to authenticate: default auth: azure-cli: cannot get access token: ERROR: Please run 'az login' to setup account.
. Config: host=https://adb-*****.azuredatabricks.net/
-----
Current configuration:
  ✓ host: https://adb-****.azuredatabricks.net/ (from bundle)
  ✓ profile: default

Right now, it seems to fail even for the first workspace because az login is required. I might have failed to revert things back to my previous settings; here's my action:

      - name: Upload the wheels
        uses: actions/upload-artifact@v3
        with:
          name: Upload wheel
          path: dist/*.whl

      - name: Run auth describe
        run: databricks auth describe

      - name: Push to DBFS (Report)
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_REPORT }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_REPORT }}
        run: |
          # Copy wheel files
          databricks fs mkdir dbfs:/FileStore/libraries
          for f in dist/*.whl; do
              databricks fs cp $f dbfs:/FileStore/libraries/test-latest-py3-none-any.whl --overwrite
          done

      - name: Push to DBFS (dev)
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: |
          # Copy wheel files
          databricks fs mkdir dbfs:/FileStore/libraries
          for f in dist/*.whl; do
              databricks fs cp $f dbfs:/FileStore/libraries/test-latest-py3-none-any.whl --overwrite
          done

From the databricks auth describe output, it seems the CLI is using the host defined by the bundle, so trying to override it using the env variable does not work. Is this the expected behaviour?

andrewnester commented 6 months ago

@grazianom-tuidi yes, this is expected behaviour at the moment; see https://github.com/databricks/cli/issues/1358 for details.

grazianom-tuidi commented 6 months ago

I see. Since there's no way to override the bundle configs (aside from listing all possible targets in databricks.yml), I guess I'll stick to the old CLI, which seems to give priority to variables defined in .databrickscfg even with bundle configs already set.

andrewnester commented 3 months ago

@sondrebouvet is the issue still present for you on the very latest CLI version?

SophieBlum commented 3 months ago

Hi, I am taking over for @sondrebouvet here :) With the very latest CLI version, we still get an error from the databricks fs cp command, but the error itself has changed:

# Copy wheel package
  databricks fs cp "dist/${{ env.PACKAGE_NAME}}"  "dbfs:/FileStore/wheels/${{ env.PACKAGE_FOLDER }}${{ env.PACKAGE_NAME}}" --overwrite
...

Error: Invalid access to Org: 5435654711470629

As before, the exact same setup works fine with the older CLI version and in our dev environment (with the latest CLI version).

Francesco-Ranieri commented 2 weeks ago

One way to prioritize .databrickscfg in the latest version of databricks-cli is to run the command from a folder that does not contain databricks.yml. In the context of a GitHub action, I solved it as follows:

# Create .databrickscfg
echo "[DEFAULT]" > .databrickscfg
echo "host = ${{ secrets.HOST }}" >> .databrickscfg
echo "token = ${{ secrets.TOKEN }}" >> .databrickscfg
cd ..
# Copy wheel files
for f in <repo_name>/dist/*.whl; do
    databricks fs cp "$f" dbfs:/<path> --overwrite
done