Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License
4.56k stars 2.79k forks source link

Data Lake: Iterating over result of `get_paths` method on `FileSystemClient` raises HTTP error #35617

Closed ShivnarenSrinivasan closed 1 month ago

ShivnarenSrinivasan commented 4 months ago

Describe the bug Calling the get_paths method, and iterating over the result is throwing a HTTPResponseError.

HttpResponseError: (InvalidQueryParameterValue) Value for one of the query parameters specified in the request URI is invalid.

To Reproduce Steps to reproduce the behavior:

import os
from azure.storage.filedatalake DataLakeServiceClient
from azure.identity import ClientSecretCredential

ACCOUNT_NAME = os.getenv('ACCOUNT_NAME')

credential = ClientSecretCredential(os.getenv('TENANT_ID'), os.getenv('CLIENT_ID'), os.getenv('CLIENT_SECRET'))
service = DataLakeServiceClient(account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net", credential=credential)
# this connection works for creating, deleting, and modifying files and directories

filesystem = service.get_file_system_client('data')
for path in filesystem.get_paths('tmp'):
    print(path.name)
    # raises http exception instead

Expected behavior The iterator object returned by get_paths should yield valid filesystem files/directories.

Screenshots Error raised upon iteration: image

Additional context I do not believe this is a direct bug in the SDK, as I was able to replicate this issue while calling the underlying REST API directly--however, hoping there is some insight on the overall process. In any case, if the method does not work as documented, perhaps some changes are necessary.

github-actions[bot] commented 4 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jalauzon-msft @vincenttran-msft.

vincenttran-msft commented 4 months ago

Hi @ShivnarenSrinivasan , thanks for the inquiry! After taking a closer look at your RequestId, we were able to come up with a successful repro of the error you are facing! The sample snippet code provided above actually does not face this issue, and so here is a seperate code example that should help explain what is the root cause.

image Here is an example screenshot from my Azure Portal.

With that being said, taking a look at your RequestId reveals that you are wrongly passing in a directory to the get_file_system_client API.

For example, your code that reproduces the failure in this example would look like: service.get_file_system_client('filesystemlevel/firstdirectorylevel')

Whereas the correct code snippet would look like: service.get_file_system_client('filesystemlevel')

Then, if your goal is to drill down to the paths in tmp, you would pass: filesystem.get_paths('firstdirectorylevel/seconddirectorylevel/tmp')

In short, the root cause of the issue is that you were specifying more than just the file system when getting a file system client. Hopefully this example makes sense and should unblock your workflow, otherwise please do not hesitate to reach out again!

Thanks!

ShivnarenSrinivasan commented 4 months ago

Thanks a lot, @vincenttran-msft -- in an attempt to simplify the code I was working with, I seem to have left out the most critical detail. My apologies. One piece to add, is I am part of an organization where I do not have admin privileges, and the directory I was trying to access was merely provisioned for me; hence I was unaware of the container/directory distinction.

This explanation is very helpful, and after making the changes, I'm good to go. The issue I raised is certainly closed, but this does feel like a "gotcha" to an uninitiated user (esp. since the HTTPException is so generic).

Docs

I am not well versed with the terminology yet, but I couldn't find any specification of what a filesystem represents, or the restrictions (i.e should be a container) in the docs (which I believe is the README in the relevant git directory). Would it be worth adding some? Happy to submit a PR, it could be in the README, or in the get_file_system method itself.

Runtime Check

Further, I don't know if this is a correct assumption, but if containers cannot be nested--that means that the only valid argument to the get_file_system call would be a root level path. Would it be appropriate to add a runtime check to ensure there only a single path (i.e top level) is passed, rather than what I did?

vincenttran-msft commented 1 month ago

Hi @ShivnarenSrinivasan , sorry for the delay as this issue completely fell off my radar. After a brief audit of the current README, I found that the information contained in the README (and especially the code samples) sufficiently expressed the hierarchy of the azure-storage-file-datalake package.

However, I did notice a gap in documentation regarding each specific client, and I agree that we can do better in this area and that by linking to each relevant client's documentation, this can help future developers and could have helped prevent you from encountering this issue!

With that being said, I have just put up a pull-request containing these amendments to the documentation, and with your above response showing that the issue has been resolved on your end, we will close out this issue alongside the successful merge of the pull request above to completely mark the issue as addressed! Thanks again for your patience and contribution!