astronomy-commons / hipscat

Hierarchical Progressive Survey Catalog
https://hipscat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Inconsistent handling of headers in storage_options for pandas read methods #295

Open Schwarzam opened 2 weeks ago

Schwarzam commented 2 weeks ago

I encountered an issue when trying to use JWT authentication with pandas file read methods, such as read_parquet. The problem arises due to the different ways headers need to be specified for HTTP(S) URLs when using pandas and fsspec.

Typically, the header for a request with JWT authentication looks like this:

{
    "headers": {"Authorization": "Token XXXXXXX"}
}

When accessing files, storage_options is used to send these headers. However, pandas and fsspec handle them inconsistently. fsspec expects the storage options to include the "headers" key as shown above, whereas pandas expects the key-value pairs to be passed directly as header options, without the "headers" key.

According to the pandas.read_parquet documentation (the same applies to all pandas read methods): "For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options."

Thus, for HTTP connections, the storage options should be formatted as follows:

{
    "Authorization": "Token XXXXXXX"
}
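
To illustrate why the nested form fails on the pandas side, here is a minimal sketch that assumes pandas hands storage_options to urllib.request.Request as the headers argument, as the documentation quoted above describes. The URL is a hypothetical placeholder and no network request is made:

```python
import urllib.request

# Hypothetical placeholder URL; nothing is actually fetched here.
url = "https://example.com/catalog/data.parquet"

# fsspec-style options: the headers are nested under a "headers" key.
nested_options = {"headers": {"Authorization": "Token XXXXXXX"}}
# pandas-style options: the headers are the top-level key-value pairs.
flat_options = {"Authorization": "Token XXXXXXX"}

# Assumption: pandas builds the request roughly like this for HTTP(S) URLs.
nested_req = urllib.request.Request(url, headers=nested_options)
flat_req = urllib.request.Request(url, headers=flat_options)

# With the nested form, the whole dict becomes the value of a header
# literally named "Headers", so no Authorization header is set.
print(nested_req.get_header("Authorization"))
# With the flat form, the Authorization header is set as intended.
print(flat_req.get_header("Authorization"))
```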

This discrepancy causes errors in the pandas-based read methods in file_io.py, such as read_parquet_file_to_pandas, among others.

Suggested Solution

To resolve this issue, I suggest adding the following lines before each method that reads with pandas, so that only the headers in storage_options are adjusted. Here's the suggested code snippet:

if storage_options is not None and "headers" in storage_options:
    headers = storage_options.pop("headers")
    storage_options = {**storage_options, **headers}

I tested this locally and it works. I don't know if it's better to create a function to avoid repeating the pattern.
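
If a shared function is preferred, one possible shape is sketched below. The name flatten_headers_for_pandas is hypothetical and not part of hipscat. Unlike the snippet above, this version builds a new dict instead of calling pop, so the caller's storage_options stays usable for fsspec calls:

```python
from typing import Any, Dict, Optional


def flatten_headers_for_pandas(
    storage_options: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Return storage options with a nested "headers" dict merged to the top level.

    Hypothetical helper: the input dict is left unmodified.
    """
    if storage_options is None or "headers" not in storage_options:
        return storage_options
    # Copy every option except the nested "headers" entry.
    flattened = {k: v for k, v in storage_options.items() if k != "headers"}
    # Merge the individual headers in as top-level key-value pairs.
    flattened.update(storage_options["headers"] or {})
    return flattened


fsspec_options = {"headers": {"Authorization": "Token XXXXXXX"}}
pandas_options = flatten_headers_for_pandas(fsspec_options)
```

Each pandas-based reader could then pass storage_options=flatten_headers_for_pandas(storage_options) rather than repeating the three lines.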