fatiando / pooch

A friend to fetch your data files
https://www.fatiando.org/pooch

Add support for downloading from Azure cloud storage #382

Open FlorisCalkoen opened 9 months ago

FlorisCalkoen commented 9 months ago

Edit by @leouieda on 2024-02-19

Add an AzureDownloader that can fetch data from Azure cloud storage. It should support an authentication token, ideally with the option to read it from an environment variable.


Original issue 👇🏾

Description of the desired feature:

Would it be possible to add support for fetching data from private cloud containers?

import os

import dotenv
import pooch
import pandas as pd

dotenv.load_dotenv(override=True)
sas_token = os.getenv("AZURE_STORAGE_SAS_TOKEN")

storage_options = {"account_name": "storage_account_name", "account_key": sas_token}

href = "az://some/private/container/file.parquet"
fp = pooch.retrieve(href, known_hash=None, storage_options=storage_options)
pd.read_parquet(fp)

# This currently works for Azure, but I'm not sure if it's the best approach
href = "az://some/private/container/file.parquet" + sas_token
fp = pooch.retrieve(href, known_hash=None)
pd.read_parquet(fp)

Are you willing to help implement and maintain this feature? Maybe, yes!

remrama commented 9 months ago

@FlorisCalkoen I just made a custom Downloader like this, but for Google Cloud Storage. If it's useful to you, I linked it in a comment on a similar issue thread (#363).

leouieda commented 8 months ago

Hi @remrama @FlorisCalkoen @WesleyTheGeolien, I have zero experience with cloud containers, but since multiple people have requested this, we can look into it.

As @remrama said, this would be best implemented as a downloader. It could take the token as input but could also take a name of an environment variable and do the reading for you.
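A minimal sketch of that token handling, assuming a hypothetical helper named `resolve_token` (neither the function nor its `token` and `token_env` parameters exist in Pooch). It only shows the "pass the token, or pass the name of an environment variable" idea and does no downloading:

```python
import os


def resolve_token(token=None, token_env=None):
    """Return an explicit token, or read one from a named environment variable.

    Hypothetical helper for illustration only; not part of Pooch.
    """
    if token is not None and token_env is not None:
        raise ValueError("Pass either 'token' or 'token_env', not both.")
    if token is not None:
        # The caller supplied the token directly.
        return token
    if token_env is None:
        raise ValueError("Provide a token or the name of an environment variable.")
    try:
        # The caller supplied only the variable name; do the reading for them.
        return os.environ[token_env]
    except KeyError:
        raise ValueError(
            f"Environment variable '{token_env}' is not set."
        ) from None
```

A downloader could call this once in its constructor, for example `resolve_token(token_env="AZURE_STORAGE_SAS_TOKEN")`, so that the token never has to appear in user code.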

From what I gather, each cloud provider would have its own API for fetching the data, so they'd need separate implementations. Since Pooch is supposed to be a very lightweight dependency for other projects, any downloader that requires a new dependency would have to make that dependency optional. We already do this for SFTP, for example.
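One way the optional dependency could look, as a rough sketch: try the import, fall back to `None`, and only complain when someone actually creates the downloader. The module name `azure.storage.blob` (package `azure-storage-blob`) is an assumption about which client library would be used, and `optional_import` and this `AzureDownloader` are hypothetical names, not existing Pooch code:

```python
import importlib


def optional_import(name):
    """Import a module by name, returning None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Assumption: the Azure client library lives in this module.
azure_blob = optional_import("azure.storage.blob")


class AzureDownloader:
    """Hypothetical downloader that needs an optional dependency."""

    def __init__(self, token=None):
        # Fail at construction time, not at import time, so that users
        # without the Azure package can still import everything else.
        if azure_blob is None:
            raise ValueError(
                "Missing package 'azure-storage-blob' required for Azure downloads."
            )
        self.token = token
```

Importing the module that defines the class stays cheap and safe this way; the error only shows up for people who try to use the Azure downloader without the extra package.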

I'll edit this issue and #363 to make them explicitly about AWS and Azure. @remrama would you mind opening a new one for Google Cloud Storage and including the link to your code?

If either of you would like to implement this, then it would be great! We'd need:

  1. A new downloader (GCSDownloader, AWSDownloader, AzureDownloader) in pooch/downloaders.py (see https://www.fatiando.org/pooch/latest/downloaders.html and the existing downloaders). Make sure to add it to the choose_downloader function so that Pooch can automatically find it based on the prefix (az: etc).
  2. The test data in our data folder uploaded to the storage so we can test that it works.
  3. Tests in pooch/tests/test_downloaders.py that check if the download works and that any errors that should be raised are actually raised.
  4. Example documentation, probably in https://www.fatiando.org/pooch/latest/protocols.html
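
To illustrate items 1 and 3, here is a simplified, standalone sketch of prefix-based dispatch. It does not import Pooch: the two downloader classes are empty stubs, and this `choose_downloader` is a stand-in written for the example, not Pooch's actual implementation. The `az` prefix is taken from the example URLs in this issue.

```python
from urllib.parse import urlsplit


class HTTPDownloader:
    """Stub standing in for an existing downloader."""


class AzureDownloader:
    """Stub standing in for the proposed (hypothetical) downloader."""


# Map each URL prefix (protocol) to the downloader class that handles it.
KNOWN_DOWNLOADERS = {
    "http": HTTPDownloader,
    "https": HTTPDownloader,
    "az": AzureDownloader,
}


def choose_downloader(url):
    """Pick a downloader instance based on the URL prefix (simplified)."""
    protocol = urlsplit(url).scheme
    if protocol not in KNOWN_DOWNLOADERS:
        raise ValueError(
            f"Unrecognized URL protocol '{protocol}' in '{url}'. "
            f"Must be one of {sorted(KNOWN_DOWNLOADERS)}."
        )
    return KNOWN_DOWNLOADERS[protocol]()
```

The tests would then cover both sides: that an `az://` URL resolves to the new downloader, and that an unsupported prefix raises the expected error.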

I'm not sure what the pricing model is for these providers (which is why I never bothered with them), but if it's not possible to host our test data on them so that we can verify the functionality, then I think it's best to leave the downloader outside of Pooch itself.