fatiando / pooch

A friend to fetch your data files
https://www.fatiando.org/pooch

Add support for downloading from Google Cloud Storage #398

Open remrama opened 9 months ago

remrama commented 9 months ago

Add a GCSDownloader that can fetch the data from Google Cloud Storage. It should support an authentication token, ideally with the option to read it from an environment variable.

See the matching feature requests for other cloud storage services: Amazon's AWS (#363) and Microsoft's Azure (#382).

This would require:


I've got a fully functional GCSDownloader class in a fork, but without tests. It uses the google-cloud-storage package for authentication and downloading. The credentials can be passed directly to the downloader or read from an environment variable. It also supports the tqdm progress bar option.

# Authorize by setting an environment variable
import os
import pooch
credentials = "google_app_credentials.json"
url = "gs://bucket_name/blob_name.txt"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials
filename = pooch.retrieve(url, known_hash=None)

# Authorize by passing credentials to the custom downloader
# (GCSDownloader exists only in the fork; `url` is reused from the example above)
from pooch import GCSDownloader
credentials = "google_app_credentials.json"
downloader = GCSDownloader(credentials=credentials)
filename = pooch.retrieve(url, known_hash=None, downloader=downloader)

I can't speak to long-term maintenance, but I would be interested in adding tests and submitting a PR within the next month.

leouieda commented 9 months ago

Thanks @remrama! We'd be happy to have this.

leouieda commented 5 months ago

Hi @remrama we've been thinking quite a lot about whether we should add this to pooch itself or if it would be better as a separate project that provides the downloaders only.

I think the main issue is testing all of this. With Zenodo and figshare we can be pretty certain that the test data will stay there in the long term. But with all of these cloud providers, can we trust that our test data will be there? Can we update it without the original uploader? Is it free?

I don't use these cloud storage services, so I don't know the answers to these questions.

remrama commented 5 months ago

@leouieda we are on the exact same page. My hold-up on this was all about trying to come up with the best way to run the tests. I don't think there's a way to properly run them without using a private Google account (including fees for the calls, even if small).

I was sitting on it, thinking a solution might pop up, and in the meantime I've been playing around with the Zenodo downloader. I've become very appreciative of this feature in pooch. I find the Zenodo (and figshare) downloaders incredibly convenient. And for people who are trying to download datasets that they can't make public, Zenodo even offers private repos. While this won't meet everyone's needs, I think the current DOI downloaders are sufficient for the practical minimalism of pooch.

I vote to exclude this feature. I think I'll just keep my GCSDownloader in a public fork or even just a Gist file and pull it down whenever I need it. At most, maybe you'd want to add an example in the docs showing this approach, but even that I'm not so sure about.
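A standalone downloader of the kind described here could look roughly like the sketch below. It is not the code from the fork. It assumes only that Pooch accepts any callable with the signature `(url, output_file, pooch)` as a custom downloader and that the google-cloud-storage package is installed. The names `SketchGCSDownloader` and `split_gs_url` are hypothetical, and the progress bar is left out.

```python
from urllib.parse import urlsplit


def split_gs_url(url):
    """Split 'gs://bucket/path/to/blob' into (bucket, blob). Hypothetical helper."""
    parts = urlsplit(url)
    blob_name = parts.path.lstrip("/")
    if parts.scheme != "gs" or not parts.netloc or not blob_name:
        raise ValueError(f"Expected a URL like gs://bucket/blob, got '{url}'.")
    return parts.netloc, blob_name


class SketchGCSDownloader:
    """Illustrative downloader for gs:// URLs (an assumption, not the fork's class)."""

    def __init__(self, credentials=None):
        # Path to a service account JSON file, or None to fall back on the
        # GOOGLE_APPLICATION_CREDENTIALS environment variable.
        self.credentials = credentials

    def __call__(self, url, output_file, pooch, check_only=False):
        # Imported lazily so the module can be loaded without the dependency.
        from google.cloud import storage

        bucket_name, blob_name = split_gs_url(url)
        if self.credentials is None:
            client = storage.Client()
        else:
            client = storage.Client.from_service_account_json(self.credentials)
        blob = client.bucket(bucket_name).blob(blob_name)
        if check_only:
            # Only report whether the blob exists, without downloading it.
            return blob.exists()
        blob.download_to_filename(str(output_file))
        return None
```

An instance would be passed the same way as in the original example: `pooch.retrieve(url, known_hash=None, downloader=SketchGCSDownloader(credentials="google_app_credentials.json"))`.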

leouieda commented 5 months ago

@remrama good to know! I also use the DOI downloaders quite a lot myself.

I was speaking with @santisoler about possibly creating some form of plugin system for Pooch downloaders. The idea would be that other packages can implement custom downloaders associated with different protocols and Pooch could find them and hook them up to the machinery that matches protocols in URLs to downloader classes. But this is a bit beyond what I have time for lately.
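To make the plugin idea concrete, here is a minimal sketch of a registry that maps URL protocols to downloader classes. Everything in it is hypothetical: `register_downloader`, `choose_registered_downloader`, and `EchoDownloader` are not part of Pooch, and a real plugin system would also need a way to discover downloaders provided by other packages (for example through package entry points), which is not shown.

```python
from urllib.parse import urlsplit

# Hypothetical registry: URL scheme (e.g. "gs") -> downloader class.
_DOWNLOADER_REGISTRY = {}


def register_downloader(scheme, downloader_class):
    """Associate a URL scheme with a downloader class."""
    _DOWNLOADER_REGISTRY[scheme.lower()] = downloader_class


def choose_registered_downloader(url, **kwargs):
    """Instantiate the downloader registered for the scheme of `url`."""
    scheme = urlsplit(url).scheme.lower()
    try:
        downloader_class = _DOWNLOADER_REGISTRY[scheme]
    except KeyError:
        raise ValueError(
            f"No downloader registered for protocol '{scheme}' in '{url}'."
        ) from None
    return downloader_class(**kwargs)


class EchoDownloader:
    """Stand-in downloader used only to exercise the registry."""

    def __call__(self, url, output_file, pooch, check_only=False):
        if check_only:
            return True
        # Write the URL itself instead of fetching anything.
        with open(output_file, "w") as handle:
            handle.write(url)
        return None


# A third-party package would call this once, at import time.
register_downloader("gs", EchoDownloader)
downloader = choose_registered_downloader("gs://bucket_name/blob_name.txt")
```

Explicit registration keeps the sketch short. The trade-off is that the plugin package has to be imported before its protocol becomes known.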

In the meantime, if you want help distributing your GCSDownloader class as a standalone package, we can help with that.

remrama commented 5 months ago

Sounds good, thanks. I'm not so familiar with plugin systems, but it sounds like a good idea for this feature. As for plans to implement the GCSDownloader, I don't really have any right now. The existing DOIDownloader has been satisfying all my needs. If I need a more accessible GCSDownloader again, I'll probably look back into these more convenient packaging options and revive this idea. It's not so far-fetched. I imagine I will need it at some point; I'm just not sure how soon.