Problem accessing public files on Dropbox #362

Closed tloredo closed 1 year ago

tloredo commented 1 year ago

Thank you for Pooch! I'd appreciate any pointers regarding this problem. Perhaps I need to use the Dropbox API for this purpose, but it's peculiar that wget has no problem with the URLs that Pooch is having trouble with.

This may be a feature request rather than an actual bug (re: Dropbox support).

Description of the problem:

I'm working with a student on a Python package that includes large data files. When the data files are finalized, we'll put them somewhere fairly permanent, like Zenodo or Dataverse. While we're revising things, though, I've put the files on Dropbox. When Pooch retrieves them via their public Dropbox URLs, it gets HTML files with a "Couldn't preview this file" message instead of the data (.h5 files). Yet wget, given the same URLs we're providing to Pooch, retrieves the binary .h5 files with no problem.

Full code that generated the error

import pooch

fetcher = pooch.create(
    # NOTE: the path and base_url arguments were lost from this post;
    # the two values below are placeholders, not the original ones.
    path=pooch.os_cache("my-package"),
    base_url="https://example.com/",
    registry={
        "lambda-3923-4010-phases-100.h5": None,
        "lambda-3923-6664-phases-4.h5": None,
    },
    # Now specify custom URLs for some of the files in the registry.
    urls={
        "lambda-3923-4010-phases-100.h5": "https://www.dropbox.com/s/5b9m1pq5qif5obf/lambda-3923-4010-phases-100.h5?dl=0",
        "lambda-3923-6664-phases-4.h5": "https://www.dropbox.com/s/pyeapovhk4q6az0/lambda-3923-6664-phases-4.h5?dl=0",
    },
)

# These paths end up pointing to files named as indicated, but containing HTML
# corresponding to a Dropbox "can't preview" response:
full_spec_path = fetcher.fetch("lambda-3923-6664-phases-4.h5")
ca_spec_path = fetcher.fetch("lambda-3923-4010-phases-100.h5")

Full error message

There is no error message; rather, Pooch triggers an attempted preview from Dropbox instead of accessing the actual data files.
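Since the failure is silent, one way to catch it is to inspect the first bytes of the fetched file. The sketch below is not part of Pooch: `looks_like_hdf5` is a hypothetical helper, and it assumes the HDF5 signature sits at byte offset 0 (files with a user block place it at a later offset, which this check would miss).

```python
# The 8-byte signature that starts an HDF5 file with no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path) -> bool:
    """Return True if the file at `path` begins with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

Calling `looks_like_hdf5(full_spec_path)` after `fetcher.fetch(...)` would return False for the HTML preview page. Supplying a real hash in the registry instead of None should have a similar effect, since Pooch checks downloads against known hashes.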

System information

Output of conda list

santisoler commented 1 year ago

Hi @tloredo. Thanks for opening this issue.

Pooch is downloading the HTML page that Dropbox serves when you access https://www.dropbox.com/s/5b9m1pq5qif5obf/lambda-3923-4010-phases-100.h5?dl=0. The Download button on that page is not a regular anchor with a static link, but a dynamic button that triggers the download of the desired file.

After reading Dropbox's docs, it looks like you can force it to give you the download link by replacing the trailing dl=0 with dl=1. I just tried it out and it downloaded a binary file, which I suppose is the .h5 file you want to fetch.

I'm not sure how wget is able to download the file even when you pass dl=0. Maybe it parses the URL and uses the one with dl=1 instead.
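The dl=0 to dl=1 substitution described above can also be done programmatically rather than by hand-editing each link. This is a minimal sketch using only the standard library; `force_dropbox_download` is a hypothetical helper name, not a Pooch or Dropbox function.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_dropbox_download(url: str) -> str:
    """Return `url` with its `dl` query parameter set to 1."""
    parts = urlsplit(url)
    # Keep every query parameter except an existing `dl`, then append dl=1.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "dl"
    ]
    query.append(("dl", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The result could then be used as a value in the `urls` dictionary passed to `pooch.create`.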

BTW, if you are going to pass custom URLs for every file in the registry, you could use an empty string for your base_url, since it won't be used at any point.

This should work:

import pooch

fetcher = pooch.create(
    # NOTE: the path argument was lost from this post; this value is a placeholder.
    path=pooch.os_cache("my-package"),
    # Empty base_url, since every file in the registry has a custom URL.
    base_url="",
    registry={
        "lambda-3923-4010-phases-100.h5": None,
        "lambda-3923-6664-phases-4.h5": None,
    },
    # Now specify custom URLs for some of the files in the registry.
    urls={
        "lambda-3923-4010-phases-100.h5": "https://www.dropbox.com/s/5b9m1pq5qif5obf/lambda-3923-4010-phases-100.h5?dl=1",
        "lambda-3923-6664-phases-4.h5": "https://www.dropbox.com/s/pyeapovhk4q6az0/lambda-3923-6664-phases-4.h5?dl=1",
    },
)

# With dl=1, these paths should point to the binary .h5 files rather than to
# the HTML of a Dropbox "can't preview" response:
full_spec_path = fetcher.fetch("lambda-3923-6664-phases-4.h5")
ca_spec_path = fetcher.fetch("lambda-3923-4010-phases-100.h5")

tloredo commented 1 year ago

@santisoler, thanks so much for the quick and helpful response (and the tip about the base url string)! dl=1 does solve this problem. I wonder what wget is doing, but in any case this Pooch issue is solved.

santisoler commented 1 year ago

Glad to be helpful! 🙂

tloredo commented 1 year ago

Just a further follow-up: some poking around on Stack Exchange, after getting @santisoler's solution, suggests that Dropbox recognizes some user agents and handles requests from them in special ways. See, e.g., linux - how to download dropbox files using wget command? - Super User and curl - User-Agent affects Dropbox shared links download - Stack Overflow. The lesson is that a Dropbox URL working with some user agents doesn't mean it will work with others.
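To probe the user-agent effect described above, one could request the same share link with different User-Agent headers and compare what comes back. The sketch below uses only the standard library; `build_request` and `content_type_for` are hypothetical helper names, and how Dropbox actually responds to any given user agent is an assumption to verify, not something this code guarantees.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request for `url` that identifies itself as `user_agent`."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def content_type_for(url: str, user_agent: str, timeout: float = 30.0) -> str:
    """Fetch `url` (following redirects) and return the response Content-Type."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")
```

Comparing `content_type_for(url, "Wget/1.21")` with `content_type_for(url, "python-requests/2.31")` for one dl=0 link (both user-agent strings are illustrative) would show whether one gets an HTML page while the other gets the binary file.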