flatironinstitute / deepblast

Neural Networks for Protein Sequence Alignment
BSD 3-Clause "New" or "Revised" License

Weights can't be downloaded with python (User-Agent blocked) #89

Closed konstin closed 2 years ago

konstin commented 3 years ago

The web server blocks Python-urllib as a user agent when downloading the weights:

$ curl -H "User-Agent: Python-urllib/3.8" https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-lstm4x.pt
error code: 1010
$ curl -H "User-Agent: not-python/3.8" https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-lstm4x.pt
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.16.1</center>
</body>
</html>

It's not hard to fake a different user agent, but I don't think the standard Python snippet for downloading the weights should be blocked.

mortonjt commented 3 years ago

I'm a little confused why this error is occurring ...

Does wget work for you? I'm able to download with the following command:

wget https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-lstm4x.pt


konstin commented 3 years ago

The problem occurs when using Python; curl was just to demonstrate that the user agent is the cause:

from urllib import request
request.urlretrieve("https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-lstm4x.pt", filename="deepblast-lstm4x.pt")

What I believe is happening is that the web server is configured to specifically block Python clients. This is sometimes done to deter scraping, even though in my eyes it makes no sense, as you can trivially circumvent it by setting a fake user agent (e.g. the user agent of a browser).
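For reference, a minimal sketch of that workaround using only the standard library. The helper names build_request and download are hypothetical and not part of deepblast, and the "not-python/3.8" string is simply the one from the curl test above. Whether the server accepts this from Python was not verified in this thread.

```python
import shutil
from urllib import request

WEIGHTS_URL = (
    "https://users.flatironinstitute.org/jmorton/public_www/"
    "deepblast-public-data/checkpoints/deepblast-lstm4x.pt"
)


def build_request(url, user_agent="not-python/3.8"):
    """Build a Request that sends a custom User-Agent header
    instead of the default Python-urllib one (hypothetical helper)."""
    return request.Request(url, headers={"User-Agent": user_agent})


def download(url, filename, user_agent="not-python/3.8"):
    """Stream the response body to `filename` (hypothetical helper)."""
    req = build_request(url, user_agent)
    # urlopen accepts a Request object, so the custom header is sent.
    with request.urlopen(req) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)
```

Calling download(WEIGHTS_URL, "deepblast-lstm4x.pt") would then take the place of the request.urlretrieve call above, since urlretrieve itself has no parameter for custom headers.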

mortonjt commented 2 years ago

Hi @konstin, are you still running into this problem? If not, I'd like to close this issue.

konstin commented 2 years ago

No idea, I'm using something different now.