ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
Apache License 2.0

image download for http://arctos.database.museum/media/10562312?open fails with http return code 450 (Blocked by Windows Parental Controls) #3950

Closed jhpoelen closed 2 years ago

jhpoelen commented 2 years ago

When programmatically accessing the image at http://arctos.database.museum/media/10562312?open , the image cannot be retrieved. However, when accessing the same URL via a browser, the image can be retrieved.

To Reproduce
Steps to reproduce the behavior:

  1. run curl -L "http://arctos.database.museum/media/10562312?open" > image.jpg to save the image into the file "image.jpg"
  2. open image.jpg in an image viewer
  3. inspect image

Expected behavior
The referenced image is available via image.jpg .

Actual behavior
The referenced image is not available in image.jpg, but it is available in a browser (see attached).

Also, for details on error message, see https://github.com/bio-guoda/preston/issues/132 .

$ curl -L -I "http://arctos.database.museum/media/10562312?open"
HTTP/1.1 450 
Set-Cookie: cfid=07a11439-41f1-4b3b-94c2-a9d265790e57;Path=/;Expires=Tue, 12-Oct-2021 15:47:31 UTC;HttpOnly
Set-Cookie: cftoken=0;Path=/;Expires=Tue, 12-Oct-2021 15:47:31 UTC;HttpOnly
Set-Cookie: cfid=94c04dfc-4be9-4664-b3df-58c8e72cf7c8;Path=/;Expires=Tue, 12-Oct-2021 15:47:31 UTC;HttpOnly
Set-Cookie: cftoken=0;Path=/;Expires=Tue, 12-Oct-2021 15:47:31 UTC;HttpOnly
Set-Cookie: JSESSIONID=8F32EB00537A1924C4F5806180E4C98A; Path=/; HttpOnly
Content-Type: text/html;charset=UTF-8
Content-Length: 616
Date: Wed, 22 Sep 2021 14:09:28 GMT
Server: lighttpd/1.4.55

Screenshots
[Image: chas_mamm_3302 7]
[Image: Screenshot from 2021-09-22 07-17-29]

dustymc commented 2 years ago

450 Blocked by Windows Parental Controls (Microsoft)

Wat?

curl works for me; the error suggests to me that it's your network (or some device on it), but ????


Dustys-MBP:~ dlm$ curl -L -I "http://arctos.database.museum/media/10562312?open"
HTTP/1.1 302 Found
Set-Cookie: cfid=fcba232b-d878-4634-8449-546a87c9d9bf;Path=/;Expires=Tue, 12-Oct-2021 16:03:54 UTC;HttpOnly
Set-Cookie: cftoken=0;Path=/;Expires=Tue, 12-Oct-2021 16:03:54 UTC;HttpOnly
Set-Cookie: JSESSIONID=D65196FF8732058E64C02716701EA618; Path=/; HttpOnly
Location: https://web.corral.tacc.utexas.edu/CAS/20161217-02/jpg/chas_mamm_3302.7.jpg
Content-Type: text/html;charset=UTF-8
Content-Length: 6398
Date: Wed, 22 Sep 2021 14:25:50 GMT
Server: lighttpd/1.4.55

HTTP/1.1 200 OK
Content-Type: image/jpeg
Accept-Ranges: bytes
ETag: "4236783506"
Last-Modified: Tue, 10 Jan 2017 22:55:32 GMT
Strict-Transport-Security: max-age=15768000;
Content-Length: 25517
Date: Wed, 22 Sep 2021 14:25:51 GMT
Server: lighttpd/1.4.55

jhpoelen commented 2 years ago

Interesting. I saw the error on a German-based server.

When I run the same command on my US internet connection:

$ curl -L -I "http://arctos.database.museum/media/10562312?open"
HTTP/1.1 302 Found
Set-Cookie: cfid=82d23655-6328-4d45-ae7d-67108d319c8c;Path=/;Expires=Tue, 12-Oct-2021 16:19:19 UTC;HttpOnly
Set-Cookie: cftoken=0;Path=/;Expires=Tue, 12-Oct-2021 16:19:19 UTC;HttpOnly
Set-Cookie: JSESSIONID=192E43CDAA352DC698D37F3219620E8C; Path=/; HttpOnly
Location: https://web.corral.tacc.utexas.edu/CAS/20161217-02/jpg/chas_mamm_3302.7.jpg
Content-Type: text/html;charset=UTF-8
Content-Length: 6398
Date: Wed, 22 Sep 2021 14:41:15 GMT
Server: lighttpd/1.4.55

HTTP/1.1 200 OK
Content-Type: image/jpeg
Accept-Ranges: bytes
ETag: "4236783506"
Last-Modified: Tue, 10 Jan 2017 22:55:32 GMT
Strict-Transport-Security: max-age=15768000;
Content-Length: 25517
Date: Wed, 22 Sep 2021 14:41:17 GMT
Server: lighttpd/1.4.55
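
To compare a blocked and an unblocked run without reading the headers by eye, the `curl -L -I` output can be parsed into one entry per response. The following is a minimal sketch, not part of any tool mentioned in this thread; the function names `parse_curl_head_output` and `describe` are made up for illustration, and it assumes the output format shown above (one header block per response, separated by blank lines).

```python
def parse_curl_head_output(text):
    """Split `curl -L -I` output into one (status_code, headers) pair per response."""
    hops = []
    # Normalize line endings, since raw HTTP headers end in CRLF.
    text = text.replace("\r\n", "\n")
    for block in text.strip().split("\n\n"):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines or not lines[0].startswith("HTTP/"):
            continue
        # The status line looks like "HTTP/1.1 302 Found" or "HTTP/1.1 450".
        status_code = int(lines[0].split()[1])
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            # Repeated headers (e.g. Set-Cookie) overwrite each other here,
            # which is acceptable for a short summary.
            headers[name.strip().lower()] = value.strip()
        hops.append((status_code, headers))
    return hops


def describe(hops):
    """Return a one-line summary of where the redirect chain ended."""
    if not hops:
        return "no response"
    final_status, final_headers = hops[-1]
    if final_status == 200:
        return "ok: " + final_headers.get("content-type", "unknown type")
    return "failed with HTTP %d after %d response(s)" % (final_status, len(hops))
```

Fed the output from the German-based server, `describe` would report a failure with HTTP 450 after a single response; fed the output above, it would report the final `image/jpeg` response reached after the 302 redirect.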

Are the images geo fenced? Or did this German server somehow end up on some blacklist?

dustymc commented 2 years ago

geo fenced

No, nor is anything else.

somehow end up on some blacklist

Very likely, especially if it's sharing IP space with some AWS-like farm. I'm happy to see what I can do if you want to provide an IP, but in general I don't have the resources to properly manage that kind of traffic, so I just run an aggressive blocker.

jhpoelen commented 2 years ago

Ok. If I provide some IP addresses, can you put those on a whitelist?

jhpoelen commented 2 years ago

I'm happy to see what I can do if you want to provide an IP

I didn't see your offer before. Yes, I will provide a list of (two) IPs to you via other channels.

dustymc commented 2 years ago

Thanks, I opened both of those.

As above, I don't have the resources to really manage this sort of thing, and there's a fair bit of not-so-great traffic from both of those so no promises that they won't get locked back down. I will do whatever I can safely do if there are more problems, and we can elevate through the Arctos administrative channels if that doesn't prove satisfactory. I assume you're doing something cool and I think we'd all like to support it, but - at the risk of sounding like a broken record - resources.....

Do please note https://arctos.database.museum/robots.txt - we're asking for a 10-second crawl delay, and there is some 'you look like an SEO bot that we have neither the resources nor desire to feed' logic around that.

Also please note that Media are licensed (see, e.g., http://arctos.database.museum/media/10562312) - I'm not sure where else to go with that, but the idea that we make it possible to get the media without the metadata comes up from time to time so there it is....

jhpoelen commented 2 years ago

Hey @dustymc -

Thanks for manually editing your whitelist.

I can confirm that the provided addresses now have access to the previously blocked content.

The following curl output shows a successful request:

$ curl -L -I "http://arctos.database.museum/media/10562312?open"
HTTP/1.1 302 Found
Set-Cookie: cfid=5ff0d7f1-f249-4c16-96f1-c4ceb541f4bb;Path=/;Expires=Tue, 12-Oct-2021 23:45:31 UTC;HttpOnly
Set-Cookie: cftoken=0;Path=/;Expires=Tue, 12-Oct-2021 23:45:31 UTC;HttpOnly
Set-Cookie: JSESSIONID=87FDAD1A9CCE281FEE023B2E720056E0; Path=/; HttpOnly
Location: https://web.corral.tacc.utexas.edu/CAS/20161217-02/jpg/chas_mamm_3302.7.jpg
Content-Type: text/html;charset=UTF-8
Content-Length: 6398
Date: Wed, 22 Sep 2021 22:07:27 GMT
Server: lighttpd/1.4.55

HTTP/1.1 200 OK
Content-Type: image/jpeg
Accept-Ranges: bytes
ETag: "4236783506"
Last-Modified: Tue, 10 Jan 2017 22:55:32 GMT
Strict-Transport-Security: max-age=15768000;
Content-Length: 25517
Date: Wed, 22 Sep 2021 22:07:30 GMT
Server: lighttpd/1.4.55

jhpoelen commented 2 years ago

And I can see your point about resources. This is something that I've been bringing up, and hoping to discuss more, in various meetings: the (hidden) cost of keeping and transferring "heavy" content like images, especially when they are stored centrally.

Also, the tools I am building package the metadata, including licensing, along with the images. The integrity of the resulting image corpus can be verified, so that you can trace exactly what was retrieved at what time, and what came before and after. Because the content is identified with hashes, the image corpus can be moved to other web locations (or offline storage) without compromising its integrity or the ability to verify that integrity. This also means that the copyright is associated with the exact image that was linked at some point in time. So, in theory, you can use this same technique (or other content fingerprinting techniques like spectral analysis / image statistics) to determine whether someone is using, or keeping a copy of, an image from your collection. In theory, this would also allow the content to be stored in a decentralized manner (perhaps with the Internet Archive or Library of Congress hosting the "heavy" images), while systems like Arctos provide the meaningful connections between the digital artifacts (copyright relations, associations with specimen records, etc.).
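
As a rough illustration of identifying content by hash, here is a minimal Python sketch. It is not the actual implementation of the tools mentioned above: the function names `content_id` and `verify`, the choice of SHA-256, and the `hash://sha256/` prefix are assumptions made for the example.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return an identifier derived only from the bytes themselves."""
    # Assumption for illustration: SHA-256 and a "hash://sha256/" prefix.
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_id: str) -> bool:
    """Check that a copy, wherever it is stored, still matches its identifier."""
    return content_id(data) == expected_id
```

Because the identifier depends only on the bytes, a copy fetched from a mirror or from offline storage verifies against the same identifier as the original download, while any change to the bytes makes `verify` return `False`.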

Anyways, tons to talk and think about, and I'd be open to having a live conversation about this.

The example image of the bat jaw that triggered this issue came up as I was showing a proof-of-concept to Kendra Phelps, a member of the EcoHealth Alliance, and collaborator in a biodiversity data hub, at https://jhpoelen.nl/bats . In this example, only media associated with 100 specimens are shown, only some of which come from Arctos.

Also, I'll make a note of figuring out a way to look for robots.txt and adjust request behavior based on it.
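
One possible way to do that, sketched with Python's standard `urllib.robotparser`: read the `Crawl-delay` value and compute how long to wait before the next request. The helper names `crawl_delay_seconds` and `wait_needed` are hypothetical, and the sample robots.txt text in the usage note is an assumed stand-in, not the actual content of the Arctos file; only the 10-second delay comes from this thread.

```python
import urllib.robotparser


def crawl_delay_seconds(robots_txt: str, user_agent: str, default: float = 0.0) -> float:
    """Return the Crawl-delay that applies to user_agent, or default if none is set."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def wait_needed(last_request_time: float, now: float, delay: float) -> float:
    """Return how many seconds remain before the next request is allowed."""
    return max(0.0, delay - (now - last_request_time))
```

With an assumed file of `User-agent: *` followed by `Crawl-delay: 10`, `crawl_delay_seconds` returns 10.0 for any user agent, and a client that last made a request 3 seconds ago would sleep for the 7 seconds returned by `wait_needed` before the next one.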

jhpoelen commented 2 years ago

@dustymc thanks for helping to make your images accessible. Closing this issue for now, and I am hoping to pick up the discussion around distributing content (and costs) so that images can be preserved across willing institutions and projects without impacting the integrity of the images.