geobtaa / geoportal

Big Ten Academic Alliance Geoportal

URI checker and redirects #428

Open Wiscmapper opened 2 years ago

Wiscmapper commented 2 years ago

We use the same code as BTAA for the URI checker developed by @ewlarson. It works great! Something we've uncovered recently isn't necessarily a bug in the URI checker. In short, if the checked link automatically redirects to a different URL, the checked link seems to be flagged as broken. This seems particularly problematic for web apps that are cataloged in Esri Open Data sites.

Here's an example. We have this record in GeoData@Wisconsin: https://geodata.wisc.edu/catalog/WaukeshaCounty-elevation-and-imagery-data-download-by-plss (the contents of this item are automatically grabbed from the Waukesha County Hub site via the Esri API that spits out DCAT-format records)

The link listed in the Waukesha Hub site is: https://data-waukeshacounty.opendata.arcgis.com/apps/WaukeshaCounty::elevation-and-imagery-data-download-by-plss

BUT when you visit the above URL, it automatically redirects to: https://gis2.waukcogeo.com/portal//apps/webappviewer/index.html?id=32f4e52f72d642deb67d0f330db1ab1b

... this redirect is flagged by the URI checker as broken. Is this an Esri issue? Probably! But the question is, can the URI checker be tweaked to deal with these situations?

Wondering if you have observed the same concern with items in the "website" resource class? I did some quick poking around on the BTAA geoportal. I'm curious to know if the following item is flagged by the BTAA URI checker:

https://geo.btaa.org/catalog/99-1200

The source link is https://livingatlas.arcgis.com/, but it auto-redirects to https://livingatlas.arcgis.com/en/home/.
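For illustration, here is a minimal Python sketch of a link check that treats a followed redirect as healthy and records where the link ended up. It is not the URI checker's actual code; `LinkResult`, `check_link`, and the `opener` parameter are hypothetical names used only for this example.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LinkResult:
    requested_url: str
    final_url: Optional[str]  # where the request ended up, if it succeeded
    status: Optional[int]     # HTTP status of the last response, if any
    redirected: bool          # True when final_url differs from requested_url
    ok: bool                  # True only for a final 2xx response


def check_link(
    url: str,
    opener: Callable = urllib.request.urlopen,  # hypothetical hook, swappable in tests
    timeout: float = 10.0,
) -> LinkResult:
    """Fetch url, let urllib follow HTTP redirects, and report the outcome."""
    try:
        # urlopen follows common HTTP redirects (301, 302, 303, 307) itself.
        with opener(url, timeout=timeout) as response:
            final_url = response.geturl()
            status = response.status
    except urllib.error.HTTPError as err:
        # A 4xx/5xx response at any hop is raised as HTTPError.
        return LinkResult(url, None, err.code, False, False)
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout, and similar.
        return LinkResult(url, None, None, False, False)
    return LinkResult(url, final_url, status, final_url != url, 200 <= status < 300)
```

Under this sketch, a link that redirects and lands on a 200 would come back with `ok=True` and `redirected=True`, so it could be reported as "moved" rather than "broken". It only sees HTTP-level redirects; navigation performed by a page's JavaScript is invisible to it.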

ewlarson commented 1 year ago

So! The @Wiscmapper example URL here is all sorts of odd...

The URL does actually return a 404 (Not Found): https://data-waukeshacounty.opendata.arcgis.com/apps/WaukeshaCounty::elevation-and-imagery-data-download-by-plss

Console / Network Tab

[Screenshot, 2023-10-09, 4:22:55 PM]

That 404 somehow (via javascript???) gets you to a page that responds with a 304 (Not Modified). I'll think this over some, but it'll be very hard to tell a link checker not to stop at a 404.

Wiscmapper commented 1 year ago

I think this is another one of those "thanks Esri" kind of situations. TL;DR: yeah, I doubt there is much the link checker can do, this is an Esri thing.

Things have changed somewhat since I originally posted this ~18 months ago. A big part of what we do is scraping DCAT records available from Esri Open Data sites. Today I learned they once again have modified what they spit out for those records.

The "old" url (for example) is apparently now: https://data-waukeshacounty.opendata.arcgis.com/api/feed/dcat-us/1.1.json

while the new is seemingly: https://data-waukeshacounty.opendata.arcgis.com//catalog/dcat-ap/2.0.1.json.

In the "old" method, the resource in question has an "accessURL" of: https://data-waukeshacounty.opendata.arcgis.com/apps/WaukeshaCounty::elevation-and-imagery-data-download-by-plss

while in the new method, the accessURL for the same resource is now listed as: https://data-waukeshacounty.opendata.arcgis.com/datasets/238e325d54fc4ae591bfd5df71574458

Even though the old method gives a 404 as Erik correctly pointed out, both of these links above magically redirect to the correct place... which is yet another URL.
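A rough Python sketch of how the accessURL values from two feed payloads could be compared. The exact layout of either Esri feed is not assumed here: the walker simply collects any string stored under a key ending in `accessURL`, wherever it sits in the JSON. The function names and the tiny sample payloads are made up for illustration; only the two URLs come from the records above.

```python
import json
from typing import Any, Iterator


def iter_access_urls(node: Any) -> Iterator[str]:
    """Yield every string stored under a key ending in 'accessURL'."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key.endswith("accessURL") and isinstance(value, str):
                yield value
            else:
                # Keep descending; the key may be nested arbitrarily deep.
                yield from iter_access_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_access_urls(item)


def diff_access_urls(old_feed: Any, new_feed: Any) -> tuple[set[str], set[str]]:
    """Return (URLs only in the old feed, URLs only in the new feed)."""
    old_urls = set(iter_access_urls(old_feed))
    new_urls = set(iter_access_urls(new_feed))
    return old_urls - new_urls, new_urls - old_urls


# Stand-in payloads, NOT copies of the real feeds.
BASE = "https://data-waukeshacounty.opendata.arcgis.com"
old_feed = json.loads(json.dumps({"dataset": [{"distribution": [
    {"accessURL": BASE + "/apps/WaukeshaCounty::elevation-and-imagery-data-download-by-plss"}
]}]}))
new_feed = json.loads(json.dumps({"dataset": [{"distribution": [
    {"accessURL": BASE + "/datasets/238e325d54fc4ae591bfd5df71574458"}
]}]}))

only_old, only_new = diff_access_urls(old_feed, new_feed)
```

A diff like this only shows that the listed URL changed. It says nothing about whether either URL resolves, so each result would still need to go through the link check.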

Head hurt? Mine too.