DSpace / dspace-angular

DSpace User Interface built on Angular.io
https://wiki.lyrasis.org/display/DSDOC8x/
BSD 3-Clause "New" or "Revised" License

Old 6.x Bitstream URL paths are redirecting with a 302 (temporary) instead of 301 (permanent) #2963

Closed (tdonohue closed this issue 5 months ago)

tdonohue commented 6 months ago

Describe the bug: Reported by the Google Scholar team. This is a follow-up to the work in #2331.

In #2331, we fixed an issue so that URLs of this format now return a 301 redirect to the proper DSpace 7+ URL:

[dspace.ui.url]/handle/[prefix]/[suffix]

For example, this correctly returns a 301 redirect (on either the Demo or Sandbox site):

curl --head https://sandbox.dspace.org/handle/123456789/258

However, Bitstream URLs of a similar format do NOT return a 301 redirect:

[dspace.ui.url]/bitstream/handle/[prefix]/[suffix]/[filename]

For example, this incorrectly returns a 302 redirect:

curl --head https://sandbox.dspace.org/bitstream/handle/123456789/258/Money%20and%20Emerging%20Adults.pdf
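A minimal sketch of how the two legacy path shapes described above could be told apart. The names `matchLegacyPath` and `LegacyMatch` are hypothetical and are not taken from dspace-angular; only the two URL patterns come from this issue.

```typescript
// Hypothetical helper for illustration; not part of dspace-angular.
type LegacyMatch =
  | { kind: 'item'; handle: string }
  | { kind: 'bitstream'; handle: string; filename: string };

// /handle/[prefix]/[suffix]
const LEGACY_ITEM = /^\/handle\/([^/]+)\/([^/]+)\/?$/;
// /bitstream/handle/[prefix]/[suffix]/[filename]
const LEGACY_BITSTREAM = /^\/bitstream\/handle\/([^/]+)\/([^/]+)\/(.+)$/;

function matchLegacyPath(path: string): LegacyMatch | null {
  const bitstream = LEGACY_BITSTREAM.exec(path);
  if (bitstream) {
    return {
      kind: 'bitstream',
      handle: `${bitstream[1]}/${bitstream[2]}`,
      // The filename arrives percent-encoded (e.g. %20 for a space).
      filename: decodeURIComponent(bitstream[3]),
    };
  }
  const item = LEGACY_ITEM.exec(path);
  if (item) {
    return { kind: 'item', handle: `${item[1]}/${item[2]}` };
  }
  // Anything else (including modern /bitstreams/[uuid]/download paths) is not legacy.
  return null;
}
```

Both shapes would need to be matched for every older-style URL to receive the same permanent redirect.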

Expected behavior: All older-style DSpace 6.x URL patterns should return a 301 redirect to the new DSpace 7.x URL. This fix will need to be backported to 7.x for the 7.6.2 release.

Related work: Related to #2331 and #1242

artlowel commented 6 months ago

@tdonohue it doesn't look like the legacy URL redirects at all. Instead, it retrieves the bitstream using the byItemHandle endpoint and renders the same bitstream download component you'd find at /bitstreams/${uuid}/download. That component is what issues the 302, redirecting to the /content link for that bitstream on the backend.

You can verify that's the case by testing the modern URL for the bitstream you used as an example:

curl --head https://sandbox.dspace.org/bitstreams/aa2c8a3f-22a3-4820-8360-26cf74874428/download

It will also return a 302.

So there are two potential issues:

1. You could argue that the legacy URL should redirect to the modern URL with a 301, instead of functioning as an alternative to it, to discourage people from using the legacy URLs.

2. I don't think we can use a 301 to redirect the modern UI URL to the backend /content URL in all cases, because for logged-in users it will contain a short-lived token. We could theoretically do it only for unauthenticated users, but then if the bitstream uses a lease (or if the administrators simply change the access conditions over time), a 301 link that some crawler has indexed would likely stop working. So I'd be inclined to keep those at a 302.

If you agree, we can claim this ticket and change only the legacy URLs so that they redirect to the modern URLs with a 301 (after which another 302 redirect will follow to the backend /content endpoint).
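The two-step chain proposed here (legacy URL, 301 to the modern UI URL, then 302 to the backend /content link) can be modeled roughly as follows. This is an illustrative sketch, not the dspace-angular implementation: `nextHop`, `followChain`, and the `resolveUuid` and `contentUrl` callbacks are hypothetical names, and the backend URL used in the usage example is a placeholder.

```typescript
// Illustrative model only; all names are hypothetical, not dspace-angular code.
interface Redirect {
  status: 301 | 302;
  location: string;
}

const LEGACY_PATH = /^\/bitstream\/handle\/[^/]+\/[^/]+\/.+$/;
const MODERN_PATH = /^\/bitstreams\/([0-9a-f-]+)\/download$/;

function nextHop(
  path: string,
  resolveUuid: (legacyPath: string) => string | undefined, // assumed lookup by handle and filename
  contentUrl: (uuid: string) => string,                    // assumed builder for the backend /content link
): Redirect | null {
  if (LEGACY_PATH.test(path)) {
    const uuid = resolveUuid(path);
    // Permanent: the legacy URL should no longer be used.
    return uuid ? { status: 301, location: `/bitstreams/${uuid}/download` } : null;
  }
  const modern = MODERN_PATH.exec(path);
  if (modern) {
    // Temporary: the target may carry a short-lived token or change with access conditions.
    return { status: 302, location: contentUrl(modern[1]) };
  }
  return null;
}

function followChain(
  start: string,
  resolveUuid: (legacyPath: string) => string | undefined,
  contentUrl: (uuid: string) => string,
): Redirect[] {
  const hops: Redirect[] = [];
  let path = start;
  // Cap the number of hops so a misconfigured chain cannot loop forever.
  for (let i = 0; i < 5; i++) {
    const hop = nextHop(path, resolveUuid, contentUrl);
    if (!hop) { break; }
    hops.push(hop);
    path = hop.location;
  }
  return hops;
}

// Usage with placeholder values:
const hops = followChain(
  '/bitstream/handle/123456789/258/Money%20and%20Emerging%20Adults.pdf',
  () => 'aa2c8a3f-22a3-4820-8360-26cf74874428',
  (uuid) => `https://backend.example.org/bitstreams/${uuid}/content`,
);
console.log(hops.map((h) => h.status));
```

Keeping the second hop at 302 means only the legacy-to-modern step is advertised as permanent.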

tdonohue commented 6 months ago

@artlowel: Yes, I think the goal here should simply be to discourage people from using the legacy URLs (and to discourage search engines from keeping them in their search results). From my understanding, search engines keep links that return a 302, because that's a temporary redirect. But they discard and replace links that return a 301, because that tells them the old URL is no longer valid.
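As a deliberately simplified model of the behavior described above (not a description of how any real search engine works), a crawler that replaces a URL only across permanent redirects might pick the URL to keep like this. `Hop` and `urlToIndex` are hypothetical names, and the URLs in the usage example are placeholders.

```typescript
// Simplified, hypothetical model of a crawler's choice; real crawlers are more complex.
interface Hop {
  from: string;
  status: number;
  to: string;
}

function urlToIndex(start: string, hops: Hop[]): string {
  let current = start;
  for (const hop of hops) {
    // Stop if the chain is broken or the redirect is not permanent.
    if (hop.from !== current || hop.status !== 301) { break; }
    // A 301 says the old URL is no longer valid, so replace it with the target.
    current = hop.to;
  }
  return current;
}

// Usage: legacy URL, 301 to the modern URL, then 302 to the backend content link.
const indexed = urlToIndex('/bitstream/handle/123456789/258/file.pdf', [
  { from: '/bitstream/handle/123456789/258/file.pdf', status: 301, to: '/bitstreams/some-uuid/download' },
  { from: '/bitstreams/some-uuid/download', status: 302, to: 'https://backend.example.org/content' },
]);
console.log(indexed);
```

Under this model, the legacy URL is replaced by the modern UI URL, while the temporary backend link is never indexed.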

So, I think the solution here is that the legacy URL should return a 301 and redirect to the new URL. That new URL can still return a 302 to pass the user the content from the backend. If we find this process is problematic in some manner, we can bring this back to Google Scholar. But, the request from them was simply to ensure that all our legacy URLs return a 301 to point them at the new URLs.

I'll assign this to you. Thanks!