DSpace / dspace-angular

DSpace User Interface built on Angular.io
https://wiki.lyrasis.org/display/DSDOC8x/
BSD 3-Clause "New" or "Revised" License
126 stars 418 forks source link

Bitstream download URLs based on the handle of the item and the bitstream filename don't work with accented characters #2727

Closed Marwa1988-S closed 3 months ago

Marwa1988-S commented 8 months ago

When trying to download a bitstream using the URL: /bitstream/handle/[prefix]/[suffix]/[filename] if the filename in the URL contain non-English characters, it redirects to 504 ERROR.

To Reproduce

  1. login to https://demo.dspace.org/home
  2. copy https://demo.dspace.org/bitstream/123456789/263/1/An_Interview_with_Sarah_Asebedo_ć.pdf in the browser

Expected behavior The bitstream file should be open

Try another file with just English characters in the filename: https://demo.dspace.org/bitstream/123456789/263/1/An_Interview_with_Sarah_Asebedo.pdf, it will works.

alanorth commented 8 months ago

I think this is an issue with the Amazon Cloudfront configuration for demo.dspace.org. I uploaded a bitstream with the same name to my DSpace 7.6 test repository and was able to download it.

Screenshot 2024-01-03 at 07-56-26 Morphology and growth of maize IITA research guide No  9-fs8

alexandrevryghem commented 8 months ago

Atmire would like to claim this issue. Currently only these 2 legacy urls are supported:

But we found out that some of our clients also used this format /bitstream/handle/{prefix}/{suffix}/{sequence_id}/{filename}, which currently redirects you to that 404 page

tdonohue commented 8 months ago

@alexandrevryghem : I'm confused by your comment & this ticket description. This ticket description seems to say that the bug is with accented characters, but your comment implies it's not an issue with accented characters, but rather an issue of a missing legacy URL.

I'm OK with Atmire claiming this & will assign to you. But, please update the ticket description/title if the bug is actually a missing legacy URL. That way testers will understand what your (eventual) PR is trying to fix.

alexandrevryghem commented 8 months ago

@tdonohue: I didn't create this ticket so I can't update it but like Alan mentioned it's not related to the accented characters. When you look at the example url that was given by @Marwa1988-S and the format he used /bitstream/handle/[prefix]/[suffix]/[filename] in his description these urls don't match. That's why I assume that this is related to that bug, but maybe it would be best to wait until @Marwa1988-S can confirm this 😅

Marwa1988-S commented 8 months ago

I think this is the correct muster for my URL: /bitstream/{handle-prefix}/{handle-suffix}/{sequence_id}/{filename}

As already described in my description: if the URL does not have an accent character in the file name, you can access it directly, and if not it gives a 504 error. other example: item with handle = 10673/1194 has 2 text files: Testing file.txt, file2_ć.txt This link works: https://demo.dspace.org/bitstream/10673/1194/1/Testing file.txt while this doesn't: https://demo.dspace.org/bitstream/10673/1194/1/file2_ć.txt

I'm not sure if this refers to an issue with legacy url or some configurations.

alexandrevryghem commented 8 months ago

Thnx for the additional info, this is indeed a separate issue. The correct url for downloading that bitstream is actually: https://demo.dspace.org/bitstream/10673/1194/3/file2_%C4%87.txt

Because the sequenceId of that bitstream is 3 (see here), but that also throws a 504 error

MW3000 commented 8 months ago

What is the meaning of sequenceId? And the sequenceID is independent of the bug. The bug also occurs with the correct sequenceID.

alexandrevryghem commented 8 months ago

It's a number that ensured that the URLs were unique for each bitstream because it is possible for two bitstreams on a certain item to have the same name. It indeed seems to be more of a configuration problem (but I didn't check it myself).

tdonohue commented 8 months ago

All, I dug into this a bit more today with help from the team hosting demo.dspace.org.

It looks (to me) like it might be a DSpace UI bug with the redirect from a legacy URL to a new one.. e.g. A legacy URL like this [dspace.ui.url]/bitstream/[handle-prefix]/[handle-suffix]/[sequence]/[filename] or [dspace.ui.url]/bitstream/handle/[handle-prefix]/[handle-suffix]/[filename] Will redirect to the new download URLs like this: [dspace.ui.url]/bitstreams/[bitstream-uuid]/download

For instance, I notice that downloading files with non-English characters works perfectly on the demo site, if you visit the Item page first and click on the filename.

However, if instead you use a direct link using the legacy URL style (from DSpace 6.x), then you will encounter this 504 error. My current guess is this ticket may be related to #1242, which added automatic redirects for these legacy (DSpace 6.x) URLs. It's possible that redirect isn't working properly when it encounters non-English characters.

@alexandrevryghem : This analysis does mean that it might be possible that this ticket is loosely related to the legacy redirect issues you noticed with your clients. It's just that this ticket points out an additional problem...that the legacy redirects don't handle special characters well. Let me know if you'd like to claim this ticket or not.

alexandrevryghem commented 8 months ago

@tdonohue: I was able to reproduce this bug locally with the demo backend. This error is only reproducible when you run Angular in production mode and the fix is simply to encode the filename you send to the backend.

I can create a small PR for this, but maybe it would be cleaner to fix these encoding issues higher up (in RequestParam). I recently also fixed another issue which is also related to encoding parameters and encoding it higher up would fix both issues at once. If you agree that this is better I can still create this quick fix for 7.6.2, but then I could also create an additional fix for 8.0 where all RequestParams are also encoded in that constructor (I would then also remove some hardcoded parameter encoding in order to not encode those parameters twice for example here).

tdonohue commented 8 months ago

@alexandrevryghem : I have to admit, I don't have a strong preference here. Either fix is fine. So, I'd recommend using your best judgement....or you could ping @artlowel to see if he has a strong preference.

tdonohue commented 6 months ago

@alexandrevryghem : Are you still working on this ticket? (Somehow it is no longer assigned to anyone) Just noting this bug came up in a discussion on dspace-tech: https://groups.google.com/g/dspace-tech/c/FO2cPiQWhlA

alexandrevryghem commented 6 months ago

@tdonohue: I can create fix for this if you want, I think that fixing it here for all the RequestParams would maybe be cleaner. The benefit of fixing it there would be that other functionlities would automatically also be fixed like #2724 and it would prevent new functionalities to have the same issue. The only disadvantage is that if you already fixed it by passing the encoded value (like here) you would need to remove the encodeURIComponent, so this might be a "breaking" change

tdonohue commented 6 months ago

@alexandrevryghem : If you have time, that'd be great. I would however like to backport this to 7.x if possible. Even if it requires removing encodeURIComponent calls, I think that's OK to backport...as long as we note that change.

I'll assign this to you officially.

pkmcnc commented 4 months ago

@alexandrevryghem Hi! Can you share the minimum fix for non-English characters in URL? I have doubts about applying commit #2725 to 7.6.1 Thanks!

tdonohue commented 4 months ago

After #2725 was merged, this bug still exists. However, the behavior has changed slightly. This is what I'm seeing when I run in production mode (yarn build:prod; yarn serve:ssr):

  1. First, create an item with two files with no access restrictions. Make sure one has zero accented characters and one has at least one. Here's an example:
    • One bitstream named test.doc
    • One bitstream named test_ć.pdf
  2. Determine the Item's handle (e.g. 123456789/10). Make sure that a basic redirect works properly like this:
  3. Now, try the "handle" style redirect for the first bitstream (test.doc). It should work.
    • Visit http://localhost:4000/bitstream/handle/123456789/10/test.doc and verify the file is downloaded.
    • Or, run curl --head http://localhost:4000/bitstream/handle/123456789/10/test.doc and make sure it returns a 302 redirect. (The 302 redirect is slightly wrong, but that's a different bug ticket #2963 )
  4. Finally, try the same redirect for the accented bitstream (test_ć.pdf) It will NOT work

So, after #2725, special characters in bitstreams no longer return an error. But, they still don't work properly if you attempt to access them via the /bitstream/handle/[prefix]/[suffix]/[bitstream-filename] URL. (They do work though if you just click the download link in the UI)

It's possible however that fixing #2963 will also now fix this bug, as the new behavior almost seems like a redirect failure? Pinging @artlowel and @alexandrevryghem to let them know these two issues might now be related (unverified though).

alexandrevryghem commented 4 months ago

@tdonohue: I retested this myself today and for me the redirect now works correctly on sandbox, I think there might have been a confusion because there are 2 characters that look like a "c" with an accent but they have different unicodes so it may not look like it but: 'ć' !== 'ć' 😅

So currently:

With #2963:

tdonohue commented 4 months ago

@alexandrevryghem : I tried this again today, and it's still not working for me on Sandbox.

I created a new test Item to test with: https://sandbox.dspace.org/handle/10673/1131 It has three files, one with no special characters and the other two having the names you specified above.

Here's my results for curl --head for each:

$ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/test_pdf.pdf
HTTP/2 302
date: Mon, 20 May 2024 15:09:20 GMT
content-type: text/plain; charset=utf-8
content-length: 120
location: https://sandbox.dspace.org/server/api/core/bitstreams/83ebc316-dc6b-45f6-b667-b76ad99a818c/content
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 499
x-ratelimit-reset: 1716217800
cache-control: max-age=604800
link:
vary: Accept

$ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/1test_ć.pdf
HTTP/2 200
date: Mon, 20 May 2024 15:09:45 GMT
content-type: text/html; charset=utf-8
content-length: 360440
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 498
x-ratelimit-reset: 1716217800
cache-control: max-age=604800
link:
etag: W/"57ff8-PNVyJKGADoZ+xlKUSdZLlmLoy4A"
vary: Accept-Encoding

$ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/2test_ć.pdf
HTTP/2 200
date: Mon, 20 May 2024 15:10:12 GMT
content-type: text/html; charset=utf-8
content-length: 360435
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 499
x-ratelimit-reset: 1716217860
cache-control: max-age=604800
link:
etag: W/"57ff3-K0RDKeL2LMu4u9/nzyhTKTtisvk"
vary: Accept-Encoding

So, as you can see, on my end the curl only returns a 302 if there are no special characters. When special characters exist, I'm still seeing a 200 OK

tdonohue commented 4 months ago

@alexandrevryghem : A follow-up. I did notice it works perfectly in my browser now.

If I copy either of these URLs into my browser window, then I briefly see the download page in the UI & I'm properly redirected to download the file.

https://sandbox.dspace.org/bitstream/handle/10673/1131/1test_ć.pdf https://sandbox.dspace.org/bitstream/handle/10673/1131/2test_ć.pdf

It's odd to me though that the curl returns 200 OK, even though the redirect does occur in the browser.

So, this bug ticket appears to be partially fixed. These /bitstream/handle/[prefix]/[suffix]/[filename] download URLs are redirecting & working. It's just they are sometimes returning a 200 OK instead of the expected 302 Redirect

I'll try to test the fixes in #3062 later today to see if they have an impact on this & stop the 200 OK from being returned.

alexandrevryghem commented 4 months ago

@tdonohue: I just retested it now too, but for me it still returns 302 for me so that's weird 😅

→ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/test_pdf.pdf 

HTTP/2 302
date: Mon, 20 May 2024 15:42:21 GMT
content-type: text/plain; charset=utf-8
content-length: 120
location: https://sandbox.dspace.org/server/api/core/bitstreams/83ebc316-dc6b-45f6-b667-b76ad99a818c/content
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 499
x-ratelimit-reset: 1716219780
cache-control: max-age=604800
vary: Accept

→ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/1test_ć.pdf
HTTP/2 302
date: Mon, 20 May 2024 15:42:32 GMT
content-type: text/plain; charset=utf-8
content-length: 120
location: https://sandbox.dspace.org/server/api/core/bitstreams/55021e94-455e-4eab-b1b8-eff9f552b882/content
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 498
x-ratelimit-reset: 1716219780
cache-control: max-age=604800
link:
vary: Accept

→ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/2test_ć.pdf
HTTP/2 302
date: Mon, 20 May 2024 15:42:41 GMT
content-type: text/plain; charset=utf-8
content-length: 120
location: https://sandbox.dspace.org/server/api/core/bitstreams/de373b4a-9959-4c3e-82f5-e855926050d9/content
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 499
x-ratelimit-reset: 1716219780
cache-control: max-age=604800
link:
vary: Accept

Maybe it has to do with our curl version, mine is on 8.6.0:

→ curl --version                                                                                                                                                                                           [17:42:41]
curl 8.6.0 (x86_64-apple-darwin23.0) libcurl/8.6.0 (SecureTransport) LibreSSL/3.3.6 zlib/1.2.12 nghttp2/1.61.0
Release-Date: 2024-01-31
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS GSS-API HSTS HTTP2 HTTPS-proxy IPv6 Kerberos Largefile libz MultiSSL NTLM NTLM_WB SPNEGO SSL threadsafe UnixSocketsc
tdonohue commented 4 months ago

@alexandrevryghem You may be correct that it's something with curl. I seem to be running an older version::

curl 7.81.0 (x86_64-pc-linux-gnu) libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.14

In any case, I'm seeing your point that it appears this ticket might already be "solved". I've got some time in my afternoon to do some more testing, so I'll see if #3062 helps. Either way, if this is working for later versions of curl, that seems good enough for me. I'm hoping that would mean Google Scholar (and similar crawlers) will see this redirect properly instead of seeing the "200 OK".