IQSS / dataverse

Open source research data repository software
http://dataverse.org

Payara6: URL encoding has changed so going through Apache or directly to 8080 can break things; export download, privateURL download. #9797

Closed kcondon closed 1 year ago

kcondon commented 1 year ago

Tested on Payara 6.2023.8 on Java 17: https://github.com/IQSS/dataverse/pull/9764

After running the installer, I found that siteURL had :8080 at the end of the FQDN. I later removed it, and the two settings seemed to break opposite things: with :8080 in siteURL, private URL file download failed but dataset metadata export download worked; with :8080 removed, private URL file download worked but metadata export download failed. Might this be a more general parsing/encoding issue that could appear elsewhere?

The errors:

  1. Trying to download a file from a private url with port 8080 set for siteURL: http://dataverse-internal.iq.harvard.edu:8080/privateurl.xhtml?token=f09ccd96-5e3e-4299-a57e-f0ae7dbe5d55

    status        | "ERROR"
    code          | 403
    message       | "Not authorized to access this object via this API endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org."
    requestUrl    | "http://dataverse-internal.iq.harvard.edu:8080/api/v1/access/datafile/24?gbrecs=true"
    requestMethod | "GET"
  2. Trying to download dataset export metadata when port 8080 is not set (i.e., going through Apache on 80):

{"status":"ERROR","message":"A dataset with the persistentId doi%3A10.70122/FK2/JAKDVW could not be found."}

[2023-08-18T15:47:47.365+0000] [Payara 6.2023.8] [INFO] [] [edu.harvard.iq.dataverse.AbstractGlobalIdServiceBean] [tid: _ThreadID=97 _ThreadName=http-thread-pool::jk-connector(4)] [timeMillis: 1692373667365] [levelValue: 800] [[ Error parsing identifier: doi%3A10.70122/FK2/JAKDVW: ':' not found in string]]

[Update] The export issue seemed to be corrected by using https rather than http in siteURL when going through Apache instead of port 8080. I had just removed :8080 from siteURL without changing http to https.

Note that my payara 5 installation uses http:// and port 8080 and it appears to work. Would need to retest to confirm.

landreev commented 1 year ago

I can't seem to reproduce the broken metadata links issue (Error parsing identifier: doi%3A10.70122/FK2/JAKDVW: ':' not found). I go through Apache: https://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi:10.70122/FK2/WPCTZB, and everything in the metadata tab appears to be working. Hmm.

landreev commented 1 year ago

[edit: not an issue - the <protocol> part of the error message must have gotten eaten because of the angle brackets in the original issue description] Also, as an extra weird piece: passing a double-encoded identifier to the API results in an error message in the log that's different from the one reported above:

using https://dataverse-internal.iq.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%253A10.70122/FK2/WPCTZB

Error parsing identifier: doi%3A10.70122/FK2/WPCTZB: '<protocol>:' not found in string

I'm very confused. [edit: don't be]

landreev commented 1 year ago

(we need to look closer into this, there may be situations where this still can be reproduced. like maybe if there is a login page redirect involved, or something like that)

qqmyers commented 1 year ago

FWIW: The actual PID parsing changed when PermaLinks was added, so we may have different error messages now and/or be more sensitive to strings not getting decoded in the PID recognition code now.

landreev commented 1 year ago

OK, the "different error message" was a non issue - the error messages in the original description were added as regular text, without the backticks, so the <protocol> part must have been dropped on account of the angle quotes.

kcondon commented 1 year ago

[sad] Condon, Kevin M reacted to your message:



landreev commented 1 year ago

not "angle quotes", lol... but you know what I meant.

landreev commented 1 year ago

So, anyway, this is what we are looking at: something double-encoded that ":" character when Kevin was testing it (more than once). The error messages are in /usr/local/payara6.8.installed/glassfish/domains/domain1/logs/server.log* on dataverse-internal, and here is one in the apache access log on the day/time the issue was opened:

98.118.33.138 - - [18/Aug/2023:15:50:20 +0000] "GET /api/datasets/export?exporter=OAI_ORE&persistentId=doi%253A10.70122/FK2/JAKDVW HTTP/1.1" 404 108

...but I haven't been able to make it happen for me now. Based on the same access log, it happened after he got bounced to the login page and back to the dataset page. And of course the login page encodes the whole url string, like this:

98.118.33.138 - - [18/Aug/2023:15:50:14 +0000] "POST /loginpage.xhtml?redirectPage=%2Fdataset.xhtml%3FpersistentId%3Ddoi%3A10.70122%2FFK2%2FJAKDVW HTTP/1.1" 200 161

... but, again, I tried it with a draft, reproducing the same login page redirect loop - and it didn't happen for me. OK, I can't spend any more time on this... but, weird, huh?
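For illustration only, here is a minimal Java sketch (not taken from the Dataverse code base; the class and method names are hypothetical) of why the double-encoded value in the access log above leads to the reported parse error. It assumes the container decodes the query parameter exactly once, which leaves `doi%3A...` with no literal `:` for the identifier parser to find.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Dataverse.
final class DoubleEncodingDemo {

    // Remove exactly one layer of percent-encoding.
    // Assumption: this mirrors the single decode a servlet container
    // applies to a query parameter before the application sees it.
    static String decodeOnce(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Value of persistentId as it appears in the Apache access log.
        String fromAccessLog = "doi%253A10.70122/FK2/JAKDVW";

        // One decode turns %25 into '%', leaving an encoded colon behind.
        String afterOneDecode = decodeOnce(fromAccessLog);

        // A second decode would be needed to recover the real identifier.
        String afterTwoDecodes = decodeOnce(afterOneDecode);

        System.out.println(afterOneDecode);
        System.out.println(afterTwoDecodes);
    }
}
```

After one decode the string is the same `doi%3A10.70122/FK2/JAKDVW` that shows up in the server.log error message; only the second decode yields `doi:10.70122/FK2/JAKDVW`.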

landreev commented 1 year ago

I did figure out how to reproduce this, btw. It's not that bad.

landreev commented 1 year ago

Here's what appears to be taking place:

Rather than trying to figure out why this has started happening under p6, I feel like we should just add a defensive '%3A' -> ':' substitution to the persistent id check. I made a draft pr with the fix. It looks like this has already been happening once in a while - I see an occasional doi%253A resulting in a 404 in the prod. access logs (probably on account of bookmarked or harvested urls? - idk). So, would be a reasonable fix to add regardless.
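A rough sketch of what such a defensive substitution could look like, in Java. This is not the code from the draft PR: `PidNormalizer` and `normalize` are hypothetical names, and the choice to touch the string only when it contains no literal `:` is an assumption made here to keep well-formed identifiers unchanged.

```java
// Hypothetical helper; not the actual Dataverse fix.
final class PidNormalizer {

    // If the identifier has no literal ':' but does contain an encoded
    // one (%3A or %3a), replace the first occurrence with ':' so that a
    // "<protocol>:" prefix check can succeed.
    static String normalize(String pid) {
        if (pid == null || pid.indexOf(':') >= 0) {
            return pid; // nothing to repair
        }
        return pid.replaceFirst("(?i)%3A", ":");
    }

    public static void main(String[] args) {
        System.out.println(normalize("doi%3A10.70122/FK2/JAKDVW"));
        System.out.println(normalize("doi:10.70122/FK2/JAKDVW"));
    }
}
```

Applied to the value from the error message, `normalize("doi%3A10.70122/FK2/JAKDVW")` yields `doi:10.70122/FK2/JAKDVW`, while an identifier that already contains a colon is returned as-is.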

The fact that this is happening because of the extra http: -> https: redirect is by itself weird; because that's done entirely under apache... so not immediately clear how the p6 upgrade would even affect that... But, once again, I'm not sure we want to spend much time figuring out the why part.

The above feels like more text than this problem warranted already.