Closed kcondon closed 1 year ago
I can't seem to reproduce the broken metadata links issue (`Error parsing identifier: doi%3A10.70122/FK2/JAKDVW: ':' not found`).
I go through apache: https://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi:10.70122/FK2/WPCTZB, and everything in the metadata tab appears to be working.
Hmm.
[edit: not an issue - the `<protocol>` part of the error message must have gotten eaten because of the angle brackets in the original issue description] Also, as an extra weird piece, trying to pass a double-encoded identifier to the API results in an error message in the log that's different from the one reported above:
Error parsing identifier: doi%3A10.70122/FK2/WPCTZB: '<protocol>:' not found in string
I'm very confused. [edit: don't be]
(we need to look closer into this, there may be situations where this still can be reproduced. like maybe if there is a login page redirect involved, or something like that)
FWIW: The actual PID parsing changed when PermaLinks was added, so we may have different error messages now and/or be more sensitive to strings not getting decoded in the PID recognition code now.
OK, the "different error message" was a non-issue - the error messages in the original description were added as regular text, without the backticks, so the `<protocol>` part must have been dropped on account of the angle quotes.
[sad] Condon, Kevin M reacted to your message:
From: landreev @.> Sent: Thursday, August 24, 2023 6:57:30 PM To: IQSS/dataverse @.> Cc: Condon, Kevin M @.>; Author @.> Subject: Re: [IQSS/dataverse] Payara6: URL encoding has changed so going through Apache or directly to 8080 can break things; export download, privateURL download. (Issue #9797)
not "angle quotes", lol... but you know what I meant.
So, anyway, this is what we are looking at: something double-encoded that ":" character when Kevin was testing it (more than once). The error messages are in `/usr/local/payara6.8.installed/glassfish/domains/domain1/logs/server.log*` on dataverse-internal, and here is one in the apache access log on the day/time the issue was opened:
98.118.33.138 - - [18/Aug/2023:15:50:20 +0000] "GET /api/datasets/export?exporter=OAI_ORE&persistentId=doi%253A10.70122/FK2/JAKDVW HTTP/1.1" 404 108
...but I haven't been able to make it happen for me now. Based on the same access log, it happened after he got bounced to the login page and back to the dataset page. And of course the login page encodes the whole url string, like this:
98.118.33.138 - - [18/Aug/2023:15:50:14 +0000] "POST /loginpage.xhtml?redirectPage=%2Fdataset.xhtml%3FpersistentId%3Ddoi%3A10.70122%2FFK2%2FJAKDVW HTTP/1.1" 200 161
... but, again, I tried it with a draft, reproducing the same login page redirect loop - and it didn't happen for me. OK, I can't spend any more time on this... but, weird, huh?
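For anyone following along, here is a minimal sketch of how a `:` ends up as `%253A`. It is an illustration only, not Dataverse or Apache code; the class and method names are made up. It uses the JDK's `URLEncoder`, which is stricter than whatever did the encoding in the logs above (it also encodes `/`), but the effect on the colon is the same: the first pass turns `:` into `%3A`, the second pass turns the `%` into `%25`, and a single decode on the receiving side only gets back to `%3A`.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustration only: class and method names are hypothetical, not Dataverse code.
class DoubleEncodingDemo {

    // Percent-encode a string once (form-style encoding, so '/' is encoded too).
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Decode a string once, roughly what happens to a query parameter on arrival.
    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String pid = "doi:10.70122/FK2/JAKDVW";
        String once = encode(pid);    // ':' becomes %3A
        String twice = encode(once);  // '%' becomes %25, so %3A becomes %253A
        System.out.println(once);
        System.out.println(twice);
        // A single decode of the double-encoded value still has no literal ':'.
        System.out.println(decode(twice));
    }
}
```

Encoding the whole `/dataset.xhtml?persistentId=doi:10.70122/FK2/JAKDVW` string this way gives the same `redirectPage` value that shows up in the login page log line above.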
I did figure out how to reproduce this, btw. It's not that bad.
Here's what appears to be taking place: the url is requested with `http:`, as in http://dataverse-internal.iq.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.70122/FK2/WPCTZB. It is that extra `http:` -> `https:` redirect that somehow results in the `doi%3A` getting encoded as `doi%253A`. @kcondon stumbled on this while explicitly experimenting with changing the port and/or protocol in the url (see the description). This makes it a little less of a problem, because real life users are unlikely to encounter this under normal operations on an instance configured to sit behind apache w/ https.
However, `doi%3A` and `hdl%3A` in persistent ids are failing in all urls then. I.e., http://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi%3A10.70122/FK2/WPCTZB isn't working either. Rather than trying to figure out why this has started happening under p6, I feel like we should just add a defensive `'%3A' -> ':'` substitution to the persistent id check. I made a draft pr with the fix. It looks like this has already been happening once in a while - I see an occasional `doi%253A` resulting in a 404 in the prod. access logs (probably on account of bookmarked or harvested urls? - idk). So, would be a reasonable fix to add regardless.
The fact that this is happening because of the extra `http:` -> `https:` redirect is by itself weird; because that's done entirely under apache... so not immediately clear how the p6 upgrade would even affect that... But, once again, I'm not sure we want to spend much time figuring out the why part.
The above feels like more text than this problem warranted already.
Tested on payara 6 .8 on java 17: https://github.com/IQSS/dataverse/pull/9764
After running the installer, I found siteURL had :8080 at the end of the fqdn. I removed it later, but the two settings seemed to have opposite effects: private url file download had trouble when siteURL had 8080 and worked once it was removed, while downloading the dataset metadata export had trouble without 8080 and worked when 8080 was there. Might this be a more general parsing/encoding issue that could appear elsewhere?
The errors:
Trying to download a file from a private url with port 8080 set for siteURL: http://dataverse-internal.iq.harvard.edu:8080/privateurl.xhtml?token=f09ccd96-5e3e-4299-a57e-f0ae7dbe5d55
status | "ERROR"
-- | --
code | 403
message | "Not authorized to access this object via this API endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org."
requestUrl | "http://dataverse-internal.iq.harvard.edu:8080/api/v1/access/datafile/24?gbrecs=true"
requestMethod | "GET"

Trying to download dataset export metadata when port 8080 is not set (i.e. going through apache on 80):
{"status":"ERROR","message":"A dataset with the persistentId doi%3A10.70122/FK2/JAKDVW could not be found."}
[2023-08-18T15:47:47.365+0000] [Payara 6.2023.8] [INFO] [] [edu.harvard.iq.dataverse.AbstractGlobalIdServiceBean] [tid: _ThreadID=97 _ThreadName=http-thread-pool::jk-connector(4)] [timeMillis: 1692373667365] [levelValue: 800] [[ Error parsing identifier: doi%3A10.70122/FK2/JAKDVW: ':' not found in string]]
[Update] the export issue seemed to be corrected by adding https rather than http when going through port 80. I'd just removed 8080 from siteURL without adjusting http.
Note that my payara 5 installation uses http:// and port 8080 and it appears to work. Would need to retest to confirm.