NCEAS / metacatui

MetacatUI: A client-side web interface for DataONE data repositories
https://nceas.github.io/metacatui
Apache License 2.0
42 stars 27 forks source link

Assume encoded identifiers in MetacatUI routes #523

Closed csjx closed 4 years ago

csjx commented 6 years ago

Jeanette pointed out an issue rendering a particular package that has curly braces in the identifier:

https://search.dataone.org/#view/{6A218D8B-05CD-49EA-AC13-FD0372B3B1D4}

When the pid is properly encoded, the following does not render:

https://search.dataone.org/#view/%7B6A218D8B-05CD-49EA-AC13-FD0372B3B1D4%7D

Dave has pointed out that we should be handling identifiers uniformly across all IRIs, and that the internal routing within MetacatUI should do so as well. When interacting with the DataONE API, we do in fact call encodeURIComponent(), but when a route like #view receives an identifier, we don't call decodeURIComponent(), but rather we assume the string is the exact identifier.

Change the routing within MetacatUI to always assume identifiers are encoded in URL contexts by first calling decodeURIComponent() for all routes being passed identifiers before processing the call. When we move to using HTML5 pushState, assume identifiers are encoded as well.

laurenwalker commented 4 years ago

This is something we've wanted to do for a while, but it will break all current view/ URLs that don't have the identifier encoded. Also, all of the DOI registrations will need to be updated, yeah?

amoeba commented 4 years ago

Possibly, but I wonder if maybe it wouldn't be so bad, @laurenwalker. Could you check my thinking to see if it makes sense:

De-encoding strings via decodeURIComponent has a helpful property in that it stops producing a different string once the string is decoded,

"doi:10.123/ABC" === decodeURIComponent("doi:10.123/ABC") // true

So I think we can just blindly decode whatever comes in for some routes, like /view. As an side, a string can be encoded multiple times and you can just keep decoding it until the result stops changing:

"doi:10.123/ABC" === decodeURIComponent(decodeURIComponent(encodeURIComponent(encodeURIComponent("doi:10.123/ABC")))) // true

So I think we can use the above to effectively detect if we have a legacy URL or not. For /view/{ID} this is even simpler since we can just take what's after /view/, decode it, and work with it. Of course, we then have to be careful if we wanna add a another path part to /view like /view/{ID}/{eml-id}.

laurenwalker commented 4 years ago

Ah, thanks Bryce, I didn't think of it that way. Right on!

amoeba commented 4 years ago

I've made an initial pass and like where this is going. In order to simplify things and avoid bugs, my approach is to separate identifier handling into two sides:

  1. Web/URL-side: Identifiers are always URL encoded
    • Backbone Routers always assume identifiers need decoding before passing onward.
    • Views that construct URLs always URLEncode all URL components that may contain characters that should be encoded.
  2. Application side: Identifiers are always un-encoded. If your code is this.get("id"), it's the literal identifier and not an encoded version.

Due to properties of URL encoding/decoding we do induce any breakage in URLs to our content because existing URLs (bookmarks, DOI registrations) using non-encoded PIDs still work. This is because unencodedPID === decodeURLComponent(unencodedPID). We probably should update our registrations or at least change our DOI registration code going forward.

So far I've tested the search view, metadata view, user view, and editor. I found about three of our patterns we use in our codebase that I should be able to search for and refactor.

amoeba commented 4 years ago

Making good but slow progress. I've found about seven or so patterns for how we reference identifiers when building URLs and I'm still finding more here and there as I go. I think this is still quite feasible and testing so far is going very well.

@laurenwalker, @csjx what do you think about maybe asking the data team if we can deploy this on test.arcticdata.io for maybe a week or two before we merge to release just to give us more confidence this isn't breaking things?

laurenwalker commented 4 years ago

We could deploy to my VM fancy-vulture, pointing to test.arcticdata, since the data team doesn't use that

amoeba commented 4 years ago

Well, my thinking was that the data team would help us find remaining bugs.

laurenwalker commented 4 years ago

Oh, I see what you're saying. Yeah let's run it by Jeanette

laurenwalker commented 4 years ago

I tested this a bit today and I didn't run into any major issues! I found one bbug and have two thoughts:

Bug

In provenance charts: when I click the "View" button in a provenance node popover, the window jumps down to the inner-page anchor for that object (expected), but then it redirects to a /view route for that pid. It should only jump to the anchor.

Example: Dataset https://test.arcticdata.io/view/urn%3Auuid%3A66ad360a-30c6-4daa-984b-61910f245fde Click on the derivation node for Data4.csv and click the "View" button for Datat5.csv

Thoughts

Should we encode ORCIDs and DNs in /profile routes? I would think we should.

I think it would be a good idea to URL-encode the identifier in the browser location when you go to a /view URL with an unencoded pid. That way, it looks like a redirect. And if the user copies the URL from their browser, they will have the new encoded URL.

amoeba commented 4 years ago

Hey @laurenwalker , thanks for testing. That bug may have been introduced when I fixed an unrelated bug in the prov chart view in 1d88b3703c5058af6192f596050936a974b74ecf. It must have worked fine before because the link was broken and fixing it broke the behavior. Looks like the click handler needs a preventdefault or we need to convert this to using HTML anchors properly. I'll look into that.

Re: Thoughts, I stuck with just encoding object identifiers to keep on task here and made notes as I went of places in the application where I wanted to discuss other encoding changes. I totally think we should follow this encoding/decoding pattern application wide. Places that popped out were:

What do you think about adding that into this work? I think it'd be feasible.

amoeba commented 4 years ago

FYI: I just pushed up a bug fix for the ProvChartView bug I discovered/introduced: 4450b0331d247bf34f02d48c05b617862a291367. Now I get nice scrolling to other entities.

laurenwalker commented 4 years ago

Hey Bryce, I was reviewing this branch today and it looks good, and since we've tested on test.arcticdata.io for a while, I think it's ready to merge into the dev branch, so I'll do that today.

For these items:

References to the accounts service. Should these be properly encoded? e.g., anywhere we call "MetacatUI.appModel.get("accountsUrl") or similar e.g., https://github.com/NCEAS/metacatui/blob/dev-2.11-encoding/src/js/views/GroupListView.js#L65:L65 MDQ portions of the codebase doesn't encoding/decode things like suiteId and also follow an old URL structure. e.g., MetacatUI.uiRouter.navigate("quality/s=" + suiteId + "/" + encodeURIComponent(this.pid), {trigger: false}); Portal code doesn't encode things like label and portalSection. Is there a chance either or both could need encoding? e.g., MetacatUI.appModel.get("portalTermPlural") + "/" + label + "/" + portalSection);

Yes, I think we should do all of those. But they don't need to be completed at the same time as this work, so I will create a new ticket for each and we can work them in after we finish up the EML 2.2.0 editing.

amoeba commented 3 years ago

I just wanted to drop a note in here in case we come back to encoding/decoding issues again. Above I commented that it seemed like you can detect whether part of a URL is encoded or not by comparing its once-decoded and twice-decoded values. After a good discussion today and a more careful re-read of IETF RFC 3986, I saw this:

Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string. -- https://tools.ietf.org/html/rfc3986#section-2.1

This is relevant to us because our identifiers can include what appears to a percent-encoded character. Earlier on, the RFC talks about a URI being "produced", at which point percent-encoding of reserved characters is assumed. And that clients handling URIs should always be decoding. This is all to say that I now think better of my idea.