WICG / nav-speculation

Proposal to enable privacy-enhanced preloading
https://wicg.github.io/nav-speculation/
Other
150 stars 35 forks source link

Pattern to exclude any URLs with query strings. #329

Closed aryadhiratara closed 3 weeks ago

aryadhiratara commented 3 weeks ago

Context:

I'm looking for a regular expression pattern to exclude any URLs containing query strings using the Speculation Rules API.

I tested various patterns, but none worked as expected.

This simple pattern works:

{ "not": {"href_matches": "?*"}}

But only if the url with query strings pointed to the same page.

For example,

  Following the example I found in MDN and Chrome’s documentation:

Any idea for the correct regex to exclude any URLs with query strings that will work?  

Found Unexpected Behavior:

I accidentally found that when I use this/*\\?*(^|&) in the not href_match value (it should matches URLs that contain a ?, and optionally, those followed by an & )

{ "not": {"href_matches": "/*\\?*(^|&)"}}

Full code:

<script type="speculationrules">
{
  "prefetch": [{
    "source": "document",
    "where": {
      "and": [
        { "href_matches": "/*" },
        { "not": {"href_matches": "/*\\?*(^|&)"}}
      ]
    },
    "eagerness": "moderate"
  }]
}
</script>

The pattern inadvertently excludes normal URLs without query strings, preventing them from being prefetched while allowing URLs with query strings to be prefetched.

So I decided to test by emptying the 'not' value, and placed the URL in the (include) href_match

{ "href_matches": "/*\\?*(^|&)" }

Full code:

<script type="speculationrules">
{
  "prefetch": [{
    "source": "document",
    "where": {
      "and": [
        { "href_matches": "/*\\?*(^|&)" },
        { "not": {"href_matches": ""}}
      ]
    },
    "eagerness": "moderate"
  }]
}
</script>

It effectively excludes any URL with query strings while allowing normal URLs to be prefetched as intended.

Could someone provide clarification on why this behavior occurs and how to correctly exclude all URLs containing query strings using the Speculation Rules API?

tunetheweb commented 3 weeks ago

This seems to work for me for query strings with an = sign (e.g. https://example.com?a=b):

<script type="speculationrules">
        {
          "prerender": [
            {
              "where": {
                "and": [
                  { "href_matches": "/*" },
                  { "not": { "href_matches": "*\\?*=*" } }
                ]
              }
            }
          ]
        }
</script>

Example here: https://prerender-demos.glitch.me/no-query-params.html

However it doesn't exclude query strings without a value (e.g. https://example.com?a). Not sure if it's possible to do that?

aryadhiratara commented 3 weeks ago

Hi @tunetheweb,

Thank you for the quick respond and suggestion.

Unfortunately, I have tried that pattern. It makes all links, even the normal links (without query string present), it's not prefetched.

In the example that you provided https://prerender-demos.glitch.me/no-query-params.html, this one <a href="[next.html](https://prerender-demos.glitch.me/next.html)">This link should be prerendered</a> is not getting prefetched when hovered.

In my tests, if we add * before the ?, it always make the prefetch not working to all kind of URLs.

Any other suggestion?

Again, thank you :)

domenic commented 3 weeks ago

This is really a question for https://github.com/whatwg/urlpattern.

One thing you may not be aware of is that you don't have to use the string form of URL patterns. You can use the dictionary form too. So I think something like this (not tested) would work:

{ "not": { "href_matches": { "search": "(.+)" } } }

I suggest playing around with the URL Pattern API directly which will make your testing go much faster. Just do

(new URLPattern(whateverYouWouldPutForHrefMatches, { baseURL: document.baseURI })).test(targetURL)
aryadhiratara commented 3 weeks ago

Hi @domenic,

Thank you. I'll check out the URL Pattern API after this.

tunetheweb commented 3 weeks ago

In the example that you provided https://prerender-demos.glitch.me/no-query-params.html, this one This link should be prerendered is not getting prefetched when hovered.

@aryadhiratara I didn't have eagerness set to it defaulted to pointerdown. It's now set and I can see it recognising the URLs and prerendering them on hover.

One thing you may not be aware of is that you don't have to use the string form of URL patterns.

@domenic I did not know that! I will add that to [our docs]https://developer.chrome.com/docs/web-platform/prerender-pages), however I cannot get it to work.

This works:

const pattern = new URLPattern({ search: '(.+)' }, { baseURL: "https://prerender-demos.glitch.me" });
console.log(pattern.test("https://prerender-demos.glitch.me/next.html")); // false
console.log(pattern.test("https://prerender-demos.glitch.me/next.html?query")); // true
console.log(pattern.test("https://prerender-demos.glitch.me/next.html?query=123")); // true

But I tried several variants of that in Speculation Rules and not of it is accepted by Chrome:

"href_matches": { search: "(.+)" }
"href_matches": { "search": "(.+)" }
"href_matches": "{ search: '(.+)' }"
jeremyroman commented 3 weeks ago

The second of those is syntactically closest, but { "search": "(.+)" } is inheriting the pathname from the base URL (the same way that a link to "?foo" would inherit the pathname, so unless you're on next.html it doesn't match.

For example, for all same-origin links with non-empty query strings:

{"href_matches": { "pathname": "/*", "search": "(.+)" }}
{"href_matches": "/*\\?(.+)"}

(The backslash is necessary here because *? has a special meaning in the pattern syntax. Specifically, /foo/*? matches /foo while /foo/* does not and requires a forward slash after foo. So it's necessary to tell the parser to treat it as a ? that divides URL components instead.)

Or if you want absolutely everything with a non-empty query string, specifying the protocol explicitly wipes out inheritance (you can also specify the other components if you like, and in the string form you must):

{"href_matches": { "protocol": "*", "search": "(.+)" }}
{"href_matches": "*:*:*\\?(.+)" }

The reason your comparison didn't work correctly is because baseURL should be in the same object as the other URL components when using the URLPattern constructor; the second object is an "options" dictionary that doesn't have a baseURL field. @domenic should have written:

(new URLPattern({baseURL: document.baseURI, ...whateverYouWouldPutForHrefMatches})).test(targetURL)

At some point I should write a URL pattern explorer tool; despite our efforts to make as many things as possible natural, quirks persist.

tunetheweb commented 3 weeks ago

Thanks! Updated the demo.

I think the key thing I was missing is that it seems like you need parentheses to access full regex functionality (i.e. to get access to the + operator needed in this example to match 1 or more times, rather than 0 or more times). I don't see that explained anywhere on the MDN docs nor the spec. It's kinda there in examples but the parentheses could easily be ignored by those less familiar with this syntax (i.e. me!) as being part of the regex. Raised spec issue and MDN PR.

Anyway, that is definitely a URL Pattern issue and not for here. Confirmed it is all working for Speculation Rules once you know the syntax! So think this issue can be closed now.

jeremyroman commented 3 weeks ago

Just a quick additional note -- if you just want "has no query params", then instead of negating a "has query params" pattern, you could also use "search": "" (or similar in the string syntax) to accomplish that. For example, in your demo:

                "and": [
                  { "href_matches": "/*" },
                  { "not": { "href_matches": "*\\?(.+)" } }

could also be written

{ "href_matches": "/*\\?" }

or

{ "href_matches": { "pathname": "/*", "search": "" } }
aryadhiratara commented 2 weeks ago

Hi @jeremyroman,

Need clarification. So if I need to exclude any URLs with query string, I should use this?

<script type="speculationrules">
{
  "prefetch": [{
    "source": "document",
    "where": {
      "and": [
        { "href_matches": "/*" },
        { "not": { "href_matches": { "pathname": "/*", "search": "" } } }
      ]
    },
    "eagerness": "moderate"
  }]
}
</script>
tunetheweb commented 2 weeks ago

This works too as given in my demo:

  <script type="speculationrules">
        {
          "prefetch": [
            {
              "where": {
                "href_matches": "/*\\?"
              },
              "eagerness": "moderate"
            }
          ]
        }
  </script>
aryadhiratara commented 2 weeks ago

Hi @tunetheweb,

Yes, that's what I'm trying to explain :) It's similar with what I use before

<script type="speculationrules">
{
  "prefetch": [{
    "source": "document",
    "where": {
      "and": [
        { "href_matches": "/*\\?*(^|&)" },
        { "not": {"href_matches": ""}}
      ]
    },
    "eagerness": "moderate"
  }]
}
</script>

It effectively excludes any URL with query strings while allowing normal URLs to be prefetched as intended.

Logically, shouldn't this /*\\? be in the "not href_matches" section?

  If I use this

<script type="speculationrules">
{
  "prefetch": [{
    "source": "document",
    "where": {
      "and": [
        { "href_matches": "/*" },
        { "not": { "href_matches": { "pathname": "/*", "search": "" } } }
      ]
    },
    "eagerness": "moderate"
  }]
}
</script>

the results is the opposite; normal urls are excluded while urls with query strings able to be prefetch.

However if use this

<script type="speculationrules">
{
  "prefetch": [{
    "source": "document",
    "where": {
      "and": [
        {"href_matches": { "pathname": "/*", "search": "" }}
      ]
    },
    "eagerness": "moderate"
  }]
}
</script>

It effectively excludes any URL with query strings while allowing normal URLs to be prefetched as intended.

tunetheweb commented 2 weeks ago

As /*\\? ends on the Query Param (without a * after it) it effectively says it only matches URLs with an empty query param so it;'s the same as: {"href_matches": { "pathname": "/*", "search": "" }}. Try it!

aryadhiratara commented 2 weeks ago

ah, I finally got it. this explanation

it effectively says it only matches URLs with an empty query param

is what missing from my understanding.

Thank you very much @tunetheweb 🙏🏼