duo-labs / py_webauthn

Pythonic WebAuthn 🐍
https://duo-labs.github.io/py_webauthn
BSD 3-Clause "New" or "Revised" License
856 stars 171 forks

Add support for subdomain matches in origins #183

Closed hugorodgerbrown closed 10 months ago

hugorodgerbrown commented 11 months ago

Fixes #182

I have added a new helper function to wrap up validation of expected_origin in the registration and authentication verify functions. This function handles straight str:str matches, str:list[str] matches, and a new subdomain match.

The subdomain match relies on a sentinel value of "*" as a wildcard to indicate that subdomains are supported, e.g. if expected_origin="https://*.example.com" then the verify functions will accept "https://foo.example.com".

| RP ID | Expected origin | Client data origin | Result |
| --- | --- | --- | --- |
| example.com | https://example.com | https://example.com | match |
| example.com | https://*.example.com | https://foo.example.com | wildcard match |
| example.com | https://*.example.com | https://example.com | no wildcard match |

This is the origin of the update:

The Relying Party MUST validate the origin member of the client data.

The Relying Party MUST NOT accept unexpected values of origin.

and

Validation MAY be performed by exact string matching or any other method as needed by the Relying Party.

This PR addresses the "any other method" point by supporting a wildcard in the expected origin value. It doesn't do any structural parsing (scheme, host, domain, port) - it just does a straight str match on the prefix / suffix generated by splitting the expected origin with a wildcard character ("*").

NB This does mean that a simple "*" as the expected origin will match anything.
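The split-on-wildcard approach described above could be sketched as follows (`match_origin` is an illustrative name, not the actual helper added in this PR):

```python
def match_origin(expected_origin: str, client_origin: str) -> bool:
    """Match a client data origin against an expected origin that may
    contain a single "*" wildcard, using plain string prefix/suffix
    checks with no structural URL parsing."""
    if "*" not in expected_origin:
        return expected_origin == client_origin
    prefix, _, suffix = expected_origin.partition("*")
    # The wildcard must consume at least one character, so the client
    # origin has to be strictly longer than prefix + suffix combined.
    return (
        client_origin.startswith(prefix)
        and client_origin.endswith(suffix)
        and len(client_origin) > len(prefix) + len(suffix)
    )
```

Note that, exactly as the NB above warns, a bare `"*"` expected origin makes this accept any non-empty client origin.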

From https://w3c.github.io/webauthn/#sctn-validating-origin

CLAassistant commented 11 months ago

CLA assistant check
All committers have signed the CLA.

MasterKale commented 11 months ago

@hugorodgerbrown Did something happen? The work in here still seemed valid, I only objected to blocking support for localhost URLs when I closed #184. Otherwise I thought the wildcard support was a nice touch.

hugorodgerbrown commented 11 months ago

Hi @MasterKale - apologies, I thought I'd posted a comment but apparently not.

I closed it because I noticed on re-reading that I was looking at the latest version (Editor's Draft), which is (presumably) not published formally.

MasterKale commented 11 months ago

Don't let that dissuade you! The stuff in the Editor's Draft will become L3, and we'll (those of us in the Web Authentication Working Group) start finalizing L3 by the end of the year to become the official next version of WebAuthn. I don't have any hesitation about adding L3 functionality - case in point, the library already supports "hybrid" transport (here), and it knows about authenticatorAttachment too (here), both of which are only in the Editor's Draft of L3.

hugorodgerbrown commented 11 months ago

those of us in the Web Authentication Working Group

Ah - if I'd known your position on that I would have asked some more difficult questions ;-)

I'll re-open - I've spent more time on the spec getting my head around the RP ID vs. expected origin checks (which was also my reservation, I wasn't totally sure I understood what was going on), so it should be tighter this time round.

I'll take a look this evening (UK time) and get it back on track.

hugorodgerbrown commented 10 months ago

I'm going to suggest closing this. The commit I pushed earlier is overly complex, but it highlights the fact that there is no simple answer to wildcards that prevents the "*.com" scenario - every approach ends with the question "how can I parse a subdomain?", which is an SO favourite.

tl;dr you can't without maintaining a list of TLDs.

You could support a very limited subset of allowed wildcard patterns, but then if that's the case you may as well just pass in a long list of acceptable origin values, which is already supported.

It would be nice to support wildcards, but preventing abuse (/mistakes) is problematic.

MasterKale commented 10 months ago

@hugorodgerbrown Thank you very much for attempting to pull this off. I generally agree with your conclusion, so I'm going to take you up on your suggestion to close this PR. Please don't let this dissuade you from trying again if you happen to figure out a better path forward 🙏

Stormheg commented 5 months ago

Apologies for commenting on a closed PR, wanted to do a quick check here before opening a new issue.

The python-fido2 library is doing exactly what has been concluded here: it performs a check against a hardcoded list of TLDs. It uses a hardcoded copy from https://publicsuffix.org/

Relevant code: https://github.com/Yubico/python-fido2/blob/edcb00bc327453cfcf549ef4f12cc25a88b6b0a4/fido2/rpid.py#L44-L74

@MasterKale would you deem this an acceptable solution to implement in py_webauthn? I'm happy to submit a PR, though unsure about any licensing issues that may occur from reusing python-fido2 code.

MasterKale commented 5 months ago

@Stormheg That's an interesting proposal, the code itself seems pretty straightforward, but public_suffix_list.dat is 311KB of plaintext with almost 7700 records...

[Screenshots showing the file size and record count of public_suffix_list.dat]

To me that's too much "weight" to add to this library - it could make using py_webauthn very RAM-intensive, not to mention parsing and iterating through all the records on every registration because this library is not well-positioned to cache and map the parsed values in a way that every RP implementation could benefit from.

I'm willing to talk a bit more about it but right now given these facts I don't think it's a good idea to go down that path here.

Stormheg commented 5 months ago

@MasterKale if the size of public_suffix_list.dat is a concern, removing all comments and empty lines brings it down to about 130kB. Compressing it with gzip --best takes it down to about 40kB.

```
$ cat public_suffix_list.dat.txt | grep -v ^$ | grep -v "^//" > public_suffix_list_clean.dat
$ gzip --best public_suffix_list_clean.dat
$ du -k public_suffix_list*
# size in kB...
304 public_suffix_list.dat.txt
128 public_suffix_list_clean.dat
40  public_suffix_list_clean.dat.gz
```

Packing an extra 40kb sounds reasonable to me.

Django uses a very similar mechanism to check for common passwords. The gzipped list of passwords it uses is 80kB. Compared to python-fido2, it uses a set to store the parsed lines, which makes checking whether it contains a certain value much faster. Relevant source here: https://github.com/django/django/blob/4470c2405c8dbb529501f9d78753e2aa4e9653a2/django/contrib/auth/password_validation.py#L228-L237
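The set-based approach Django takes could be sketched for a suffix list like this (the helper name and file name are illustrative, assuming the cleaned, gzipped list produced by the commands above):

```python
import gzip


def load_suffixes(path: str) -> set[str]:
    """Read a gzipped, newline-separated public suffix list into a set,
    giving O(1) average-case membership checks instead of a linear scan."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Usage:
# suffixes = load_suffixes("public_suffix_list_clean.dat.gz")
# "appspot.com" in suffixes
```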

Granted, Django only reads and parses this file when validating a new password. Like you mentioned, we'd be reading, decompressing, and parsing the file contents much more often without any caching. I agree this isn't desirable; but will it be a bottleneck in practice?

Benchmark

I tried a quick benchmark: on my system, reading, decompressing, parsing, and finding a needle in the gzipped list takes about 0.002 seconds on average. Doing this 10,000 times in a row takes a lot longer if you can't cache results, obviously. Tested on a MacBook Pro M1 Max, Python 3.12.3.

```python
import gzip
import timeit

filename = "public_suffix_list_clean.dat.gz"
needle = "appspot.com"


def read_and_check_suffix():
    # Read, decompress, and parse the whole list on every call,
    # then do a set membership check.
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        suffixes = {line.strip() for line in f}
    return needle in suffixes


if __name__ == "__main__":
    REPEAT = 10_000
    duration = timeit.Timer(read_and_check_suffix).timeit(number=REPEAT)
    print(f"Read {REPEAT} times: {duration:.4f} seconds")
    print(f"Average per cycle: {duration / REPEAT:.5f} seconds")
```

```
$ python benchmark.py
Read 10000 times: 21.1129 seconds
Average per cycle: 0.00211 seconds
```

Overall, I'm not super concerned with performance issues myself. But if the answer is still no, that's acceptable for me too.

MasterKale commented 5 months ago

Interesting, I'd be curious to see how this benchmark runs on a "typical" VPS (I dunno, a cheap DO droplet or the like) vs insanely capable Apple Silicon...

Let me think on this over the weekend and get back to you next week 🤔

jwag956 commented 5 months ago

I took a look at the list - it is changing very frequently - so you definitely don't want to include it in this package. However, you could possibly add the feature to check IF the application has configured it. One way to do this is a pattern I have used in Flask-Security: add a base class that implements the required methods that py_webauthn would need to call (such as 'get list' or 'iterate list'). The application can then subclass it and pass it in as part of initialization. That way, py_webauthn isn't assuming anything about how the list is maintained nor how it is stored/accessed by the application. And the default is 'no list'.
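The provider pattern described above could look something like this (all names are illustrative, not an actual py_webauthn API):

```python
from abc import ABC, abstractmethod
from typing import Iterable


class SuffixListProvider(ABC):
    """Interface the application implements to supply a public suffix
    list; the library ships no list and assumes nothing about how the
    list is maintained, stored, or refreshed."""

    @abstractmethod
    def iter_suffixes(self) -> Iterable[str]:
        """Yield known public suffixes, e.g. 'appspot.com'."""


class InMemorySuffixList(SuffixListProvider):
    """Trivial provider backed by an in-memory set."""

    def __init__(self, suffixes: Iterable[str]) -> None:
        self._suffixes = set(suffixes)

    def iter_suffixes(self) -> Iterable[str]:
        return iter(self._suffixes)


# The library would accept an optional provider at initialization and
# default to None, i.e. no suffix-based wildcard validation.
```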

Stormheg commented 5 months ago

Fair point, I had not noticed the public suffix list evolves this fast. Though all recent changes appear to mostly add new entries for public suffixes similar to the *.compute.amazonaws.com example on https://html.spec.whatwg.org/multipage/browsers.html#is-a-registrable-domain-suffix-of-or-is-equal-to.

I suppose shipping an outdated list could pose a problem because we'd allow the origin check to pass when it should not according to the latest public suffix list. Not sure if that is enough of a con to completely scrap the idea of shipping a list. It appears to be good enough for the developers of python-fido2, and I'd be somewhat inclined to agree. 100% accuracy is a lot harder to achieve; working off a semi-recent list may be good enough?

Notably, python-fido2 also supports passing a custom callable parameter to override the origin check with your own if the default validation doesn't suit, see https://github.com/Yubico/python-fido2/blob/8b979313e9ac187cfa06f0882a3050f40868bcce/fido2/server.py#L147

For my own purposes, I've opted to just pass py_webauthn a list of all expected origins, as the current API expects. More configuration for me, but it is always accurate, and it works well enough. That said, I would appreciate a way to override the check like python-fido2 allows - that way I could implement my own origin verification mechanism (and potentially shoot myself in the foot by implementing it badly? 😆)
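An override hook in the spirit of python-fido2's verify_origin callable could be sketched like this (the function and parameter names here are hypothetical, not py_webauthn's actual API):

```python
from typing import Callable, List, Optional


def verify_expected_origin(
    expected_origin: List[str],
    client_origin: str,
    verify_origin: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Exact-match the client data origin against a list of expected
    origins, unless the caller supplies their own verification callable,
    in which case that callable decides."""
    if verify_origin is not None:
        return verify_origin(client_origin)
    return client_origin in expected_origin
```

An RP that wanted wildcard behavior could then pass its own callable and take responsibility for the "*.com" class of mistakes itself, keeping the library's default strict.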

Food for thought - would love to hear what conclusion you arrive at Matthew.