Fleshgrinder / php-url-validator

In search of the perfect URL validation regex
MIT License
14 stars 7 forks source link

Username portion causes failure if > 16 characters #4

Open jaydiablo opened 8 years ago

jaydiablo commented 8 years ago

I've looked over the regex and can't seem to figure out why this would happen.

We had a case where php-url-validator was rejecting a URL because the username was too long.

For example, if I put this URL in the test suite, it fails:

https://abcefghijklmnopst:dklsfjkdsfjkldsfdsdsfdsffsdfsafsdafdsafdsafdsasfdsfdsfsdb@api.twilio.com

However, if I reduce the username by 1 character, it passes:

https://abcefghijklmnops:dklsfjkdsfjkldsfdsdsfdsffsdfsafsdafdsafdsafdsasfdsfdsfsdb@api.twilio.com

Any idea why this might be happening? (Notice that the password field can be any length, i.e., this still fails)

https://abcefghijklmnopst:dklsf@api.twilio.com

but this passes:

https://abcefghijklmnops:dklsf@api.twilio.com

I tried changing the named parameter in the regex to something other than "username", just in case, but it still fails if over 16 characters.

jaydiablo commented 8 years ago

My colleague @aripringle was doing some additional testing on this and he discovered that the preg_match call that the URL validator is doing is actually returning false for the URLs that have a username component longer than 16 characters rather than 0.

preg_last_error() shows "2", which apparently is this constant: PREG_BACKTRACK_LIMIT_ERROR.

He was able to make the URL pass validation if he bumped the backtrack limit as high as 5,000,000:

ini_set('pcre.backtrack_limit', 5000000); 1,000,000 wasn't enough, which apparently is the default (https://secure.php.net/manual/en/pcre.configuration.php#ini.pcre.backtrack-limit)

Fleshgrinder commented 8 years ago

This is the result of the current .+ rule for both username and password. Limiting it to the exact range of characters that are allowed in there (unreserved + percentage encoded + colon) would overcome the catastrophic backtracking issue. The thing is, a modern URL validator should not even parse the username and password and simply treat it as user info. Splitting of the user info into parts is up to the software. This is because submitting such information within a URL is a security problem.

In other words, username and password should be removed in favor of user info and the group should be:

(?'user_info'(?:[a-z0-9\-\._~:]|%[a-f0-9][a-f0-9])++@)?

This avoids the backtracking issue and complies better with the actual RFC. If you want to keep support for direct splitting, do the following:

(?:(?'username'(?:[a-z0-9\-\._~]|%[a-f0-9][a-f0-9])++):(?'password'(?:[a-z0-9\-\._~:]|%[a-f0-9][a-f0-9])++)@)?
jaydiablo commented 8 years ago

I tried both of those approaches and still encounter the backtrack limit error (at the default limit of 1,000,000).

Fleshgrinder commented 8 years ago

I will investigate further but I'm on vacation right now and will be back home next week. I'll get back to you then.

jaydiablo commented 8 years ago

No worries, enjoy your vacation!

I've been messing with it a little bit, no progress yet though.

Fleshgrinder commented 8 years ago

An atomic group should solve it:

            (?>
                (?'username'[^@]+)
                (?::(?'password'[^@]+))?
            @)?

I have a very tight schedule these days and not much time. You might want to consider using another, better maintained package instead (e.g. thephpleague/uri although a bit over engineered).

jaydiablo commented 7 years ago

Unfortunately that change doesn't seem to affect the backtracking behavior, I still get the backtrack limit exceeded error on this url:

https://abcefghijklmnopst:dklsffdsafdsafsda@api.twilio.com

I can certainly look at the League URI package. I'm curious how it performs against the URLs in your test suite.

Fleshgrinder commented 7 years ago

Making sure that the atomic group only matches once will help:

            (?>
                (?'username'[^@]+)
                (?::(?'password'[^@]+))?
            @)??

I will overhaul the library a bit and release a new version soon.

jaydiablo commented 7 years ago

Ah, nice! That fixes it.

It doesn't seem necessary for the tests to pass, but would it affect performance/efficiency at all to exclude a colon from the username portion as well?:

            (?>
                (?\'username\'[^@:]+)
                (?::(?\'password\'[^@]+))?
            @)??
Fleshgrinder commented 7 years ago

The RegEx is not very performant nor efficient and the library should not distinguish between username and password but treat it all simply as user info (RFC 3986).

This is what I have now for the next major of this lib:

~^
    (?:([a-z][a-z\d+.-]+):)?
    (?://
        (?:([^@]+)@)?
        (
            (?:xn--[[:alnum:]-]+|(?!..--)[[:alnum:]\x{00a1}-\x{ffff}-]+(?<!-))
            (?:\.(?>xn--[[:alnum:]-]+|(?!..--)[[:alnum:]\x{00a1}-\x{ffff}-]+)(?<!-))*
        |
            \[[\.:\da-f]+\]
        )
        (?::(\d+))?
    )?
    (/(?!/)[^?\#\s[:cntrl:]]*)?
    (?:\?([^?\#\s[:cntrl:]]+))?
    (?:\#([^\s[:cntrl:]]+))?
$~Dixu

It's not fully tested yet but much closer to RFC 3986 and really fast. It parser your https://abcefghijklmnopst:dklsffdsafdsafsda@api.twilio.com in 66 steps in less than a millisecond. The old one takes 427 steps with the fix I provided and more than 4 milliseconds.