WhichBrowser / Parser-PHP

Browser sniffing gone too far — A useragent parser library for PHP
http://whichbrowser.net
MIT License

Case-sensitive issue #622

Open AdrienBr opened 3 years ago

AdrienBr commented 3 years ago

Hi, I think there's a case-sensitivity issue, for example with this user agent:

mozilla/5.0 (macintosh; intel mac os x 10.16; rv:84.0) gecko/20100101 firefox/84.0 -> parsing fails
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:84.0) Gecko/20100101 Firefox/84.0 -> parsing works

Can anyone take a look?

Thanks

NielsLeenheer commented 3 years ago

The reason is that user agent strings are case-sensitive. I can't think of one off the top of my head, but there are examples where the case matters for correct identification.

In this case, Firefox is not the same as FireFox or firefox. Only the first indicates a real user agent string from the Firefox browser.
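
To make that concrete, here is a standalone snippet (not Parser-PHP internals; the pattern is made up for illustration) showing how a case-sensitive match separates the three:

```php
<?php

// A case-sensitive pattern only accepts the token as Firefox itself emits it.
$pattern = '/Firefox\/[0-9.]+/';

var_dump((bool) preg_match($pattern, 'Gecko/20100101 Firefox/84.0')); // bool(true)
var_dump((bool) preg_match($pattern, 'gecko/20100101 firefox/84.0')); // bool(false)
var_dump((bool) preg_match($pattern, 'Gecko/20100101 FireFox/84.0')); // bool(false)
```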

summercms commented 3 years ago

@NielsLeenheer

> The reason is that user agent strings are case-sensitive. I can't think of one off the top of my head, but there are examples where the case matters for correct identification.

I'm just wondering if these are really old user agents, like dead browsers and HTTP headers that use mixed case.

I would like to upgrade this repo to use all-lowercase comparisons, and also add some hardening of the incoming user agent to strip illegal characters that could cause a DoS situation. In some tests we noticed that feeding in really long, stupid user agents makes this repo take a very long time to process and causes errors. Sometime in the near future, we are looking at updating the code to address both the performance and the security issues.
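
For the illegal-character part, something like this minimal sketch is what I have in mind (the helper name is made up, not code from this repo):

```php
<?php

// Real user agent strings are plain printable ASCII, so anything outside
// that range can be stripped before the parser ever sees it.
function stripIllegalCharacters(string $ua): string
{
    return preg_replace('/[^\x20-\x7E]/', '', $ua) ?? '';
}
```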

Would love to hear your input on these issues.

NielsLeenheer commented 3 years ago

In my tests, which were a couple of years ago, I found hardly any speed improvement from doing purely lower-case comparisons. Case-insensitive regular expressions already seem to be very well optimised.

The only big improvements came from reducing the total number of regular expressions that need to be evaluated. Most of these changes were done by running one big case-insensitive regular expression to test whether a whole block needs to be evaluated at all. So instead of, let's say, 20 regular expressions you do 21, but only when there is a reasonable expectation that one of those 20 originals is going to match. Otherwise you just do the one, so in most cases you've reduced the number of expressions by 19.
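
Roughly like this (an illustration of the technique, not the actual Parser-PHP code; the patterns are invented for the example):

```php
<?php

$ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:84.0) Gecko/20100101 Firefox/84.0';

// One cheap case-insensitive gate for the whole block.
$gate = '/(?:firefox|fennec|iceweasel)/i';

// The expensive, case-sensitive patterns; imagine ~20 of these in a real block.
$detailed = [
    'Firefox'   => '/Firefox\/([0-9.]+)/',
    'Fennec'    => '/Fennec\/([0-9.]+)/',
    'Iceweasel' => '/Iceweasel\/([0-9.]+)/',
];

// The detailed patterns only run when the gate suggests one of them can match.
if (preg_match($gate, $ua)) {
    foreach ($detailed as $name => $pattern) {
        if (preg_match($pattern, $ua, $match)) {
            echo "$name {$match[1]}\n"; // Firefox 84.0
            break;
        }
    }
}
```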

Why not do everything case-insensitively? You could, but you would get a certain number of extra false positives from fake user agent strings and bots: bots that copied a user agent string, but not perfectly.

In the case of the example above: Firefox will never output a user agent string that looks like this: mozilla/5.0 (macintosh; intel mac os x 10.16; rv:84.0) gecko/20100101 firefox/84.0. So either the user agent string has been corrupted along the way, for example by being stored lower-case, or it is not from Firefox.

So the question is: do we want to add parsing of 'corrupted' strings, with the drawback of adding more false positives?

summercms commented 3 years ago

Thanks for the info; I totally agree with your point about corrupted strings. Better to label them as fake bots than to let them through by converting all the UAs to lower case. I'm not going to bother converting everything to lower case now.

Still playing around with some security settings and adding a maximum length limit to UAs.

This issue is a good one: instead of coming back blank, maybe it should come back saying 'fake Firefox'. I've created PRs that output fake bots, but I never thought about flagging fake browsers until now.

Maybe this would confuse people, though?

NielsLeenheer commented 3 years ago

Adding a sanity check on the length would be a great addition. There is no reason why it should be longer than, for example, 1024 bytes, and even that number is extremely generous.

I'll add it to 2.1.1.
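
A minimal sketch of such a guard (the constant and the helper are placeholders, not the actual 2.1.1 implementation):

```php
<?php

const MAX_USERAGENT_LENGTH = 1024;

// Nothing legitimate is longer than the cap, so truncating protects the
// regex engine from pathologically long input.
function truncateUserAgent(string $ua): string
{
    return strlen($ua) > MAX_USERAGENT_LENGTH
        ? substr($ua, 0, MAX_USERAGENT_LENGTH)
        : $ua;
}
```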

As for detecting fake browsers: I would not bother at this time. It might be a future addition, but I think it might be nicer to add a detection certainty score or something like it. For example, a default of 0, a negative number if we know for certain there is something fishy about the string, and a positive number if multiple headers confirm the detection. And we could do that for the result as a whole, or even for each component.

That way we could even add case-insensitive matching, but mark the result with a score of -1 because the string is slightly corrupted: there may be nothing wrong with it, but it is also not original.

And we could allow users to turn off bot detection as an additional option, so a bot would be detected as the browser it is mimicking, but give it a negative score of something like -2 or even lower, to mark it as suspicious.
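
As a sketch of how that could look (hypothetical names; nothing like this exists in the library yet):

```php
<?php

// Hypothetical shape for the idea above; values mirror the examples in this thread.
class DetectionResult
{
    public string $browser = '';
    public int $certainty = 0;   // 0 = default, negative = fishy, positive = confirmed
    public array $reasons = [];

    public function penalize(int $points, string $reason): void
    {
        $this->certainty -= $points;
        $this->reasons[] = $reason;
    }
}

$result = new DetectionResult();
$result->browser = 'Firefox';

// Matched only after ignoring case: usable, but not an original string.
$result->penalize(1, 'case-corrupted user agent string');

// With bot detection turned off, a known bot could be let through like this:
// $result->penalize(2, 'bot mimicking a browser');
```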