ThaDafinser / UserAgentParserComparison

Comparison results from UserAgentParser
http://thadafinser.github.io/UserAgentParserComparison/
MIT License

Quality check #14

Open · NielsLeenheer opened this issue 8 years ago

NielsLeenheer commented 8 years ago

So far the results of UserAgentParserComparison have been extremely useful for me, but the overview page of the parsers can be very misleading if you take the numbers at face value. Having a result for a particular user agent string does not mean it is a good result. Perhaps having no result would have been better.

I've been thinking about how to test the quality of the various user agent detection libraries. I'm not talking about nitpicking details like the spelling of model names or stuff like that. Even things like 'Opera' vs. 'Opera Mobile' are not that important. I'm talking about clear detection errors.

What I've come up with so far is:

  • Create a handpicked list of tricky user agent strings
  • For each user agent string determine what we're interested in looking at
  • Determine a list of acceptable answers, or known false identifications
  • Check if we have an acceptable match, and do not have a false identification
  • Determine the percentage of correct results

For example, if we have the following string: Opera/9.80 (X11; Linux zvav; U; zh) Presto/2.8.119 Version/11.10

The test would pass if:

  • the browser is identified as Opera Mini

An identification of Opera on Linux would be the obvious mistake.

Another example: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Atom/1.0.19 Chrome/43.0.2357.65 Electron/0.30.7 Safari/537.36

The test would pass if:

  • the application is identified as Atom

This is actually the Atom editor and not the Chrome browser.

And finally: Mozilla/5.0 (Linux; Android 5.0; SAMSUNG SM-N9006 Build/LRX21V) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/2.1 Chrome/34.0.1847.76 Mobile Safari/537.36

The test would pass if:

  • the browser is identified as the Samsung Browser and not as Chrome

I do realise that sometimes results are open to interpretation, but despite that, I think it might be a useful way to identify common problems in libraries and help them raise the quality of their identifications.
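To make this concrete, a rough sketch of what such a check could look like (the fixture format, field names and helper functions here are made up for illustration, not taken from any of the libraries):

```php
<?php

// Hypothetical fixture: for each tricky user agent string we list the
// acceptable answers and the known false identifications for the fields
// we are interested in.
$fixtures = [
    [
        'userAgent' => 'Opera/9.80 (X11; Linux zvav; U; zh) Presto/2.8.119 Version/11.10',
        'accept'    => ['browserName' => ['Opera Mini']],
        'reject'    => ['browserName' => ['Opera']],
    ],
    [
        'userAgent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Atom/1.0.19 Chrome/43.0.2357.65 Electron/0.30.7 Safari/537.36',
        'accept'    => ['browserName' => ['Atom']],
        'reject'    => ['browserName' => ['Chrome']],
    ],
];

// $result is a provider result reduced to plain strings, e.g. ['browserName' => 'Opera'].
function passes(array $fixture, array $result): bool
{
    // We need at least one acceptable value for every field we care about...
    foreach ($fixture['accept'] as $field => $allowed) {
        if (!isset($result[$field]) || !in_array($result[$field], $allowed, true)) {
            return false;
        }
    }

    // ...and none of the known false identifications.
    foreach ($fixture['reject'] ?? [] as $field => $forbidden) {
        if (isset($result[$field]) && in_array($result[$field], $forbidden, true)) {
            return false;
        }
    }

    return true;
}

// Percentage of correct results for one provider.
function score(array $fixtures, array $resultsByUserAgent): float
{
    $passed = 0;

    foreach ($fixtures as $fixture) {
        $result = $resultsByUserAgent[$fixture['userAgent']] ?? [];
        if (passes($fixture, $result)) {
            $passed++;
        }
    }

    return 100 * $passed / count($fixtures);
}
```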

Is this something you want to work on with me?

ThaDafinser commented 8 years ago

@NielsLeenheer thank you for the input. This is exactly where I want to go... :smile:

(I personally also don't like the wrong results)

What I've come up with so far is:

  • Create a handpicked list of tricky user agent strings
  • For each user agent string determine what we're interested in looking at
  • Determine a list of acceptable answers, or known false identifications
  • Check if we have an acceptable match, and do not have a false identification
  • Determine the percentage of correct results

If I understand it correctly, this would be hard to achieve with the amount of data. And it's hard to keep up to date (especially the handpicked list of tricky user agents).

My idea is a bit different, but in many parts similar.

Then we have a percentage which can indicate other, possibly wrong results (based on each value). This approach is not 100% accurate, but it will be easy to maintain and should provide a good indicator of possible wrong results.

This indicator would also appear on the summary front page, to show which providers do real detection.
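Roughly sketched (the details are not worked out yet, and all the names here are made up for illustration), the percentage per provider could be calculated like this: for every user agent, take the harmonized value each provider returns for a field and count how often a provider agrees with the majority:

```php
<?php

// $harmonized[$userAgent][$provider] = harmonized browser name (or null if no result).
// Returns, per provider, the percentage of user agents where it agrees with the
// majority of all providers. A missing result counts as a disagreement.
function agreementPercentages(array $harmonized): array
{
    $agree = [];
    $total = [];

    foreach ($harmonized as $userAgent => $byProvider) {
        $values = array_filter($byProvider, 'is_string');
        if ($values === []) {
            continue;
        }

        // Most common harmonized value across all providers for this user agent.
        $counts = array_count_values($values);
        arsort($counts);
        $majority = key($counts);

        foreach ($byProvider as $provider => $value) {
            $total[$provider] = ($total[$provider] ?? 0) + 1;
            if ($value === $majority) {
                $agree[$provider] = ($agree[$provider] ?? 0) + 1;
            }
        }
    }

    $percentages = [];
    foreach ($total as $provider => $count) {
        $percentages[$provider] = round(100 * ($agree[$provider] ?? 0) / $count, 1);
    }

    return $percentages;
}
```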

I already created some work here (but this is not finished):

  • https://github.com/ThaDafinser/UserAgentParserComparison/tree/master/src/Harmonize
  • https://github.com/ThaDafinser/UserAgentParserComparison/tree/master/src/Entity
  • https://github.com/ThaDafinser/UserAgentParserComparison/blob/master/src/Evaluation/ResultsPerUserAgent.php
  • https://github.com/ThaDafinser/UserAgentParserComparison/blob/master/src/Evaluation/ResultsPerProviderResult.php

NielsLeenheer commented 8 years ago

If I understand it correctly, this would be hard to achieve with the amount of data. And it's hard to keep up to date (especially the handpicked list of tricky user agents).

Yes, this is one of the drawbacks. It will be difficult to compile a list of tricky user agents. I think we should have at least a hundred different strings to start with, to be able to say something meaningful. And the more we can get, the better. I'll gladly spend some time on this.

Then we have a percentage which can indicate other, possibly wrong results (based on each value). This approach is not 100% accurate, but it will be easy to maintain and should provide a good indicator of possible wrong results.

I've considered this approach for a while, but I think there is a fatal flaw with it. Being right isn't a democracy. One may be right and all of the others wrong. I would say this approach is pushing towards harmonisation and not necessarily good quality.

Take for example the Opera Mini user agent string: Opera/9.80 (X11; Linux zvav; U; zh) Presto/2.8.119 Version/11.10

If you look at the results you'll see that everybody except WhichBrowser is fooled. Everybody thinks it is Opera 11 running on Linux, while "zvav" indicates it is the desktop mode of Opera Mini trying to mislead you into thinking it is the desktop version. Rotate every letter of "zvav" by 13 places (ROT13) and you get "mini".
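Just to illustrate the trick (this is only the decoding step, not how WhichBrowser actually does its detection):

```php
<?php

$ua = 'Opera/9.80 (X11; Linux zvav; U; zh) Presto/2.8.119 Version/11.10';

// The supposed architecture token is ROT13-obfuscated: "zvav" decodes to "mini".
if (preg_match('/Linux (\w+);/', $ua, $matches) && str_rot13($matches[1]) === 'mini') {
    echo 'Opera Mini in desktop mode' . PHP_EOL;
}
```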

ThaDafinser commented 8 years ago

I've considered this approach for a while, but I think there is a fatal flaw with it. Being right isn't a democracy. One may be right and all of the others wrong. I would say this approach is pushing towards harmonisation and not necessarily good quality.

You are right... I didn't have that in mind.

Yes, this is one of the drawbacks. It will be difficult to compile a list of tricky user agents. I think we should have at least a hundred different strings to start with, to be able to say something meaningful. And the more we can get, the better. I'll gladly spend some time on this.

You just led me to an (I think awesome) idea...

This comparison already uses (only) the test suites of the parsers, which provide them in an independent way. In those test suites all the expected values are already defined.

So why not reuse them in this case? We would have 100% of all expected values...

In combination with value harmonization it should be possible?
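For example, harmonization could start out as simple as mapping the different spellings the test suites use onto one canonical value (a made-up alias map, not the actual src/Harmonize code):

```php
<?php

// Hypothetical alias map: different test suites spell the same browser
// differently, so expected values have to be harmonized before they can
// be compared with each other.
$browserAliases = [
    'msie'              => 'Internet Explorer',
    'ie'                => 'Internet Explorer',
    'internet explorer' => 'Internet Explorer',
    'opera mobi'        => 'Opera Mobile',
    'opera mobile'      => 'Opera Mobile',
];

function harmonizeBrowserName(?string $name, array $aliases): ?string
{
    if ($name === null) {
        return null;
    }

    $key = strtolower(trim($name));

    return $aliases[$key] ?? $name;
}

echo harmonizeBrowserName('MSIE', $browserAliases), PHP_EOL;    // Internet Explorer
echo harmonizeBrowserName('Firefox', $browserAliases), PHP_EOL; // Firefox (unchanged)
```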

NielsLeenheer commented 8 years ago

This comparison already uses (only) the test suites of the parsers, which provide them in an independent way. In those test suites all the expected values are already defined.

I hadn't thought of this, but I can see some obstacles to overcome. I know that the WhichBrowser test data contains expected results that are flat out wrong. The expected results in my test data are in there to be able to do regression testing, not as an indication of a correct result. I expect this to be true of all of the data sources.

Just think of it like this: we know there are different results between libraries. That must mean some libraries must make mistakes sometimes. We know that every library passes its own test suite. That logically means the expected results of the test suites contain mistakes.

I'm starting to think that a curated list of tests makes more and more sense. We could define a list of, say, 100 strings and call that the "UAmark 2016" test - or something like that. Because it is curated, the test is always the same and results can be tracked over time.

I'm also starting to think it should not contain just tricky user agent strings. It doesn't have to be a torture test. It could contain some basic strings for desktop and mobile, some more exotic ones and some tricky ones. The only thing is, we have to manually confirm the expected results. But I don't think that will be a big problem. And with the combined test suites we have a lot of user agent strings to choose from.

ThaDafinser commented 8 years ago

I hadn't thought of this, but I can see some obstacles to overcome. I know that the WhichBrowser test data contains expected results that are flat out wrong. The expected results in my test data are in there to be able to do regression testing, not as an indication of a correct result. I expect this to be true of all of the data sources.

That could be a little problem... Why not test the regression with real test data? How many test cases are you talking about? (just a rough guess)

Just think of it like this: we know there are different results between libraries. That must mean some libraries must make mistakes sometimes. We know that every library passes its own test suite. That logically means the expected results of the test suites contain mistakes.

It's not always a (direct) mistake. Sometimes (like the example above) it's just not a covered case. So the other test cases are still valid; it just needs a new one for the uncovered case.

I'm starting to think that a curated list of tests makes more and more sense. We could define a list of, say, 100 strings and call that the "UAmark 2016" test - or something like that. Because it is curated, the test is always the same and results can be tracked over time.

I somehow like the idea, but I see some main problems:

  • currently 8751 user agents are used for the result, and only 100 are in this suite
  • if you focus on those 1% you get this "award"
  • what is more important? browser detection? device detection? bot detection?
  • who defines those user agents? since this comparison should be as independent as possible

Maybe we can do a chat about this some time?

NielsLeenheer commented 8 years ago

Why not test the regression with real test data?

Oh, every string is a real user agent string. But the parsing by the library itself may not be perfect yet. And the expected result mirrors the state of the library. It happens regularly that an improvement to the library causes the expected results to change. And that is okay, as long as it is better than before and not unexpected.

currently 8751 user agents are used for the result, and only 100 are in this suite

Yes, but I don't think the limited size is a problem. On the contrary: it allows us to select specific strings that we think are important, instead of having 5000 variants of Chrome in the test suite.

if you focus on those 1% you get this "award"

You should definitely keep the whole test set, but I think this quality mark can be an addition to the overview page.

what is more important? browser detection? device detection? bot detection?

That is a good question. Personally I think browser detection should be most important and maybe some device detection. For bot detection I would use completely different criteria, because just looking at 100 strings won't tell much about how well a library supports detecting bots.

who defines those user agents? since this comparison should be as independent as possible

That is the hardest part, I think. We should start by setting goals, determining rules, and basing the selection on those.

Some off-the-cuff ideas:

Maybe we can do a chat about this some time?

Sure, I'll send you a DM on Twitter to schedule something.

ThaDafinser commented 8 years ago

@NielsLeenheer in the meantime (until we have our meeting), I will add the unit test results to the UserAgent table, so the following information can be displayed on the single UserAgent detail page:

Regardless of the goal we discussed here, I think this is useful.

NielsLeenheer commented 8 years ago

:+1:

Did you get my direct message, BTW?

ThaDafinser commented 8 years ago

Yep, I got it :smile: Just need to get into

ThaDafinser commented 7 years ago

The following steps are necessary to finally get this moving:

At this point, the user agent and its result are checked and added to the user agent detail page (so this is already useful).

For the badge calculation: