3rd-Eden / useragent

Useragent parser for Node.js, ported from browserscope.org
MIT License
897 stars 137 forks source link

Fix parsing Issue 63 https://github.com/3rd-Eden/useragent/issues/63 #79

Open User4martin opened 9 years ago

User4martin commented 9 years ago

Proposed solution to issue 63.

Tested with nearly 3000 agents. (using updated reg-ex)

  CloudyTabs/1.4 CFNetwork/720.1.1 Darwin/14.0.0 (x86_64)
  Safari/10600.4.10.7 CFNetwork/720.2.4 Darwin/14.1.0 (x86_64)
  com.apple.WebKit.WebContent/10600.4.8 CFNetwork/720.2.5 Darwin/14.1.1 (x86_64)

used to be "Mac OS X" version: 10 are now "Mac OS X" version: Network

   MobileSafari/600.1.4 CFNetwork/711.0.6 Darwin/14.0.0  
   Twitter/6.18.3 CFNetwork/711.0.6 Darwin/14.0.0
   Web/1.0 CFNetwork/711.1.12 Darwin/14.0.0

used to be "iOS version: 7 (or 8) are now "iOS" version: Network

The version is wrong. This is because, I kept the behaviour of using the 2nd, 3rd and 4th match, if there is no "os_v1_replacement" (see file update.js, last argument to getReplacement())

Most of them only have "os_replacement":

  - regex: '(CFNetwork)/(5)48\.0\.3.* Darwin/11\.0\.0'
    os_replacement: 'iOS'
  - regex: '(CFNetwork)/(5)48\.(0)\.4.* Darwin/(1)1\.0\.0'
    os_replacement: 'iOS'

Quit a lot of changes. the data is now much better allocated to

  ua.device.family
  ua.device.major
  ua.device.minor

Some example for changes (old => new)

iPhone5,2 / 0 / 0 => Apple / iPhone5,2 / 0       // applies to all iPad, ipod, IPhone
XT1068 / 0 / 0 => Motorola / XT1068 / 0
Samsung SM-T805 / 0 / 0 => Samsung / SM-T805 / 0
Lumia 925 / 0 / 0 => Nokia / Lumia 925 / 0
D6503 / 0 / 0 => Sony / D6503 / 0
BlackBerry 9000 / 0 / 0 => BlackBerry / 9000 / 0
LG- / E610 / 0 => LG / E610 / 0
Kindle Fire / 0 / 0 => Amazon / Kindle Fire / 0
R831K / 0 / 0 => Generic_Android / R831K / 0
ONE TOUCH 4015D / 0 / 0 => Generic_Android / ONE TOUCH 4015D / 0
ALCATEL ONE TOUCH 4030X / 0 / 0 => Generic_Android / ALCATEL ONE TOUCH 4030X / 0

Nokia / 0 / C1-01 => Nokia / C1-01 / C1-01 

The Nokia now has some duplication. Probably due to keeping $2, $3, $4 / see above

3rd-Eden commented 9 years ago

It looks great, but I do wonder about the potential performance impact of the newly introduced function call.

User4martin commented 9 years ago

replaceRegExResult() is only called once when a user-agent is found. And then up to 2 extra times, when "os" or "device" are accessed.

Given that to find the result for each user-agent many regexp against the UA have to be checked, this will IMHO hardly matter.

Accessing "os" or "device" also runs more of the regex matching the UA. So again the time goes to matching those.

It would be possible to reduce the inner loop

for(var i=1; i<=9; i++) { // $1..$9

currently the maximum is $4. Limiting it to far might break future update. But setting it to max 5 might be ok,

If you have an idea how to improve that otherwise, let me know.


If performance is a big issue, here is an idea (but not one I will do any work on) Several regex start the same

parser[0] = new RegExp("(Firefox)
new RegExp("(Fennec)
new RegExp("(Opera

But detecting them must take into account, that the term is not optional (no ? or | ). With a good regex analyser terms can also be found in the middle of the regex.

For those a "guardian" could be created (a regex or indexOf) checking if the term in the regex does exist. If the first such regex is reached, the guardian is run. If the guardian failed all of the guarded regex can be skipped.

User4martin commented 9 years ago

Btw, I thought about avoiding the duplication.

Nokia / 0 / C1-01 => Nokia / C1-01 / C1-01 

I tried keeping track (in update.js) whih $n are used. But then I ended up getting other issues. e.g. LG has a pattern that only uses $2. When the code detected $1 was not used and put it on the first free field, it was replaced by "LG-". Not good.either.

I might still look at something where the fallback stays as it is ($2 in 2nd field, $3 in 3rd. $4 in 4th), but is est to empty if that replacement was used before. That way it will not introduce using $1 or $5... except where they are in the downloaded data.

User4martin commented 9 years ago

I added some more to optimize the speed and remove duplicate fields.

This also pushed another commit, which is unrelated.

Revision: da66fbe56dcfd013013175c967ae1385ba10eb82
Author: User4martin <n>
Date: 11/08/2015 19:44:52
Message: Allow setting values for unknown family or version

That way one can localize the "Other" value (or in my case set it to undefined.


*\ Important This needs a rebuild of the regex file, as the format changes

Revision: 2d7a45ea68ef54ae5d1781c229ef9a9bd6dc0cfd
Author: User4martin <n>
Date: 11/08/2015 21:44:44
Message: check for duplicate replacements / optimize replacements in found result

Does 2 thinks (sorry one commit)

1) For each field it stores what $n needs to be replaced. This avoids calling replace on none existing $n in the live code

2) If a $n was already used in a previous field, then it is not used as default (default then is empty) So if

  - regex: 'Android Application[^\-]+ - (Sony) ?(Ericsson)? (.+) \w+ - '
    device_replacement: '$1 $2'
    brand_replacement: '$1$2'
    model_replacement: '$3'

Makes the 2nd field "$3". The default for the 3rd field is also $3, therefore the 3rd field will be empty.

darcyclarke commented 7 years ago

@3rd-Eden Ping. Is there a reasons this hasn't been merged? Just wondering if this isn't a valid improvement/fix then some feedback could probably move it in the right direction.

3rd-Eden commented 7 years ago

Because I'm enjoying my free time.

darcyclarke commented 7 years ago

@3rd-eden apologize, didn't mean to make that sound as snarky as I'm sure it came across. Could this PR be merged though?