allinurl / goaccess

GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
https://goaccess.io
MIT License

Majority of clients are Unknown #152

Open da2x opened 10 years ago

da2x commented 10 years ago

Just a general bug tracking what all my pull requests have been about.

For my own sites, I am still at 62 % unknown for OS and 42 % unknown for browsers.

allinurl commented 10 years ago

This is probably one of those things that depends on the type of traffic the site gets. I'm thinking that perhaps adding a panel or a dialog that displays all the unknown user agents would help the user to expand the list, if needed.
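
A minimal sketch of that idea, written outside GoAccess itself: tally the user agents that a matcher fails to classify so the most frequent ones can be reviewed and added to the list. The `KNOWN_TOKENS` tuple and the `top_unknown_agents` helper are hypothetical names for illustration, not GoAccess's actual lists or API.

```python
from collections import Counter

# Hypothetical, deliberately tiny token list; a real list would be much longer.
KNOWN_TOKENS = ("Firefox", "Chrome", "Safari", "MSIE", "Opera")


def is_known(user_agent):
    """Return True if any known browser token appears in the user agent."""
    return any(token in user_agent for token in KNOWN_TOKENS)


def top_unknown_agents(user_agents, limit=10):
    """Count the unmatched user agents and return the most frequent ones."""
    unknown = Counter(ua for ua in user_agents if not is_known(ua))
    return unknown.most_common(limit)


# Made-up sample input: one recognized browser and three unrecognized hits.
sample = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0",
    "NerdyBot",
    "WinHTTP",
    "NerdyBot",
]
for agent, hits in top_unknown_agents(sample):
    print(hits, agent)
```

Sorting by hit count keeps the review list short: a handful of frequent agents usually accounts for most of the "Unknown" share.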

cganterh commented 9 years ago

Same problem here.

ghost commented 9 years ago

and here

allinurl commented 9 years ago

I'll look further into this. Thanks.

aphorise commented 9 years ago

@daniel-gomes-sociomantic, @Aeyoun & @cganterh - can you kindly provide a sample of the User-Agent (UA) strings, and the related OS / browser, that you're dealing with but that are not showing?

It would be great to see UA strings that a human would reasonably assume belong to an OS and/or browser, yet are not being parsed, categorised and understood as expected.

cganterh commented 9 years ago

Sorry, I haven't used this software lately.

2vek commented 9 years ago

@aphorise Here is a sample from my unfinished site's access_log and goaccess config to parse it.

Looks to me Baiduspider(Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)) is getting categorized into unknown OS.

aphorise commented 9 years ago

@2vek - thank you for the sample. Out of interest, does the BlackBerry9000 device / OS show, and are all the UAs that you've provided (4 variants, I think) not showing?

allinurl commented 9 years ago

Thanks for posting this.

Baiduspider is shown as unknown because the majority of crawlers listed here are not listed under the OS list, since we are not sure what OS their bot is running on. At the moment, I'm not entirely sure how to categorize crawlers under the OS panel. Any thoughts on this?

aphorise commented 9 years ago

@allinurl - I'd say (related to #10) - if we settle on or use a single UA-DB, which would hold all browser : device : other-ua entries for lookups, then the other-ua portion will predominantly be the service/bot/crawler agents as per what's listed in the public directories.

I think services, bots & crawlers are all fitting appellations, subject to what's matched / recognized. The slight difference between bots & crawlers is that the latter tend to be search-engine specific.

allinurl commented 9 years ago

@aphorise the UA-DB discussed on #10 sounds like it would be an interesting approach, however, now that I think about it, I'm curious how UA versioning would work...

aphorise commented 9 years ago

@allinurl - if I've not misunderstood you - there would be no versioning, only a comprehensive and complete listing. So where a UA-DB is present, only that is used, and even the current conditions you have (regex style, as per what's in browsers.c & opesys.c), when compiled into the same list, would be one (1x) of X conditions. The only version difference would be that of the UA-DB, which would naturally gain more records in the future. Where there is no UA-DB, the current approach / conditional checks that you have can work fine, or an in-memory build of it that compiles to a complete list / directory (hash-table) of all permutations, in case of an unfulfilled targeted match (by device:os:browser in UA-DB).

I'm going to see if I can mock something standalone around this & earlier discussed (hashing) ideas.

allinurl commented 9 years ago

@aphorise a mock will be great :+1:. Thanks for clarifying this a bit more.

2vek commented 9 years ago

@aphorise all requests for BlackBerry9000 show under the "others" section in OS. That log does contain a few UAs that are showing up fine. I admit I was not very thorough in filtering.

@allinurl putting Baiduspider in the others section should be fine. I think the OS of a bot should not matter to the end user.

da2x commented 9 years ago

Baidu is a search engine bot.

da2x commented 9 years ago

(Not posting URIs because some are spamish.) Comments after #-symbol.

Some unknown user agents by category (1000 hits or more in the last 48 hours):

**Feed readers:**
AppleNewsBot
Feedbin feed-id:<int> - <int> subscribers
Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:<int>
Mozilla 5.0 (compatible; Feedio.co Feed Crawler/1.0; +<uri>)
Mozilla/5.0 (compatible; OperaDiscoverBot/2015.01; <uri>)
Mozilla/5.0 (compatible; inoreader.com-like FeedFetcher-Google)
alertmix crawler/1.0 (a news crawler; <uri>; <email>)
Mozilla/5.0 (compatible; theoldreader.com; <int> subscribers; feed-id=<hash>)
Digg Feed Fetcher 1.0 (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
FeedBurner/1.0 (<uri>)

**Others (high number of requests):**
Mozilla/5.0 (compatible; spbot/4.4.2; +<uri> )
Go 1.1 package http
Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +<uri>)
Mozilla/5.0 (compatible; Gluten Free Crawler/1.0; +<uri>)
Mozilla/5.0 (compatible; MJ12bot/v1.4.5; <uri>?+)
Google favicon
Mozilla/5.0 (Windows NT 10.0; Trident/7.0; FunWebProducts; yie9; rv:11.0) like Gecko
NerdyBot
Microsoft-WNS/10.0 -- Fetches Live Tiles for Windows 10's Start Menu. Almost an RSS reader? Kind of.
com.apple.Safari.SearchHelper/11601.2.3 CFNetwork/760.1.2 Darwin/15.0.0 (x86_64) # [OpenSearch in Safari](https://www.aeyoun.com/webdev/safari-quick-website-search.html)
Y!J-ASR/0.1 crawler (<uri>)
WinHTTP
Sogou web spider/4.0(+<uri>)
Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +<uri>)
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36
ltx71 - (<uri>
Qwantify/1.0
NetLyzer FastProbe (See <uri> for info))

**Browsers:**
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.8.1000 Chrome/30.0.1599.101 Safari/537.36

**Advertising:**
Mediapartners  # Google AdSense, 10 000 hits a month
ADmantX Platform Semantic Analyzer Appnexus - ADmantX Inc. - <uri> - <email>
Mozilla/5.0 (compatible; proximic; +<uri>)
Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +<uri>)
Nutch/2.2.1 (page scorer; <uri>)
Mozilla/5.0 (compatible; adidxbot/2.0; +http://www.bing.com/bingbot.htm)
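
Several of the feed readers above embed a feed id or subscriber count, so each feed produces a distinct user agent string. A minimal sketch, assuming one wants to group such variants before counting them: replace the variable parts with placeholders like the ones used in the list. The patterns and the `normalize` helper are illustrative assumptions, and the sample values in the usage lines are made up.

```python
import re

# (pattern, replacement) pairs for the variable parts seen in the agents above.
_SUBSTITUTIONS = (
    (re.compile(r"https?://[^\s;)]+"), "<uri>"),
    (re.compile(r"(feed-id[:=])\s*[0-9A-Za-z]+"), r"\1<id>"),
    (re.compile(r"\d+ subscribers"), "<int> subscribers"),
)


def normalize(user_agent):
    """Replace URIs, feed ids and subscriber counts with fixed placeholders."""
    for pattern, replacement in _SUBSTITUTIONS:
        user_agent = pattern.sub(replacement, user_agent)
    return user_agent


# Two made-up variants of the same reader collapse to one string.
print(normalize("Feedbin feed-id:12345 - 7 subscribers"))
print(normalize("Feedbin feed-id:999 - 12 subscribers"))
```

Version numbers are left untouched on purpose, since they can matter when telling browsers apart.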

allinurl commented 9 years ago

@Aeyoun Thanks for posting this. Are these not being recognized under the browsers or os panel, or both?

da2x commented 9 years ago

This was about browsers. Maxthon is identified as Chrome. The others end up in "Unknown".

We could make some educated guesses to improve on OS detection, though. Google should get their own OS, because they don't use anything known and certainly use all custom hardware and software. Here is a User-Agent to OS matching:

Pulling the above out of the unknown OS category should drop my unknowns from 68% to 57%.

aphorise commented 9 years ago

IMO Google is not an OS and should not get a separate category. Unknown or Unidentified may be more fitting where no OS has been declared or included in the UA. There are some good guesses that can be made for typical or common use cases / scenarios involving crawlers, bots & services in general; however, they'd remain a guess / assumption at best.

areis422 commented 8 years ago

Having the same issue:

6 - Operating Systems
Total: 1/1
Hits    Vis.       %   Bandwidth Data
------- ---- ------- ----------- ----
1401742 3459 100.00%     0.0   B Unknown

7 - Browsers
Total: 1/1
Hits    Vis.       %   Bandwidth Data
------- ---- ------- ----------- ----
1401742 3459 100.00%     0.0   B Unknown

allinurl commented 8 years ago

@areis422 Do you have the right format?

Since it only recognized 1 browser/OS (all unknown), it seems like you may not have the right log format. Please double check that; otherwise feel free to post a few lines from your log and the log format being used.

areis422 commented 8 years ago

Standard Apache logs, using (CLF):

time-format %H:%M:%S
date-format %d/%b/%Y
log-format %h %^[%d:%t %^] "%r" %s %b

areis422 commented 8 years ago

Switched to NCSA format and I'm getting browsers and O/S now. Sorry for the bother.
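
For anyone hitting the same thing: the CLF `log-format` above ends at `%b`, so there is no user agent field to classify, whereas the combined format appends the referrer and user agent (`"%R" "%u"` in GoAccess specifiers, as far as I know). A minimal sketch of the difference, with made-up log lines and hypothetical regexes that are not how GoAccess parses logs:

```python
import re

# Common Log Format: host ident user [time] "request" status bytes
CLF = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)
# The combined format appends two quoted fields: "referrer" "user agent".
COMBINED = re.compile(CLF.pattern + r' "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$')


def user_agent(line):
    """Return the user agent of a combined-format line, or None if it has none."""
    match = COMBINED.match(line)
    return match.group("agent") if match else None


# Made-up sample lines.
clf_line = '127.0.0.1 - - [10/Oct/2015:13:55:36 +0000] "GET / HTTP/1.1" 200 2326'
combined_line = clf_line + ' "-" "NerdyBot"'

print(user_agent(clf_line))
print(user_agent(combined_line))
```

A CLF line yields no user agent at all, which is consistent with every hit landing in "Unknown" for both panels.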

pluscubed commented 7 years ago

It would be great to be able to see the different "Unknown" UAs. I'm currently using GoAccess with an API, so I set my own unique UA in my client app.

allinurl commented 7 years ago

@plusCubed #560 will add the ability to load your custom list of browsers. Based on this comment, I'll probably add some way of displaying some of the most popular UAs from the unknown category.

szepeviktor commented 6 years ago

Strange HUAWEI + Android + Facebook + X See https://github.com/allinurl/goaccess/issues/997