allinurl / goaccess

GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
https://goaccess.io
MIT License

Multiply unique visitors for feed services #130

Open da2x opened 10 years ago

da2x commented 10 years ago

Each of the User-Agent samples below is currently counted as one unique visitor. However, each unique User-Agent should be counted as one multiplied by its number of subscribers. (Most visits come from different IP addresses but show the same number of subscribers, so there is a risk of over-counting.) Sample implementation.

NewsBlur Page Fetcher - 219 subscribers - http://www.newsblur.com/site/123456/example (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
NewsGatorOnline/2.0 (http://www.newsgator.com; 79 subscribers)
Netvibes (http://www.netvibes.com/; 105 subscribers; feedID: 33718936)
Mozilla/5.0 (compatible; YandexBlogs/0.99; robot; B; +http://yandex.com/bots)10 readers
Bloglines/2.0 (http://www.bloglines.com; 810 subscribers)

It would possibly make sense to do something more interesting with feed subscriptions as well.

allinurl commented 10 years ago

Daniel, if I understand correctly, a feed provider will retrieve your feed only once using the provider's IP (e.g., NewsBlur) and will report the total subscriber count as part of the user-agent? Are there any exceptions to this? Can a feed provider fetch the feed twice or more?

da2x commented 10 years ago

Not quite. From what I see in my own logs, the user-agents are the same (including the same subscriber number) but they fetch from different IP addresses pretty much every time. The way these services work is that they fetch popular feeds more often (like every five minutes) and less popular feeds less often (every 9 hours).

I think this logic would work: Look in all user-agents for " subscribers" or " readers". Match the int in front of those matched strings. Exclude the int from User-Agent. Drop every user-agent matching this new int-free user-agent and only count it once. Use the matched int instead for the unique count.

Pitfalls: The number of subscribers can grow through a day.
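
The logic above could be sketched roughly like this in Python. This is only an illustration and not GoAccess code: the names COUNT_RE, split_count and feed_visitors are made up, and the regex is an assumption based only on the samples in this thread. To deal with the pitfall of the number changing during a day, the sketch keeps the largest count seen for each int-free user-agent.

```python
import re

# Assumed pattern: an integer followed by "subscribers" or "readers",
# as seen in the sample user-agents in this thread.
COUNT_RE = re.compile(r"(\d+)\s+(?:subscribers|readers)\b")


def split_count(user_agent):
    """Return (user-agent with the count removed, count), or (user_agent, None)."""
    m = COUNT_RE.search(user_agent)
    if m is None:
        return user_agent, None
    stripped = user_agent[:m.start()] + user_agent[m.end():]
    return stripped, int(m.group(1))


def feed_visitors(user_agents):
    """Map each int-free user-agent to the largest count reported for it."""
    best = {}
    for ua in user_agents:
        key, count = split_count(ua)
        if count is None:
            continue  # ordinary visitor, counted the usual way elsewhere
        best[key] = max(best.get(key, 0), count)
    return best
```

Summing the values of feed_visitors(...) would then give the number of unique visitors to add for feed services, with every fetch by the same service collapsed into one entry.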

da2x commented 10 years ago

Damn. Looks like it has to be a hard-coded list. Found out that at least NewsBlur uses three different-purpose User-Agents which all report the subscriber numbers: “NewsBlur Page Fetcher”, “NewsBlur Feed Fetcher”, and “NewsBlur Favicon Fetcher”. Only the one called “Feed Fetcher” should be used for the subscriber count (that is the one that makes the most requests for the RSS feed).

NewsBlur Page Fetcher - 220 subscribers - http://www.newsblur.com/site/5241507/aeyoun (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
NewsBlur Feed Fetcher - 220 subscribers - http://www.newsblur.com/site/5241507/aeyoun (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
NewsBlur Favicon Fetcher - 219 subscribers - http://www.newsblur.com/site/5241507/aeyoun (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
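
A hard-coded list might look something like the sketch below. Again this is just an illustration: COUNTED_MARKERS and subscriber_count are hypothetical names, the markers come only from the user-agents quoted in this thread, and the regex is the same assumed pattern as before. Only user-agents containing one of the markers contribute a count, so the NewsBlur page and favicon fetchers are ignored.

```python
import re

# Assumed pattern: an integer followed by "subscribers" or "readers".
COUNT_RE = re.compile(r"(\d+)\s+(?:subscribers|readers)\b")

# Hypothetical hard-coded list, built only from the samples in this thread.
COUNTED_MARKERS = (
    "NewsBlur Feed Fetcher",
    "NewsGatorOnline/",
    "Netvibes",
    "YandexBlogs/",
    "Bloglines/",
)


def subscriber_count(user_agent):
    """Return the reported count for a listed feed fetcher, else None."""
    if not any(marker in user_agent for marker in COUNTED_MARKERS):
        return None
    m = COUNT_RE.search(user_agent)
    return int(m.group(1)) if m else None
```

With the three NewsBlur samples above, only the "Feed Fetcher" line yields a count (220); the other two return None and would not be multiplied.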