eldy / AWStats

AWStats Log Analyzer project (official sources)
https://www.awstats.org

Hits vs Pages #137

Open Bllacky opened 5 years ago

Bllacky commented 5 years ago

Hi,

This is a feature request.

Overall I love Awstats for its simplicity. It works very well for most of my needs. However, it has a bit of trouble with bot detection.

I use multiple tools to monitor my website's traffic, all using various methods, and AWStats is one of them. Overall, AWStats is in agreement with the others, with one exception: there are lots of bots out on the internet that AWStats doesn't detect but which are obvious if you look at the List of Hosts.

Why are these bots obvious? Because most modern websites, including mine, load multiple files per visit: a bunch of CSS files, JS files, and so on. So if you see a visit that is 1 page, 1 hit and 10KB of traffic, you know that is not a real visit and is probably a bot. If I eliminate these visits from those counted by AWStats, then the AWStats statistics are in agreement with those of Google Analytics or Matomo.

So my request is to allow me to set some parameters in the config file based on which AWStats decides what is a real visit and what is a bot.

Example:

# Minimum number of hits required to consider a visit real (human) and not a bot (default: 1).
MinHitsPerVisit=7

# Minimum ratio between hits and pages in a visit to consider the visit real (human) and not a bot (default: 1).
MinHitsToPageRatio=1.5

# Minimum traffic of a visit to consider the visit real (human) and not a bot (default: 1KB).
MinVisitTraffic=100KB

Implement these 3 parameters and I can make my Awstats come in agreement with other traffic measuring software.
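
To make the request more concrete, here is a rough sketch (in Perl, since AWStats is written in Perl) of how the three thresholds could combine into a single check. The subroutine name and the hash are only illustrative, not existing AWStats code; the parameter names follow the proposal above.

# Hypothetical sketch only - not part of AWStats.
my %conf = (
    MinHitsPerVisit    => 7,        # proposed default: 1 (current behaviour)
    MinHitsToPageRatio => 1.5,      # proposed default: 1
    MinVisitTraffic    => 102_400,  # bytes (100KB); proposed default: ~1KB
);

# Returns 1 if a visit looks human under the proposed thresholds, 0 otherwise.
sub visit_looks_human {
    my ($pages, $hits, $bytes) = @_;
    return 0 if $hits < $conf{MinHitsPerVisit};
    return 0 if $pages > 0 && ($hits / $pages) < $conf{MinHitsToPageRatio};
    return 0 if $bytes < $conf{MinVisitTraffic};
    return 1;
}

With the example values above, a 1 page / 1 hit / 10KB visit would fail every test, while a typical human visit (one page, dozens of hits, 100KB+ of traffic) would pass.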

Thank you very much!

visualperception commented 4 years ago

I raised this in issue #59 but no response yet

visualperception commented 4 years ago

Also... the more people who ask for this, the more likely it is to happen. However, Eldy seems to have ceased significant development unless it is required to keep AWStats running. I think it will need a Perl developer from the community to implement this. Any offers? It would provide a significant improvement to the accuracy of AWStats for the whole community. I'm not a Perl programmer, or I would look at it myself.

Bllacky commented 4 years ago

I will try to find someone. Hopefully @eldy will accept the commit if it ever happens.

visualperception commented 4 years ago

Since AWStats is used by several significant web hosting companies with hundreds if not thousands of users, it is important that any changes to code such as this do not significantly affect the performance of AWStats. I think that would be the major concern for Eldy when considering whether or not to accept the commit, but it's not really for me to say.

Bllacky commented 4 years ago

I don't think there will be a significant performance hit, and it's a relatively simple filter. I think we can start a code bounty for this modification.

Like https://bountify.co/, http://www.coderbounty.com/, https://www.bountysource.com/

I'm more than willing to pitch in.

Bllacky commented 4 years ago

@visualperception I may have found someone to look into this issue and possibly implement this feature. Is there any way I can contact you?

visualperception commented 4 years ago

If you post an email address, I will send you an email which has my email address in it. You can set up a temporary email address for this purpose at https://www.gmx.com/ (or https://www.gmx.co.uk, I'm in the UK) and then delete it once we have exchanged private email addresses. https://www.gmx.com is part of 1&1, based in Germany; a good email provider without the bloat of Gmail. Is your person still willing to do it?

Regards VisualPerception

Bllacky commented 4 years ago

That's very kind of you. I've tried to make an account on GMX, but it fails every time with "A technical error has occurred. Error Code: eb6a5658-ec9b-4246-b5e5-0d0bf8a01f86".

What do you think of a temporary email address such as: https://www.throwawaymail.com/en

visualperception commented 4 years ago

Have you enabled first-party cookies for gmx.com?

Bllacky commented 4 years ago

Have you enabled first-party cookies for gmx.com?

Yes, and I have tried it with different browsers. Same result. Seems their website is broken.

visualperception commented 4 years ago

https://www.throwawaymail.com/en or any mail provider, but make sure you can delete the email account from it. My email will be temporary, so I'm not worried about spam, as I can delete/change it.

Bllacky commented 4 years ago

Send to dukobapugu@memsg.top Thank you!

visualperception commented 3 years ago

I sent email to your address several months back but I have heard nothing since. Did you find anyone willing to code this?

Bllacky commented 3 years ago

I sent email to your address several months back but I have heard nothing since. Did you find anyone willing to code this?

I did check that email address for a while, but if I fail to check it for 48 hours, it gets deleted. So I reckon your email got lost at some point. Anyway, you can try and send it again here: mitrothesl@matra.site

I did find someone who said they were willing, but in the end nothing came of it. So at this point there is no one to pick up this modification.

visualperception commented 3 years ago

OK, I have sent my email address to your posted email address at 01:05 GMT 2020-10-28.

visualperception commented 3 years ago

Oops, I have received a "Message undeliverable" from my mail server, as follows:

mitrothesl@matra.site Error Type: SMTP Connection to recipients server failed. Error: Host name: 104.27.164.72, message: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

visualperception commented 3 years ago

OK. I have tried again and so far I have had no "Message undeliverable" response. It seems that email server has already been trying to log into my mail server to send messages; I had several entries in my blacklist.

visualperception commented 3 years ago

Spoke too soon. I just got another Undeliverable message:

mitrothesl@matra.site Error Type: SMTP Connection to recipients server failed. Error: Host name: 104.27.165.72, message: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Bllacky commented 3 years ago

I will try with a different service. Looks like throwaway mail doesn't work. Try bllacky@guerrillamail.com? Thank you.

visualperception commented 3 years ago

Email sent 12:20 pm GMT, Wed 2020-10-28.

Bllacky commented 3 years ago

Email sent 12:20 pm GMT, Wed 2020-10-28.

These free mail accounts... they don't seem to work as they should. One last try: bllacky@mail2fun.com

visualperception commented 3 years ago

sent

Bllacky commented 3 years ago

Got it! Finally made contact.

chuckhoupt commented 3 years ago

How would this feature interact with cached resources (CSS, JS, images, etc.) and return visits? It seems like it might miss returning visits if resources are set to be cached for long periods. I.e. a first visit triggers 3+ hits for a page, but a second visit later in the day may only be a single hit on the page's HTML file. If MinHitsPerVisit or MinHitsToPageRatio are greater than 1, then the second visit would be ignored?

Bllacky commented 3 years ago

How would this feature interact with cached resources (CSS, JS, images, etc.) and return visits? It seems like it might miss returning visits if resources are set to be cached for long periods. I.e. a first visit triggers 3+ hits for a page, but a second visit later in the day may only be a single hit on the page's HTML file. If MinHitsPerVisit or MinHitsToPageRatio are greater than 1, then the second visit would be ignored?

In theory, I think you are absolutely right.

But I am not sure that is how it will work in practice, because I have never encountered such a situation. On my website, the first visit is usually 1 page - 50 hits, while return visits are usually 1 page - 20/30 hits. But I have never had 1 page - 1 hit from a valid visit. I often get 1 page - 1 hit from some IPs in Russia/China/Vietnam, etc.

I suppose a 1 page - 1 hit visit could be wrongly classified if all your resources were static and cacheable, but I am not sure there are any modern websites that work like that. That is also why we proposed making these settings configurable. My website, for example, has many microservices and dynamic scripts, and it would be impossible to have 1 page - 1 hit.

visualperception commented 3 years ago

chuckhoupt wrote:

How would this feature interact with cached resources (CSS, JS, images, etc.) and return visits? It seems like it might miss returning visits if resources are set to be cached for long periods. I.e. a first visit triggers 3+ hits for a page, but a second visit later in the day may only be a single hit on the page's HTML file. If MinHitsPerVisit or MinHitsToPageRatio are greater than 1, then the second visit would be ignored?

chuckhoupt, there are a couple or more ways you can check this. Firstly, you could look in the stored statistics file (e.g. awstats102020.domain.txt) at the section titled:

Host - Pages - Hits - Bandwidth - Last visit date - [Start date of last visit] - [Last page of last visit]
[Start date of last visit] and [Last page of last visit] are saved only if session is not finished
The 10 first Hits must be first (order not required for others)

This shows pages and hits for an IP. Where it is 0 or 1 pages and 1 hit, that is likely a robot. If the user agent does not contain a bot ID, then it is also a potential bot. And if that IP has accessed robots.txt, it can definitely be considered a bot. So there is plenty there to check against. You only need to check after the robot detection has run and it says it wasn't a bot; those are the ones we would like to be checked.

From the same section you could take everything between BEGIN_VISITOR and END_VISITOR, sort it on Pages - Hits ascending, and you will get a list with all the lowest pages and hits first, so you can stop processing once you pass the new conf file parameters for pages and hits. Just a few things to consider as suggestions. You may of course find a better way once you get into it, and you will need to modify all the relevant reports, stats database, etc.

Also consider that my investigations show something like 30% bad bots. Getting rid of those is unlikely to generate nearly as many wrong detections, so whilst it may not be 100% better, it is likely to leave only a small % error in detection and will improve the accuracy considerably. But this will come out in testing, I think.
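
For anyone who wants to experiment with this idea before touching awstats.pl itself, here is a rough standalone Perl sketch that scans the BEGIN_VISITOR section of a saved statistics file and prints hosts with very few pages and hits. The file name and thresholds are illustrative only, and it assumes each visitor line starts with "host pages hits bandwidth" as described above; a real implementation would of course live inside AWStats' own processing and update the reports and stats database.

use strict;
use warnings;

my $statsfile = 'awstats102020.domain.txt';  # example file name from above
my ($max_pages, $max_hits) = (1, 1);         # illustrative "0 or 1 pages and 1 hit" rule

open my $fh, '<', $statsfile or die "Cannot open $statsfile: $!";
my $in_visitors = 0;
while (my $line = <$fh>) {
    if ($line =~ /^BEGIN_VISITOR/) { $in_visitors = 1; next; }
    last if $in_visitors && $line =~ /^END_VISITOR/;
    next unless $in_visitors;
    # Assumed layout: Host Pages Hits Bandwidth LastVisitDate [...]
    my ($host, $pages, $hits, $bandwidth) = split ' ', $line;
    next unless defined $bandwidth && $hits =~ /^\d+$/;
    # Hosts at or below the thresholds are candidates for re-checking as robots
    # (e.g. did they fetch robots.txt, does the user agent look automated?).
    print "suspect: $host ($pages pages, $hits hits, $bandwidth bytes)\n"
        if $pages <= $max_pages && $hits <= $max_hits;
}
close $fh;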

visualperception

visualperception commented 3 years ago

@chuckhoupt

Any more thoughts on this?