18F / analytics.usa.gov

The US federal government's web traffic.
https://analytics.usa.gov

"Visits to all domains over 30 days" CSV Shows Duplicate Data #410

Closed namimody closed 2 months ago

namimody commented 7 years ago

BUG

Current Behavior

Seems improbable that many agencies would have the exact same number of visits, pageviews, users, exits to their sites...

Download from 11.21.16 (pink highlighted columns have improbable data): https://docs.google.com/spreadsheets/d/11wvYC1HyRZ3E5yZs1zj_etGqi8ysAkFu3V5g2MttoHA/edit?usp=sharing


Desired Behavior

Data collection methods should be revisited for number of visits, pageviews, users, exits.

Steps to Replicate

1) Go to https://analytics.usa.gov/data/
2) Download the "Visits to all domains over 30 days" CSV.
3) Sort the number of visits, pageviews, users, and exits columns from A-Z and note the duplicate field entries.
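The manual sort in the steps above could also be done with a short script. This is a minimal sketch, not part of the project's tooling: the column names (`domain`, `visits`, `pageviews`, `users`, `exits`) are assumptions about the CSV header, the helper `find_duplicate_metric_rows` is hypothetical, and the sample rows are made up for illustration rather than taken from the real download.

```python
import csv
import io
from collections import defaultdict

# Assumed column names; adjust to match the header of the downloaded CSV.
METRICS = ("visits", "pageviews", "users", "exits")


def find_duplicate_metric_rows(csv_text, metrics=METRICS, key="domain"):
    """Group domains that share an identical tuple of metric values."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[tuple(row[m] for m in metrics)].append(row[key])
    # Keep only the value tuples that more than one domain reports.
    return {vals: doms for vals, doms in groups.items() if len(doms) > 1}


# Made-up rows standing in for the real file.
SAMPLE = """domain,visits,pageviews,users,exits
a.example.gov,1200,3400,900,1200
b.example.gov,152,152,152,152
c.example.gov,152,152,152,152
d.example.gov,304,456,304,304
"""

duplicates = find_duplicate_metric_rows(SAMPLE)
for values, domains in duplicates.items():
    print(values, domains)
```

To run it against the real download, read the file's contents into a string and pass that in place of `SAMPLE`.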

Why This Matters

This data is only helpful if it is correct!

konklone commented 7 years ago

@tdlowden Is this an outcome of sampling?

tdlowden commented 7 years ago

Yes, I actually have been working with our Google vendors today trying to assess. As of this morning, our GA account is sampling at astronomical levels (using <1% of sessions) in places we could always get a 100% of sessions report previously. Not sure if there is something going on at Google, but this issue apparently had already been in place when the reports were run for our data downloads last night. We're looking to find the cause.
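One plausible way heavy sampling could produce identical numbers across sites is sketched below. It is a simplified illustration, not a description of how GA actually computes sampled reports: it assumes each session is kept with probability 1/`scale` and the sampled count is multiplied back up by `scale`. The function name, the seed, and the true counts are all made up.

```python
import random


def sampled_estimate(true_sessions, scale, rng):
    """Keep each session with probability 1/scale, then scale the count back up."""
    sampled = sum(1 for _ in range(true_sessions) if rng.random() < 1 / scale)
    return sampled * scale


rng = random.Random(0)  # fixed seed so the run is repeatable
scale = 100             # roughly a 1% sample, as an assumption

# Made-up true session counts for ten small sites, all different.
true_counts = [90, 110, 130, 150, 170, 190, 210, 230, 250, 270]
estimates = [sampled_estimate(n, scale, rng) for n in true_counts]

print(estimates)
print(len(set(estimates)), "distinct estimates for",
      len(set(true_counts)), "distinct true counts")
```

Under these assumptions every estimate is a multiple of `scale`, so low-traffic sites can only land on a handful of possible values and collisions become likely.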

smarina04 commented 7 years ago

It's due to sampling.



tdlowden commented 7 years ago

As an update on this, it looks like the problem was still occurring when the reports were run again last night, but from what I can tell in the GA interface, the issue seems to have been rectified as of 8:20 am. Hopefully, tomorrow our reports won't be subject to a sample, or at least not one as great as the sampling that was happening.

gbinal commented 7 years ago

First off - belatedly, thank you very much for the issue, @namimody. I wanted to check in with a brief update.

We've known that this is an outstanding issue and, unfortunately, it's not resolved. My understanding is that we may not have any luck getting GA to turn off sampling this far down the rabbit hole. We'll have to continue to balance where to draw the line between including more results in the data downloads and avoiding sampling.

@tdlowden - in the meantime, what are your thoughts on adding a sentence to the footer, and possibly including an * pointing to said disclaimer next to the datasets that have this as an issue?

levinmr commented 2 months ago

Closing stale issue.