Tatvic / RGoogleAnalytics

R Library to easily extract data from the Google Analytics API into R

Mismatch between pageviews on GA and the returned results of GetReportData #27

Open ghost opened 8 years ago

ghost commented 8 years ago

Hi,

I followed the instructions for linking the core API, and I seem to get results, but when I run:

query.list <- Init(start.date = "2016-05-01",
                   end.date = "2016-05-03",
                   dimensions = "ga:date,ga:pagePath",
                   metrics = "ga:pageviews",
                   max.results = 10000,
                   sort = "-ga:date",
                   table.id = "ga:xxxxx")
ga.query <- QueryBuilder(query.list)
mydata <- GetReportData(ga.query, token, split_daywise = TRUE)

I get a mismatch between what I see on Google Analytics and the result from GetReportData. Any pointers?

Thanks and best wishes, I love your work!

Dan

BobbyBarbeau commented 8 years ago

Apologies if the following is too basic, but a few things off the bat that could result in discrepancies:

When I've experienced data mismatches, I've often found the source was related to sampling, table.id mismatch, filters/segments, or the like.

ghost commented 8 years ago

Hi Bobby, I imagine it has something to do with sampling, but I couldn't figure out how to turn it on or off. The image you attached is, unfortunately, not very instructive for me. I did check the table.id, and I made sure I was looking at the correct view in GA. In fact, I only have one view and it's not segmented, so that shouldn't be a problem. I'm not sure about your final possibility. Could you give me a pointer on the sampling issue? If that doesn't work, I will look further into the dimension thing. Best wishes, Dan

BobbyBarbeau commented 8 years ago

Dan,

For more info on sampling, see https://support.google.com/analytics/answer/2637192?hl=en

Another way to see if sampling is an issue is to simply rerun your query, but drop the split_daywise argument. (split_daywise eliminates sampling for the most part by breaking a large data set into smaller daily data sets.)

Try rerunning your query like this:

mydata <- GetReportData(ga.query, token)

Are there any messages in R about the query being sampled? If there are, then the mismatch is caused by sampling.

If not, then neither the GA interface nor the API should be using sampled data.
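To make the split_daywise idea concrete, here is a rough sketch of what "breaking a large data set into smaller daily data sets" looks like in plain R. This is not the package's actual implementation; fetch_day is a hypothetical stand-in that returns made-up rows, and the column names (date, pagePath, pageviews) are assumptions about the shape of the result.

```r
# Hypothetical stand-in for a single-day API call. A real version would
# build a query with start.date == end.date and call GetReportData().
fetch_day <- function(day) {
  data.frame(date = format(day, "%Y%m%d"),
             pagePath = c("/", "/about"),
             pageviews = c(10, 3),   # made-up numbers for illustration
             stringsAsFactors = FALSE)
}

# One entry per day in the reporting window.
days <- seq(as.Date("2016-05-01"), as.Date("2016-05-03"), by = "day")

# Query each day separately (indexing keeps the Date class intact),
# then stack the daily results into one data frame.
daily <- lapply(seq_along(days), function(i) fetch_day(days[i]))
combined <- do.call(rbind, daily)
```

Because each request covers only one day, each one is far less likely to cross the threshold at which sampling kicks in.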

In terms of my last point about dimensions, rather than wrangling with the data via dplyr, it would be much easier simply to run the query without the date dimension:

query.list <- Init(start.date = "2016-05-01",
                   end.date = "2016-05-03",
                   dimensions = "ga:pagePath",
                   metrics = "ga:pageviews",
                   max.results = 10000,
                   #sort = "-ga:date",
                   table.id = "ga:xxxxx")
ga.query <- QueryBuilder(query.list)
mydata <- GetReportData(ga.query, token)

Again, if you see messages in R about sampling, then you'd need to rerun with split_daywise included.

But the above should match the pageview data reported in the Behavior > Site Content > All Pages report in the GA interface if the data isn't being sampled.
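If you would rather keep the date dimension, the per-day rows can also be summed per page in base R before comparing with All Pages. The sketch below uses toy data and assumes the result has columns named date, pagePath and pageviews; it converts pageviews to numeric first in case the metric comes back as character.

```r
# Toy data in the shape assumed for a date + pagePath query (made-up values).
daily <- data.frame(
  date      = c("20160501", "20160501", "20160502", "20160503"),
  pagePath  = c("/", "/about", "/", "/"),
  pageviews = c("10", "3", "7", "5"),
  stringsAsFactors = FALSE
)

# Make sure the metric is numeric before summing.
daily$pageviews <- as.numeric(daily$pageviews)

# Total pageviews per page across the whole date range.
per_page <- aggregate(pageviews ~ pagePath, data = daily, FUN = sum)

# Sort descending so the order resembles the All Pages report.
per_page <- per_page[order(-per_page$pageviews), ]
```

The per_page totals are the numbers to check against the GA interface for the same date range.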

HTH

EDIT: commented out the sort as that would throw an error since there would be no date dimension to sort on. You could just delete it, but I'm commenting it out just to explain my edit.

JerryWho commented 8 years ago

When I get strange results using the API I often use the Query Explorer (https://ga-dev-tools.appspot.com/query-explorer/) to double-check the results.