UpendraSingh / analytics-issues

Automatically exported from code.google.com/p/analytics-issues

Incorrect data returned when more than one dimension used. #91

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
--------------------------------------------------------------------------
NOTE: This issue tracking system is for developer products only.  If you
are not a developer/programmer and have a problem with a Google web site,
please report the problem to the appropriate group.  More information can
be found here: http://www.google.com/support/
--------------------------------------------------------------------------

Name of API affected:
Analytics Data Export API

Issue summary:
I am seeing issues with the data being returned when I run a query on more than 
one dimension at a time.  Specifically I am using two custom variables.  When I 
run the report with ga:customVarValue2 only for the entire month of June I get 
38726 rows (I need to run the query 4 times in order to get all the data since 
10000 rows is the max in any query).  So after running this I know that there 
are 38726 unique entries for this custom variable.  I then run the report again 
combining ga:customVarValue2 and ga:customVarValue1.  When I do this I only get 
12918 rows returned.  I know that there is more data available as I have run 
the report for each variable on their own.
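
The paging described above can be sketched roughly as follows. This is only an illustration: it assumes the Data Export API's `start-index` and `max-results` query parameters with 1-based indexing, the `page_params` helper name is hypothetical, and no request is actually sent.

```python
import math

MAX_RESULTS = 10000  # per-query row cap mentioned in the report


def page_params(total_rows, page_size=MAX_RESULTS):
    """Return one parameter dict per page needed to cover total_rows.

    Assumes a 1-based `start-index`; `page_params` is a hypothetical helper.
    """
    pages = math.ceil(total_rows / page_size)
    return [
        {"start-index": 1 + i * page_size, "max-results": page_size}
        for i in range(pages)
    ]


pages = page_params(38726)
print(len(pages))                # 4
print(pages[-1]["start-index"])  # 30001
```

With 38726 rows and a cap of 10000, four requests are needed, which matches the four runs mentioned in the summary.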

Steps to reproduce issue:
1. Run a query using ga:customVarValue2
2. Count the number of rows returned
3. Run the same query using ga:customVarValue2 and ga:customVarValue1 combined
4. Count the number of rows returned and compare the data.
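
The comparison in step 4 could be done along these lines. This is a minimal sketch that assumes the rows from both queries have already been fetched into memory; `missing_keys` is a hypothetical helper and the sample rows are made up for illustration.

```python
def missing_keys(single_rows, combined_rows):
    """Return CV2 values seen in the single-dimension result but not in the combined one.

    single_rows: iterable of ga:customVarValue2 values
    combined_rows: iterable of (customVarValue2, customVarValue1) tuples
    """
    combined_cv2 = {cv2 for cv2, _cv1 in combined_rows}
    return sorted(set(single_rows) - combined_cv2)


# Made-up sample rows, not real report data
single = ["100008", "100009", "100037", "100040"]
combined = [("100040", "IRELAND-GALWAY")]
print(missing_keys(single, combined))  # ['100008', '100009', '100037']
```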

Expected output:
I would expect the exact same number of rows to be returned when I query custom 
variable 1 and 2 as when I query custom variable 1 on its own.

Actual results:
There are 25808 values missing when I run a combined report with two custom 
variables.

Notes:
The attached files show the difference in the data being returned.  The example 
without the country and city has the correct data.  When you compare the first 
few rows of both images there are loads of entries missing from the combined 
query with the country.  For example the first three entries 100008, 100009 and 
100037 are missing from the second report.

I have also noticed that if I run this query for 1 day (June 1st) and I add a 
filter to only show data from "IRELAND-GALWAY" I get 89 rows of data.  When I 
run the exact same query for 1 week (June 1st to June 7th) I only get 61 rows 
returned.  And when I run it for the entire month of June I only get 29 rows of 
data.  Surely I should be getting at least 89 rows for the month if one day on 
its own has 89 rows.  If I add up all the visits in this query of 29 rows 
returned I get 5838 visits.  But according to the standalone query of 
just the location (ga:customVarValue1) there should be 6014 visits for 
IRELAND-GALWAY.  This means there are 176 visits (6014 - 5838) not being 
counted in the combined report.  
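
One way to cross-check the observation above is to query each day separately and take the union of the row keys, since the month should contain at least every row seen on any single day. The sketch below assumes a hypothetical `fetch_rows(start_date, end_date)` callable that returns the row keys for one query; `fake_fetch` is a stand-in with made-up keys, not a real API call.

```python
from datetime import date, timedelta


def daily_ranges(start, end):
    """Yield (day, day) pairs covering start..end inclusive."""
    day = start
    while day <= end:
        yield day, day
        day += timedelta(days=1)


def union_of_keys(fetch_rows, start, end):
    """Collect distinct row keys across one-day queries.

    fetch_rows is a hypothetical callable: (start_date, end_date) -> iterable of keys.
    """
    keys = set()
    for s, e in daily_ranges(start, end):
        keys.update(fetch_rows(s, e))
    return keys


def fake_fetch(s, e):
    # Stand-in for a real query: one key unique to the day plus one shared key
    return {f"user-{s.day}", "user-common"}


keys = union_of_keys(fake_fetch, date(2010, 6, 1), date(2010, 6, 7))
print(len(keys))  # 8

# Shortfall between the standalone and combined visit totals quoted above
print(6014 - 5838)  # 176
```

If the union over single days is larger than what the month-long query returns, that points at the long-range query dropping rows rather than at the data itself.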

Original issue reported on code.google.com by fcdevt...@fmr.com on 19 Jul 2010 at 9:25

GoogleCodeExporter commented 8 years ago
After reading through some other issues I see a comment about the data being 
sampled.  Can you please tell me how I can run a query using the API that 
returns real data and not sampled data?  We need to run some reports but we 
cannot get accurate data as the sampling is so inaccurate.

Original comment by fcdevt...@fmr.com on 19 Jul 2010 at 9:29

GoogleCodeExporter commented 8 years ago
Using multiple dimensions often reduces the numbers returned, because using 
multiple dimensions increases the number of "buckets" a visitor must fall into 
in order to be part of the result set.

For instance, if out of all your site traffic in a particular timespan 100 
visitors came to your site via search keywords, and 
100 visitors visited more than one page, running a query with "secondPagePath" 
and "keyword" as dimensions would not show "at least as many" as if you'd 
searched for keyword alone.  Results would only include people who had come to 
the site via search, and had visited more than one page.  People who direct 
linked, used bookmarks, or left after a single pageview would not be in the 
results.

In your case, unless CV1 and CV2 *ALWAYS* happen together, the numbers you see 
will always be smaller when queried together for this reason.
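
The "bucket" effect can be illustrated with a toy model. This is only a sketch of the behaviour described in this comment, not the API's actual implementation; the hits below are made up, and `None` stands for a dimension that was not set on a hit.

```python
from collections import Counter

# Made-up hits; None means the dimension was not set on that hit
hits = [
    {"cv1": "IRELAND-GALWAY", "cv2": "100008"},
    {"cv1": "IRELAND-GALWAY", "cv2": "100009"},
    {"cv1": None, "cv2": "100037"},
    {"cv1": "IRELAND-DUBLIN", "cv2": "100008"},
]


def rows_for(hits, dims):
    """Group hits by dims, dropping any hit that lacks one of the dimensions."""
    counts = Counter()
    for hit in hits:
        key = tuple(hit[d] for d in dims)
        if None not in key:
            counts[key] += 1
    return counts


single = rows_for(hits, ["cv2"])
combined = rows_for(hits, ["cv2", "cv1"])
print(sorted(k[0] for k in single))        # ['100008', '100009', '100037']
print(sorted({k[0] for k in combined}))    # ['100008', '100009']
```

In this model the hit with no `cv1` disappears from the combined result. Conversely, if every hit carries both values, the combined query can only produce as many rows as the single-dimension query or more, never fewer.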

If this isn't your case, please provide more detail and we can investigate 
further...

Original comment by api.alex...@gtempaccount.com on 19 Jul 2010 at 5:05

GoogleCodeExporter commented 8 years ago
Thanks for the fast reply.  I understand your point about multiple dimensions 
reducing the numbers returned but in this case every pageview has a value for 
both custom variables.  We know that they always happen together as they are 
being set manually for each request that is sent to Google.

I suspect that the issue is that the data is being sampled.  I know it is in the 
GUI tool but I expected the API to provide all the data.  Since submitting this 
query I read that the data returned by the API is also sampled.

I have come up with a number of "solutions" to the problem but each time I 
think I have a resolution I realise that it will not work.  The 1st option was 
running queries using the API as opposed to the GUI tool.  The next was to set 
up profiles for each of the values of CV2 in the GUI tool and then look at the 
auto-generated report for CV1.  But we cannot do this as you cannot create 
filters on Custom Variables.  Then we thought about setting the User Defined 
variable to be the same as CV2 so that we can apply filters to that, but because 
it is deprecated we don't know how long this will work for, and it also means 
that we will not have a full month's worth of data until the end of August.

Do you have any suggestions as to how we can get the full raw data while 
combining 2 CVs?  I am prepared to try anything at this stage!

Original comment by fcdevt...@fmr.com on 20 Jul 2010 at 9:44

GoogleCodeExporter commented 8 years ago
Removing an obsolete label that was used when these issues were in the 
gdata-issues project.

Original comment by jrobbins@google.com on 21 Jul 2011 at 10:04