argasi / google-bigquery

Automatically exported from code.google.com/p/google-bigquery

Retrieving queryResults takes a very long time. #199

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run a GROUP EACH BY query that generates more than 1 million rows.
Sample query: "SELECT DATE(event_time) as date, amplitude_id, e_3079332, COUNT(*) FROM [a_127277.127277_859488_2014_11] WHERE event_time >= '2014-11-15 00:00:00' AND event_time < '2014-12-01 00:00:00' AND e_3079332 IS NOT NULL GROUP EACH BY date, amplitude_id, e_3079332";

2. Call setAllowLargeResults(true) and specify a destination table (a configuration sketch follows after these steps).
3. The query itself finishes in under 15 seconds, but retrieving the results takes around 28-30 minutes.
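
For reference, a minimal sketch of this configuration, assuming the newer google-cloud-bigquery Java client rather than whichever client version this report used; the dataset and table names below are placeholders:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.bigquery.TableResult;

    public class LargeResultsQuery {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        String sql = "SELECT DATE(event_time) as date, amplitude_id, e_3079332, COUNT(*) "
            + "FROM [a_127277.127277_859488_2014_11] "
            + "WHERE event_time >= '2014-11-15 00:00:00' AND event_time < '2014-12-01 00:00:00' "
            + "AND e_3079332 IS NOT NULL "
            + "GROUP EACH BY date, amplitude_id, e_3079332";

        // allowLargeResults requires legacy SQL and an explicit destination table.
        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
            .setUseLegacySql(true)                                          // GROUP EACH BY is legacy SQL
            .setAllowLargeResults(true)                                     // results may exceed the normal response size cap
            .setDestinationTable(TableId.of("my_dataset", "query_results")) // placeholder destination
            .build();

        // Runs the query job and waits for it to finish; the job itself is fast,
        // the slowness reported here is in paging through the result rows afterwards.
        TableResult result = bigquery.query(config);
        System.out.println("Total rows: " + result.getTotalRows());
      }
    }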

What is the expected output? What do you see instead?
The query takes around 15 seconds and a destination table is created with the 
results inside it.
getQueryResults() returns 87,500 rows in every response, but there is a long delay before each paged response comes back from BigQuery's side. The network at the client is not saturated at all; it is simply waiting for the response to arrive.
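
For context, the paged retrieval looks roughly like the following, again sketched against the newer google-cloud-bigquery Java client and reading the destination table via tabledata.list (an assumption on my part; the report itself mentions getQueryResults()). Each getNextPage() call blocks until one paged response arrives, which is where the delay shows up:

    import com.google.api.gax.paging.Page;
    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQuery.TableDataListOption;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.TableId;

    public class PagedRead {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId destination = TableId.of("my_dataset", "query_results"); // placeholder destination table

        // Each call to getNextPage() issues another tabledata.list request
        // and blocks until that paged response comes back.
        Page<FieldValueList> page =
            bigquery.listTableData(destination, TableDataListOption.pageSize(87_500));
        long rows = 0;
        while (page != null) {
          for (FieldValueList row : page.getValues()) {
            rows++; // process the row here
          }
          page = page.hasNextPage() ? page.getNextPage() : null;
        }
        System.out.println("Read " + rows + " rows");
      }
    }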

What version of the product are you using? On what operating system?

Please provide any additional information below.
PROJECT_ID = 'amplitude-core'

Questions:
1) This is slowing down our queries tremendously. How can we speed this up?
2) Also, can we increase the number of rows returned in each paged response?

Thanks,
Nirmal

Original issue reported on code.google.com by nir...@amplitude.com on 16 Dec 2014 at 12:49

GoogleCodeExporter commented 9 years ago
This is something that we realize we need to address, but it is not going to be a quick fix. If you're dealing with large result sizes, your best option is generally to run an extract job to CSV.
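
As a rough illustration of that suggestion (not part of the original comment), an extract job to Google Cloud Storage might look like the sketch below, assuming the newer google-cloud-bigquery Java client; the bucket, dataset, and table names are placeholders:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.ExtractJobConfiguration;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.TableId;

    public class ExtractToCsv {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        TableId source = TableId.of("my_dataset", "query_results");   // placeholder: the query's destination table
        String destinationUri = "gs://my-bucket/query_results-*.csv"; // wildcard lets BigQuery shard large exports

        // CSV is the default format for extract jobs.
        ExtractJobConfiguration config = ExtractJobConfiguration.of(source, destinationUri);
        Job job = bigquery.create(JobInfo.of(config));
        job = job.waitFor(); // blocks until the extract job completes

        if (job.getStatus().getError() != null) {
          System.err.println("Extract failed: " + job.getStatus().getError());
        } else {
          System.out.println("Exported to " + destinationUri);
        }
      }
    }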

You might also consider parallelizing the data fetch operations. If you have 2 million rows, you could run 10 parallel threads that each fetch a different portion of the table (by using row indices). This could help drop the time to read the table by an order of magnitude.
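
A sketch of that parallel approach (again my addition, not part of the original comment), using tabledata.list with a start index via the newer google-cloud-bigquery Java client; the table name, total row count, and thread count are assumed placeholders:

    import com.google.api.gax.paging.Page;
    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQuery.TableDataListOption;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.TableId;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelTableRead {
      public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("my_dataset", "query_results"); // placeholder

        long totalRows = 2_000_000L; // assumed row count; in practice read it from the table metadata
        int threads = 10;
        long chunk = (totalRows + threads - 1) / threads; // rows per worker

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
          final long startIndex = i * chunk;
          Callable<Long> worker = () -> {
            // Each worker pages through its own slice of rows, starting at startIndex.
            Page<FieldValueList> page = bigquery.listTableData(
                table,
                TableDataListOption.startIndex(startIndex),
                TableDataListOption.pageSize(Math.min(chunk, 100_000)));
            long read = 0;
            while (page != null && read < chunk) {
              for (FieldValueList row : page.getValues()) {
                read++;                   // process the row here
                if (read >= chunk) break; // stay within this worker's slice
              }
              page = page.hasNextPage() ? page.getNextPage() : null;
            }
            return read;
          };
          futures.add(pool.submit(worker));
        }

        long total = 0;
        for (Future<Long> f : futures) {
          total += f.get();
        }
        pool.shutdown();
        System.out.println("Read " + total + " rows across " + threads + " threads");
      }
    }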

Original comment by tig...@google.com on 19 Dec 2014 at 12:22