Streaming API (insertAll) takes too long to come back with a response

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Using the insertAll() method for a relatively large number of times. 
2. 95% of the requests finish in < 1 second, but a few requests take between 
15-16 minutes (900 - 960 seconds) and co-incidentally, it is always between 15 
and 16 minutes.  
3. Retrying also doesn't help as the next request has to wait for the same 
amount of time and there will be duplication of data in this case. 

What is the expected output? What do you see instead?
Output is as it is expected but at a 1000x slower rate. 

What version of the product are you using? On what operating system?
Ubuntu box an a Linux machine. 

Please provide any additional information below.
Project_ID : 513990544911
One of the requests which took a very long time : 
https://www.googleapis.com/bigquery/v2/projects/513990544911/datasets/amp_127890
/tables/127890_action_impression/insertAll?alt=json
The above request took around 947 seconds. 

This is causing a huge problem in our data pipeline and our systems start 
falling behind the realtime data coming in and never able to catch up. 

Any kind of assistance will be appreciated. 

Thanks,
Nirmal

Original issue reported on code.google.com by nir...@amplitude.com on 3 Dec 2014 at 10:46

GoogleCodeExporter commented 9 years ago

Couple of follow up questions:

1) Do you see a successful completion of the insertAll requests? Internally 
we'll time out any requests that take longer than a few seconds, so a 15-16 
minute delay is quite unexpected. If you're seeing successful completion, it's 
possible the delay is somewhere else in the system. I took a look at our 
request logs over the last few days and don't see requests (success or failure) 
that took in those time ranges.

2) Are you generating insertion IDs? This should atleast allow for a fast 
retry? We will deduplicate in this case. You can then set a timeout on your end 
and attempt fast retries.

Original comment by seanc...@google.com on 3 Dec 2014 at 11:03

GoogleCodeExporter commented 9 years ago

For some context, this is what our system sees for insertions to the above 
mentioned table over the last two days. The times are in milliseconds:

+--------------+---------------+----------+-------+
| ResponseCode | average_time  | max_time | count |
+--------------+---------------+----------+-------+
|          500 |        3550.0 |     5519 |     2 |
|          200 | 173.256051641 |     1601 |  3718 |
|          401 | 9.72727272727 |       36 |    22 |
+--------------+---------------+----------+-------+

Original comment by seanc...@google.com on 3 Dec 2014 at 11:06

GoogleCodeExporter commented 9 years ago

Hi, 

Thanks for the reply. 
1) Yes, I do see a successful completion of all the insertAll requests. I have 
tested other dependencies and all of them look correct to me. I will observe 
the network behavior while it happens to see if the rows are actually being 
sent to BigQuery. 
2) I am generating insertIds, but that doesn't help much in this scenario. I 
tried timing out the queries after 30 seconds and retry them with the same 
insertIds. The following requests also take that long (930-940 seconds) and 
that makes me believe it might just duplicate respective rows. 

The above problem had resurfaced around 3:00 pm PST (12-03) for the respective 
dataset - table combinations:
amp_127890  : 127890_action_impression
amp_127890 : 127890_action_read

Thanks.

Original comment by nir...@amplitude.com on 3 Dec 2014 at 11:15

GoogleCodeExporter commented 9 years ago

Our logging near the edge continues to indicate that we're returning fairly 
quickly (the pXXX_ms are the 50th, 95th, and 99.9th percentile latencies in ms, 
over the last 2 days):

+-------------------------------------------------------------------------------
------------------------------------------+--------------+--------+--------+----
-----+-------+
| request                                                                       
                                          | ResponseCode | p50_ms | p95_ms | 
p999_ms | count |
+-------------------------------------------------------------------------------
------------------------------------------+--------------+--------+--------+----
-----+-------+
| POST 
/bigquery/v2/projects/513990544911/datasets/amp_127890/tables/127890_action_clic
k/insertAll?alt=json HTTP/1.1      |          200 |    120 |    168 |     708 | 
 1581 |
| POST 
/bigquery/v2/projects/513990544911/datasets/amp_127890/tables/127890_action_clic
k/insertAll?alt=json HTTP/1.1      |          401 |      8 |    306 |     306 | 
    9 |
| POST 
/bigquery/v2/projects/513990544911/datasets/amp_127890/tables/127890_action_impr
ession/insertAll?alt=json HTTP/1.1 |          200 |    148 |    248 |    1396 | 
 3032 |
| POST 
/bigquery/v2/projects/513990544911/datasets/amp_127890/tables/127890_action_impr
ession/insertAll?alt=json HTTP/1.1 |          401 |      8 |     36 |      41 | 
   36 |
| POST 
/bigquery/v2/projects/513990544911/datasets/amp_127890/tables/127890_action_impr
ession/insertAll?alt=json HTTP/1.1 |          500 |   1581 |   1581 |    1581 | 
    1 |
| POST 
/bigquery/v2/projects/513990544911/datasets/amp_127890/tables/127890_action_read
/insertAll?alt=json HTTP/1.1       |          200 |    121 |    164 |    1571 | 
 2536 |
| POST 
/bigquery/v2/projects/513990544911/datasets/amp_127890/tables/127890_action_read
/insertAll?alt=json HTTP/1.1       |          401 |      8 |     11 |      11 | 
   12 |
| POST 
/bigquery/v2/projects/513990544911/datasets/amp_127890/tables/127890_action_read
/insertAll?alt=json HTTP/1.1       |          503 |     65 |     73 |      73 | 
    2 |
+-------------------------------------------------------------------------------
------------------------------------------+--------------+--------+--------+----
-----+-------+

I took a look at our logs from 2:30 - 3:10, and I do notice an interesting gap 
of ~15 minutes (starts at 2:44:45PM, ends at 3:00:23PM) where we receive no 
requests for any of the above targets. The times we log *should* be the first 
point at which we see the requests.

Is your system single threaded, such that one long call will block all 
subsequent calls? If not, and the problem is on our end, I wouldn't expect to 
see a gap like this.

Original comment by seanc...@google.com on 4 Dec 2014 at 5:22

GoogleCodeExporter commented 9 years ago

Hi, 

The problem hasn't resurfaced since yesterday. 

Very likely reason : I wasn't refreshing the bigquery_service object I was 
creating and the BigQuery Python API seems to be accepting the request, holding 
it for 15 odd minutes and then forwarding the request to BigQuery. A relevant 
message by the Python API would have helped narrow down the issue. 

I will update this issue if I see such behavior again. 

Thanks for the help.

Original comment by nir...@amplitude.com on 5 Dec 2014 at 1:39

GoogleCodeExporter commented 9 years ago

Ok. Closing out this issue for the time being. Definitely re-ping if you start 
to see issues again.

Original comment by seanc...@google.com on 5 Dec 2014 at 8:00

Changed state: WontFix

argasi / google-bigquery

Streaming API (insertAll) takes too long to come back with a response #196