GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

BigQuery Storage API throws a lot of 499 errors #285

Closed: juliankeppel closed this issue 3 years ago

juliankeppel commented 3 years ago

Hi,

We use the connector (v0.18.0) to download data from BigQuery to AWS S3 with Spark on EMR. For the past few days, our jobs have been taking much, much longer to finish (without any code changes, changes in the amount of data, etc.). Some of the bigger jobs don't even finish at all.

For example, see this Spark UI screenshot: [image]

Normally, the total job takes a maximum of 1.5 hours to complete. The tasks that do the actual download seem to be very slow.

In the dashboard for the Google BigQuery Storage API, we saw a lot of 499 errors: [image]

Do you have any idea what the problem could be? Were there any changes in the Storage API a few days ago, or something similar?

We can't find any errors in the Spark app logs.

sheelc commented 3 years ago

We are seeing the same issue in our Spark environment (Spark deployed on K8s, also an AWS environment), with connector version 0.17.3 and Spark 3.0.1.

The table size is about 140 GB, and after about 25% of the way through, the throughput on reading records drops dramatically and the job isn't able to finish. It used to complete in about 15 minutes, but after 12-16 hours we're still only about 25% of the way through.

No errors in the Spark app logs, and the issue started in the last few days (12/9/20 or 12/10/20). The BQ dashboard similarly shows increased errors on BQ 'ReadRecords'. We also haven't made any configuration or code changes; it was running fine for the whole month prior.

kmjung commented 3 years ago

I believe 499 translates to a gRPC CANCELLED status. Let me see if I can figure out where this is measured.

It would be helpful if one of you could share a session ID with me -- that's a string of the form projects/{project ID}/locations/{location ID}/sessions/{session ID} and should be available in job logs. You can use the BigQuery public bug tracker if you'd prefer to do that privately.
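
In case it helps to locate it: the session name follows that fixed pattern, so it can be pulled out of a saved driver or executor log with a small script. A minimal sketch (the log file name here is hypothetical):

import re

# Matches the session-name pattern described above.
SESSION_RE = re.compile(r"projects/[^/\s]+/locations/[^/\s]+/sessions/[^/\s]+")

with open("spark-driver.log") as f:  # hypothetical log file copied off the cluster
    sessions = sorted(set(SESSION_RE.findall(f.read())))

for name in sessions:
    print(name)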

juliankeppel commented 3 years ago

I can't find such a string in our logs, sorry! Would it help if I sent you the project ID and dataset ID? We also opened an issue with GCP support; can you take a look at that as well? We added that information there.

kmjung commented 3 years ago

Yes, I can look into the issue through GCP support.

juliankeppel commented 3 years ago

It's case #26211967. Anything else you need so far?

kmjung commented 3 years ago

Not at this point. I can reach out through the support case if we need more details.

sheelc commented 3 years ago

(In case it's helpful to have more sample points: we added the session string to the support case we had opened through our support partners, and they said they've sent it over to you!) We really appreciate you taking a look!

asiunov commented 3 years ago

@kmjung, thank you for solving this issue. Beyond the problem itself, do you have any insights into how we might detect this issue more easily? It was quite challenging: the Spark (bq-connector) logs and the Spark UI didn't tell us anything, it just looked like a long-running task, and it was hard to tell that it was stuck reading from BQ rather than doing actual data processing. We found the errors in the Google Cloud console only by accident, and we may not always have easy access to that UI. Maybe this is an improvement that could be made on the spark-bq-connector side, like logging the error? I'm not sure whether the bq connector gets anything back after this error.
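
(For illustration, here is a rough sketch of the kind of "make the stall visible" detection I mean, written against the Python BigQuery Storage client rather than the connector itself; the 120-second threshold and the logging are my own choices, not anything the connector does today.)

import logging
import time

log = logging.getLogger("bq_read_watchdog")
STALL_SECONDS = 120  # arbitrary threshold for "no visible progress"

def rows_with_stall_warning(reader, session):
    """Wrap a BigQuery Storage read, warning when rows stop arriving.

    Note: this only reports once the next row finally shows up; a stream that
    blocks forever would need a separate watchdog thread to be noticed.
    """
    last_progress = time.monotonic()
    for row in reader.rows(session):
        waited = time.monotonic() - last_progress
        if waited > STALL_SECONDS:
            log.warning("Read session %s stalled for %.0f s before the next row arrived",
                        session.name, waited)
        last_progress = time.monotonic()
        yield row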

kmjung commented 3 years ago

Thank you for solving this issue.

@asiunov has the issue been resolved on your end? We are continuing to investigate at the moment.

asiunov commented 3 years ago

We haven't tried recently; we've paused retries for the job while waiting for a resolution from your side.

kmjung commented 3 years ago

I see. An update for folks who are following this issue, then: we have scaled down an internal workload that we feel may have been causing CPU contention on the storage API, but have not reached a root cause determination yet. If you have a recent repro (after noon Pacific on 12/17) that we could investigate -- particularly one that was much slower than expected but completed or failed without being cancelled -- then it would be useful if you could share session or stream IDs from the job.

sheelc commented 3 years ago

(I work with @asiunov)

As per the instructions we received from support, we restarted the job and are seeing the same symptoms (it ingests ~2M records at normal speed and then slows to a halt, with only a few records trickling in every once in a while). We've kept these jobs running (3 table pulls, 3 Spark clusters) and sent the session IDs over in the support request.

kmjung commented 3 years ago

Thanks, I'm looking at the sessions now.

kmjung commented 3 years ago

Our back-end logs suggest that something is cancelling the ReadRows requests shortly after they're created (10-15 minutes at first, and then progressively sooner). I've seen this error before in cases where the HTTP/2 stream underlying the gRPC stream is closed unexpectedly, and there is retry logic in the client which is triggered in this case. I'm looking for a way to figure out on your end whether this is what's happening.

kmjung commented 3 years ago

@sheelc @asiunov would it be possible for you to enable verbose gRPC logging on your Spark workers? I have not done this myself -- maybe @davidrabinowitz can help here -- but you could try something like this: https://stackoverflow.com/questions/50243717/grpc-logger-level
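
(For anyone following along, one possible approach, assuming the connector ends up using gRPC-Java's default java.util.logging backend as in the linked answer; this is a sketch I have not verified on EMR or K8s, and the logger names may differ if the connector jar relocates the gRPC classes.)

# logging.properties -- verbose gRPC logging via java.util.logging
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINE
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
# Turn up the gRPC loggers (FINEST is extremely chatty).
io.grpc.level=FINE

# Ship the file to the workers and point the JVMs at it, for example:
#   spark-submit \
#     --files /path/to/logging.properties \
#     --conf spark.executor.extraJavaOptions=-Djava.util.logging.config.file=logging.properties \
#     --conf spark.driver.extraJavaOptions=-Djava.util.logging.config.file=/path/to/logging.properties \
#     ...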

juliankeppel commented 3 years ago

Thank you guys for the support and for investigating this problem!

Because the affected data is very important for our product, we've built a workaround with an intermediate export step to Cloud Storage, and we download the data from there with the help of this connector.

But direct streaming from BigQuery is still our preferred solution, and we would be happy to switch back to it once it's reliable again.
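
(Roughly, the workaround looks like the sketch below. The project, dataset, and bucket names are placeholders rather than our real ones, and the Spark part assumes the spark-avro package and a suitable Cloud Storage connector are available.)

from google.cloud import bigquery

# 1) Ask BigQuery to extract the table to Cloud Storage as Avro files.
bq = bigquery.Client(project="my-project")  # placeholder project
extract_job = bq.extract_table(
    "my-project.my_dataset.ga_sessions_20201214",               # placeholder table
    "gs://my-export-bucket/ga_sessions_20201214/part-*.avro",   # placeholder bucket
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)
extract_job.result()  # wait for the export job to finish

# 2) In the Spark job, read the exported files instead of streaming from BigQuery:
# df = spark.read.format("avro").load("gs://my-export-bucket/ga_sessions_20201214/*.avro")
# df.write.parquet("s3://my-target-bucket/ga_sessions_20201214/")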

I tried to reproduce the problem on the client side with this simple Python script:

from google.cloud.bigquery_storage import BigQueryReadClient
from google.cloud.bigquery_storage import types

N_ROWS_TO_FETCH = 10000

project_id = "rin-dt-ga-bq-export-prod"
dataset_id = "95242197"
table = f"projects/{project_id}/datasets/{dataset_id}/tables/ga_sessions_20201214"

client = BigQueryReadClient()

requested_session = types.ReadSession()
requested_session.table = table
requested_session.data_format = types.DataFormat.AVRO

session = client.create_read_session(
    parent=f"projects/{project_id}",
    read_session=requested_session,
)

print(f"Session name: {session.name}")

reader = client.read_rows(session.streams[0].name)

rows = reader.rows(session)
results = []

for n, row in enumerate(rows):
    if n < N_ROWS_TO_FETCH:
        results.append(row)
    else:
        break

print("Finished")

It's interesting that if I set N_ROWS_TO_FETCH to a pretty small number (for example 1000), it runs without any problems. If I set it to 10000 (which is still pretty small), it seems to run "forever" (I've been waiting for 10 minutes now). And one more interesting thing: when I tried this on Monday, I immediately got an error similar to this: error 8: RESOURCE_EXHAUSTED.

I'm not 100% sure whether this has something to do with our problem, but I suspect it does. Maybe it helps. The corresponding session ID is: projects/rin-dt-ga-bq-export-prod/locations/eu/sessions/CAISDFpfcjNhRXVJckRtahoCZWwaAndvGgJlZhoCZWoaAndh.

sheelc commented 3 years ago

would it be possible for you to enable verbose gRPC logging on your Spark workers?

Posting on behalf of @asiunov, but he thinks he has a route to get this enabled so that we can get more logging on this front.

kmjung commented 3 years ago

Thanks for the update!

asiunov commented 3 years ago

Alright, we now have a way to get debug gRPC logs in case we need them. We ran multiple jobs with and without gRPC logs, and the jobs completed. So it seems like something has changed on the BQ side.

kmjung commented 3 years ago

@asiunov to clarify, is the fact that the jobs have completed a sign that things appear to have returned to the previous state?

asiunov commented 3 years ago

If by the previous state you mean the state before the issue, then yes, it is resolved for us. But I don't know whether you've fixed something or we were just lucky this time not to compete with other BQ users for server resources.

sheelc commented 3 years ago

Worth mentioning: now that we've caught up on 9 table pulls (9 days of lag since the issue first manifested), I have re-enabled the job to run daily and pull the new BQ table each day at 12 AM UTC. I can report back here if we see an issue, but hoping that there were some changes that have created stability (and it wasn't just that we were lucky).

kmjung commented 3 years ago

hoping that there were some changes that have created stability (and it wasn't just that we were lucky)

Yes, we've rolled back a change here on the BQ side, which I expect is the reason that normal service has returned for your project. We'll be rolling back the change for all projects later tonight.

sheelc commented 3 years ago

Excellent! That's great news. Thanks for all the help and for fixing this, @kmjung.

kmjung commented 3 years ago

@juliankeppel can you give this another try and see if your workload succeeds?

juliankeppel commented 3 years ago

@kmjung Yes, we will check next week and let you know as soon as possible whether it works.

josges commented 3 years ago

@kmjung Hi, I am working with @juliankeppel and just ran our workload successfully. Thank you for the help and the fix!

kmjung commented 3 years ago

Resolving, since it sounds like the issue has been addressed.

emkornfield commented 3 years ago

@sheelc we believe we have fixed the issue with the feature that caused this problem. We are planning to start a rollout soon. If you have a test project that we can enable first, to ensure it doesn't break you again, could you open a new case so we can coordinate on how to make that happen?

sheelc commented 3 years ago

@emkornfield awesome, thanks for reaching out. I'll open a new case for us to coordinate but yes, I think the customer would be amenable to us using a test pipeline that pulls the data.

emkornfield commented 3 years ago

@sheelc just wanted to ping you here, since the open case seems to have stalled.

sheelc commented 3 years ago

@emkornfield thanks for pinging here -- it seems I was somehow left off the support thread, and it stalled. I just replied there with the project ID and dataset ID we were using when we last ran into issues.

sheelc commented 3 years ago

@emkornfield We started experiencing this issue again consistently over the past 3 days. We currently have a job exhibiting the same behavior (~2k records/second). I've reopened the case, but wanted to ping here as well for faster visibility since it's affecting our customers again. Thanks for the help here.

emkornfield commented 3 years ago

@sheelc OK, I'll start the process of moving back to the old backend.

sheelc commented 3 years ago

@emkornfield Excellent, thank you!

emkornfield commented 3 years ago

@sheelc let's communicate more on the case. I think something else might be going on, but I need some more details.

sheelc commented 3 years ago

@emkornfield The case number that was reopened is 26674714