airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Source: Bigquery - Response too large #6533

Open harshithmullapudi opened 3 years ago

harshithmullapudi commented 3 years ago
## Environment

- OS Version / Instance: Ubuntu 18.04.2
- Memory / Disk: 16Gb
- Deployment: Docker
- Airbyte Version: 0.29.22-alpha
- Step: Initial Sync

## Current Behavior

I am trying to sync 30 GB of data from BigQuery (source) to Postgres. The connection fails with `com.google.cloud.bigquery.BigQueryException: Response too large to return. Consider specifying a destination table in your job configuration.`

## Slack conv

https://airbytehq.slack.com/archives/C01MFR03D5W/p1632910265432200

Based on https://cloud.google.com/bigquery/docs/error-messages:

```
responseTooLarge  403

This error returns when your query's results are larger than the maximum response size.
Some queries execute in multiple stages, and this error returns when any stage returns a
response size that is too large, even if the final result is smaller than the maximum.
This error commonly returns when queries use an ORDER BY clause. Adding a LIMIT clause
can sometimes help, or removing the ORDER BY clause. If you want to ensure that large
results can return, you can set the allowLargeResults property to true and specify a
destination table.
```


bayersglassey-zesty commented 2 years ago

We're running into this as well, and I believe the cause is the following:

My understanding is that a BigQuery source in Airbyte always uses this class: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-bigquery/src/main/java/io/airbyte/integrations/source/bigquery/BigQuerySource.java

...which always uses this class to actually make queries: https://github.com/airbytehq/airbyte/blob/master/airbyte-db/db-lib/src/main/java/io/airbyte/db/bigquery/BigQueryDatabase.java

...which always uses the old-school BigQuery API, which is not great for data transfer.

This page gives an overview of all of BigQuery's APIs: https://cloud.google.com/bigquery/docs/reference/libraries-overview

The first one is the old-school one I just mentioned. But there are others, in particular a fancy streaming one, the "BigQuery Storage API":

> This API exposes high throughput data reading for consumers who need to scan large volumes of managed data from their own applications and tools. The API supports a parallel mechanism of scanning storage and exposes support for leveraging features such as column projection and filtering.

Here is a link to its high-level docs: https://cloud.google.com/bigquery/docs/reference/storage

Here is a link to its low-level docs (in particular, showing basic Java usage): https://cloud.google.com/bigquery/docs/reference/storage/libraries
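To make that concrete, here is a rough sketch of reading a table through the Storage Read API's Java client (google-cloud-bigquerystorage), loosely following the samples in the docs linked above. The project and table names are placeholders, and decoding of the streamed Avro row blocks is left out:

```java
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;

public class StorageReadApiSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder identifiers -- substitute your own project, dataset and table.
    String parent = "projects/my-project";
    String table = "projects/my-project/datasets/my_dataset/tables/my_table";

    try (BigQueryReadClient client = BigQueryReadClient.create()) {
      // A read session describes what to read and in which wire format.
      ReadSession.Builder sessionBuilder = ReadSession.newBuilder()
          .setTable(table)
          .setDataFormat(DataFormat.AVRO);

      CreateReadSessionRequest request = CreateReadSessionRequest.newBuilder()
          .setParent(parent)
          .setReadSession(sessionBuilder)
          .setMaxStreamCount(1) // one stream keeps the sketch simple; more allows parallel reads
          .build();
      ReadSession session = client.createReadSession(request);

      // Stream row blocks from the single read stream. Decoding the Avro payload
      // (see session.getAvroSchema()) into records is omitted here.
      ReadRowsRequest readRowsRequest = ReadRowsRequest.newBuilder()
          .setReadStream(session.getStreams(0).getName())
          .build();
      for (ReadRowsResponse response : client.readRowsCallable().call(readRowsRequest)) {
        System.out.println("received " + response.getRowCount() + " rows");
      }
    }
  }
}
```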

Perhaps Airbyte's BigQuerySource class could be made configurable as to whether it uses the old-school BigQuery API (i.e. our existing BigQueryDatabase class) or the "BigQuery Storage API" (for which we would need to write a new class).

Alternatively (or in addition to that), perhaps we could make Airbyte's existing BigQueryDatabase class more configurable.

This ticket already mentions this (which I think comes from some BigQuery documentation):

> you can set the allowLargeResults property to true and specify a destination table.

Well, it would be nice if we could configure those. At the moment, Airbyte's BigQueryDatabase class configures its query using this method:

```java
  public QueryJobConfiguration getQueryConfig(final String sql, final List<QueryParameterValue> params) {
    return QueryJobConfiguration
        .newBuilder(sql)
        .setUseLegacySql(false)
        .setPositionalParameters(params)
        .build();
  }
```

...where the QueryJobConfiguration class (and its builder class, QueryJobConfiguration.Builder) comes from Google's BigQuery Java client library (com.google.cloud.bigquery).

…and that builder has some methods very relevant to our issue, namely setAllowLargeResults and setDestinationTable.

…however, Airbyte's BigQueryDatabase#getQueryConfig method currently does not call either of them, nor does it give its caller any way to have them called, because it returns the result of the builder's .build() call rather than the builder itself!
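For illustration, here is a rough sketch (hypothetical, not current Airbyte code) of what a more configurable variant of that method could look like if those two options were passed through; the extra parameter names are made up. Note that per Google's docs, allowLargeResults only applies to legacy SQL, while for standard SQL it is specifying a destination table that permits arbitrarily large results:

```java
// Hypothetical sketch only -- not the current Airbyte implementation.
// Requires com.google.cloud.bigquery.TableId in addition to the class's existing imports.
public QueryJobConfiguration getQueryConfig(final String sql,
                                            final List<QueryParameterValue> params,
                                            final boolean allowLargeResults,
                                            final TableId destinationTable) {
  final QueryJobConfiguration.Builder builder = QueryJobConfiguration
      .newBuilder(sql)
      .setUseLegacySql(false)
      .setPositionalParameters(params);
  if (destinationTable != null) {
    // With standard SQL, writing results to an explicit destination table is what
    // allows arbitrarily large result sets.
    builder.setDestinationTable(destinationTable);
  }
  if (allowLargeResults) {
    // setAllowLargeResults is only honored for legacy SQL queries.
    builder.setAllowLargeResults(true);
  }
  return builder.build();
}
```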

So, I don't know enough about Airbyte's code yet to know how configuration options for sources can be added. But it seems to me the low-hanging fruit for this ticket would be to add options for those two settings; a stretch goal would then be to implement a new database class which uses the fancy new "BigQuery Storage API".

Edit: actually, I'm not sure if setting a destination table would really be useful, since the whole point of Airbyte is not to shuffle data around inside BigQuery (or any other single database), but rather to transfer data between databases... So I think maybe looking at that "BigQuery Storage API" is the only reasonable approach to solving this ticket.

drgsec1234 commented 1 year ago

Hi, we have the same issue. Since this is not something that can be changed by default in BQ, I think the solution is to use the BigQuery Storage API. Is there any progress on this issue?

noah-abramson commented 1 year ago

Hi - same issue. Would love some 👀 on this issue!

bayersglassey-zesty commented 1 year ago

In my comment above, I mentioned Airbyte currently seems to be using BigQuery's "legacy streaming API", and that there is a newer "Storage API" (here is its "Write API": https://cloud.google.com/bigquery/docs/write-api) which lifts many limitations of the legacy API.

Another workaround we've used in our code which transfers large amounts of data into BigQuery is to first store the data in GCS as CSV, and then use a QueryJob to get it into BigQuery. This works because query jobs do not have the same limitations as the legacy streaming API.

Google has a page explaining how to load from a CSV in GCS into BQ: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv

A quick search seems to show that Airbyte already has support for writing to GCS: https://docs.airbyte.com/integrations/destinations/gcs/

...so perhaps a sort of wrapper Destination class could be written, which uses the existing GCS destination class, but then once all data is written, kicks off a QueryJob in BQ. Not sure if that kind of pattern is possible in Airbyte, though (load everything a row at a time, then kick off a separate third-party copy job at the end).
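For reference, a minimal sketch of that final GCS-to-BigQuery step with the Java client, following the CSV-loading docs linked above. It uses a load job (rather than a query job), and the project, dataset, table, and bucket names are placeholders:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class GcsCsvToBigQuery {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder identifiers -- substitute your own dataset, table, and bucket.
    TableId tableId = TableId.of("my_dataset", "my_table");
    String sourceUri = "gs://my-bucket/path/to/data.csv";

    // A load job reads the CSV directly from GCS and appends it to the table.
    LoadJobConfiguration loadConfig = LoadJobConfiguration.newBuilder(tableId, sourceUri)
        .setFormatOptions(FormatOptions.csv())
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
        .build();

    Job job = bigquery.create(JobInfo.of(loadConfig));
    job = job.waitFor(); // block until the load job completes
    if (job.getStatus().getError() != null) {
      throw new RuntimeException("Load failed: " + job.getStatus().getError());
    }
  }
}
```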

TimStrietzel commented 1 year ago

> Hi, we have the same issue. Since this is not something that can be changed by default in BQ, I think the solution is to use the BigQuery Storage API. Is there any progress on this issue?

Hello team, since we are also running into the same issue, I was wondering if this is part of the near-term roadmap? Can you give a rough guess as to when the new BQ API might be enabled in this regard?

Thank you very much!

edwardrpool commented 5 months ago

Hello All,

We are also running into this issue. Curious if this is going to be addressed in the near future?

Thank you

rafaelstalse commented 5 months ago

I'm having the same problem. Are there any updates on this issue?

gatsby003 commented 2 months ago

@bleonard Hey, I'm facing the same issue with the BigQuery source. I'm willing to contribute a fix for this issue; could you help me understand where I should begin? That would give me a head-start on the issue. Thanks!