harshithmullapudi opened 3 years ago
We're running into this as well, and I believe the issue is that:
My understanding is that a BigQuery source in Airbyte always uses this class: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-bigquery/src/main/java/io/airbyte/integrations/source/bigquery/BigQuerySource.java
...which always uses this class to actually make queries: https://github.com/airbytehq/airbyte/blob/master/airbyte-db/db-lib/src/main/java/io/airbyte/db/bigquery/BigQueryDatabase.java
...which always uses the old-school BigQuery API, which is not great for data transfer.
This page gives an overview of all of BigQuery's APIs: https://cloud.google.com/bigquery/docs/reference/libraries-overview
The first one is the old-school one I just mentioned. But there are others, in particular a fancy streaming one, the "BigQuery Storage API":
> This API exposes high throughput data reading for consumers who need to scan large volumes of managed data from their own applications and tools. The API supports a parallel mechanism of scanning storage and exposes support for leveraging features such as column projections and filtering.
Here is a link to its high-level docs: https://cloud.google.com/bigquery/docs/reference/storage
Here is a link to its low-level docs (in particular, showing basic Java usage): https://cloud.google.com/bigquery/docs/reference/storage/libraries
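To make the comparison concrete, here is a minimal sketch of what reading a table through the Storage API looks like with its Java client (google-cloud-bigquerystorage), based on the basic usage those docs show. The project and table identifiers are placeholders, and a real reader would decode the Avro payload rather than just counting rows:

```java
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;

public class StorageReadSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder identifiers; substitute a real project and table.
    String parent = "projects/my-project";
    String table = "projects/my-project/datasets/my_dataset/tables/my_table";

    try (BigQueryReadClient client = BigQueryReadClient.create()) {
      // A read session pins a snapshot of the table and says how rows
      // should be encoded on the wire (Avro here; Arrow also works).
      ReadSession session = ReadSession.newBuilder()
          .setTable(table)
          .setDataFormat(DataFormat.AVRO)
          .build();

      ReadSession created = client.createReadSession(
          CreateReadSessionRequest.newBuilder()
              .setParent(parent)
              .setReadSession(session)
              .setMaxStreamCount(1) // one stream keeps the example simple
              .build());

      // Stream the rows back; a real reader would decode the Avro payload
      // in each response instead of just counting rows.
      ReadRowsRequest rowsRequest = ReadRowsRequest.newBuilder()
          .setReadStream(created.getStreams(0).getName())
          .build();

      long rows = 0;
      for (ReadRowsResponse response : client.readRowsCallable().call(rowsRequest)) {
        rows += response.getRowCount();
      }
      System.out.println("Streamed " + rows + " rows");
    }
  }
}
```

The property that matters for this ticket is that rows are streamed straight out of managed storage (optionally across many parallel streams) instead of being funneled through a query job.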
Perhaps we could make Airbyte's BigQuerySource class configurable in whether it uses the old-school BigQuery API (i.e. our existing BigQueryDatabase class) or the "BigQuery Storage API" (for which we would need to write a new class).
Alternatively (or in addition), perhaps we could make Airbyte's existing BigQueryDatabase class more configurable.
This ticket already mentions this (which I think comes from some BigQuery documentation):
> you can set the allowLargeResults property to true and specify a destination table.
Well, it would be nice if we could configure those. At the moment, Airbyte's BigQueryDatabase class configures its query using this method:
```java
public QueryJobConfiguration getQueryConfig(final String sql, final List<QueryParameterValue> params) {
  return QueryJobConfiguration
      .newBuilder(sql)
      .setUseLegacySql(false)
      .setPositionalParameters(params)
      .build();
}
```
...where the QueryJobConfiguration class (and its builder class, QueryJobConfiguration.Builder) comes from Google's Java client library, and that builder has some methods very relevant to our issue:
```java
public QueryJobConfiguration.Builder setAllowLargeResults(Boolean allowLargeResults)
public QueryJobConfiguration.Builder setDestinationTable(TableId destinationTable)
```
…however, Airbyte's BigQueryDatabase#getQueryConfig method currently does not call those, nor does it give its caller any way to have them called, because it returns the result of the builder's .build() call!
So, I don't know enough about Airbyte's code yet to know how configuration options for sources are added. But it seems to me that the low-hanging fruit for this ticket would be to add options for those two settings, and that a stretch goal would then be to implement a new database class which uses the fancy new "BigQuery Storage API".
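To make that concrete, here is a rough sketch of what a more configurable getQueryConfig could look like (same imports as the existing method, plus com.google.cloud.bigquery.TableId). The extra parameters are purely illustrative and not part of Airbyte's current API; note also that, per the docs quoted above, allowLargeResults goes hand in hand with specifying a destination table:

```java
// Hypothetical variant of BigQueryDatabase#getQueryConfig; the extra
// parameters are illustrative, not Airbyte's actual API.
public QueryJobConfiguration getQueryConfig(final String sql,
                                            final List<QueryParameterValue> params,
                                            final boolean allowLargeResults,
                                            final TableId destinationTable) {
  final QueryJobConfiguration.Builder builder = QueryJobConfiguration
      .newBuilder(sql)
      .setUseLegacySql(false)
      .setPositionalParameters(params)
      .setAllowLargeResults(allowLargeResults);
  if (destinationTable != null) {
    builder.setDestinationTable(destinationTable);
  }
  return builder.build();
}
```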
Edit: actually, I'm not sure if setting a destination table would really be useful, since the whole point of Airbyte is not to shuffle data around inside BigQuery (or any other single database), but rather to transfer data between databases... So I think maybe looking at that "BigQuery Storage API" is the only reasonable approach to solving this ticket.
Hi, we have the same issue. Since this is not something that can be changed by default in BQ, I think the solution is to use the BigQuery Storage API. Is there any progress on this issue?
Hi - same issue. Would love some 👀 on this issue!
In my comment above, I mentioned Airbyte currently seems to be using BigQuery's "legacy streaming API", and that there is a newer "Storage API" (here is its "Write API": https://cloud.google.com/bigquery/docs/write-api) which lifts many limitations of the legacy API.
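For reference, here is a minimal sketch of what appending rows through the Write API's default stream looks like with the Java client's JsonStreamWriter (assuming google-cloud-bigquerystorage and org.json are on the classpath; the project, dataset, table, and schema are placeholders):

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableFieldSchema;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.TableSchema;
import org.json.JSONArray;
import org.json.JSONObject;

public class WriteApiSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder identifiers; substitute a real project, dataset, and table.
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    // Declare the destination table's schema so the writer can validate rows.
    TableSchema schema = TableSchema.newBuilder()
        .addFields(TableFieldSchema.newBuilder()
            .setName("id")
            .setType(TableFieldSchema.Type.INT64)
            .setMode(TableFieldSchema.Mode.NULLABLE)
            .build())
        .build();

    // Appending to the table name (rather than an explicit stream) uses the
    // Write API's "default stream", the closest analogue to legacy streaming.
    try (JsonStreamWriter writer =
        JsonStreamWriter.newBuilder(table.toString(), schema).build()) {
      JSONArray rows = new JSONArray();
      rows.put(new JSONObject().put("id", 1));
      ApiFuture<AppendRowsResponse> future = writer.append(rows);
      future.get(); // block until the append is acknowledged
    }
  }
}
```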
Another workaround we've used in our code that transfers large amounts of data into BigQuery is to first store the data in GCS as CSV, and then use a QueryJob to get it into BigQuery. This works because query jobs do not have the same limitations as the legacy streaming API. Google has a page explaining how to load a CSV from GCS into BQ: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv

A quick search seems to show that Airbyte already has support for writing to GCS: https://docs.airbyte.com/integrations/destinations/gcs/ ...so perhaps a sort of wrapper Destination class could be written, which uses the existing GCS destination class but, once all data is written, kicks off a QueryJob in BQ. Not sure if that kind of pattern is possible in Airbyte, though (load everything a row at a time, then kick off a separate third-party copy job at the end).
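For what it's worth, the Google page linked above drives this with a load job rather than an actual query job, which has the same effect for our purposes. A minimal sketch with the Java client (bucket, dataset, and table names are placeholders) looks like this:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class GcsCsvLoadSketch {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder identifiers; substitute real dataset, table, and bucket names.
    TableId tableId = TableId.of("my_dataset", "my_table");
    String gcsUri = "gs://my-bucket/path/to/data.csv";

    LoadJobConfiguration config = LoadJobConfiguration.newBuilder(tableId, gcsUri)
        .setFormatOptions(FormatOptions.csv())
        .setAutodetect(true) // real code would supply an explicit schema
        .build();

    // Load jobs are not subject to the legacy streaming API's limits.
    Job job = bigquery.create(JobInfo.of(config));
    job = job.waitFor(); // block until the load completes
    if (job.getStatus().getError() != null) {
      throw new RuntimeException(job.getStatus().getError().toString());
    }
  }
}
```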
Hello team, since we are also running into the same issue I was wondering if this is part of the near-term roadmap? Can you give a quick guess as to when the new BQ API might be enabled in this regard?
Thank you very much!
Hello All,
We are also running into this issue. Curious if this is going to be addressed in the near future?
Thank you
I'm having the same problem. Are there any updates on this?
@bleonard Hey, facing the same issue with the BigQuery source. I am willing to contribute a fix for this issue; could you help me understand where I should begin? That would give me a head-start on the issue! Thanks
Hello, facing the same issue. Any updates on this problem and how it can be resolved?