Destination BigQuery - wrong behavior after upgrade

animer3009 commented 1 year ago

Connector Name

destination-bigquery

Connector Version

2.0.3

What step the error happened?

During the sync

Revelant information

We used Airbyte 0.50.6 with several sources and one destination bigquery 1.6.0 ...

After upgraded to Airbyte 0.50.26 with destination bigquery 2.0.3 ...

We noticed that logics are changed for destination. It started moving existing objects data to airbyte_internal schema. That is ok! For the beginning it was scary, but after realising that it keeps old objects and moves data from old to new structure we were happy. Just there is some issue: We request raw data as usual. But at the same time, destination BigQuery started object normalization in the old schema. So we have an old schema now with old objects (old data as well) and with new normalized objects (including new data). Same time we have raw data in the new airbyte_internal schema.

So we do not request normalization and it anyway creates it...

Screenshot at Sep 05 19-06-30

Relevant log output

No response

Contribute

[ ] Yes, I want to contribute

animer3009 commented 1 year ago

I just noticed one new issue. Like I see that top version of connector is 2.0.6 https://docs.airbyte.com/integrations/destinations/bigquery/?_gl=1*hl041d*_gcl_au*MTAzMTY1NzQ3OS4xNjkzOTI5NzU1

Airbyte do not show the new versions.

Screenshot at Sep 07 13-58-32

animer3009 commented 1 year ago

My last message was solved.

Hesperide commented 1 year ago

Hi @animer3009! Thank you for reaching out and opening this issue. If I understood right, what you are seeing is correct behavior. The new version of BigQuery you are using includes Destinations V2, which you can learn more about here and here. You should now expect the following:

Old raw tables: These are left untouched, but the data will no longer be kept up-to-date. We recommend dropping these tables once you've adjusted your downstream dashboards to the newly formatted data.
New raw tables: These are now in the airbyte_internal schema, as you discovered.
New final tables: These are in the destination namespace, and will have overwritten your previous final table (if you had one).

We now require the creation of final tables. There is no longer a mode where we only create raw tables. We strongly recommend you use final tables over the raw tables, as we cannot guarantee the same level of stability for raw tables as for final tables in your destination schema. You can learn more about this here.

That said, let me know if you have any reservations on using the final tables over the raw tables. The team and I are all ears, and super interested for any additional feedback you might have to share.

animer3009 commented 1 year ago

Hi @Hesperide Thank you for feedback and source info.

FYI - We use Airbyte API calls to create sources and connections. Destination we have the same, just time by time upgrade version. We use a full refresh append to keep historical data.

1) That's right. I guess in "adjusted your downstream dashboards" you mean endpoint apps. So we have already adjusted. If you mean to copy them in a new structure, that's also OK as almost all of them are copied. Just we have a huge amount of tables, we keep data by year as well... So doing this manually is almost impossible (I mean drop). 2) yep. 3) Nope, we never had them (final - normalized right?).

New raw tables are not in the destination schema as you mentioned. I really do not understand why you removed the possibility to have the option to get just raw data. Why is it necessary? Honestly we use an Amazon seller partner as a source connector. Time after time happens when amazon changes key names or add/remove keys... So how will you handle this in a flattened tables case? Keep in mind that keep raw + normalize is also duplicated data...

We definitely need to keep our raw tables for a few reasons. one I already described above. Also it is comfortably dynamically parsed data. We already have built our stuff based on that logic...

So please keep the possibility of requests just raw. It gives flexibility.

animer3009 commented 1 year ago

At last, how to determine which version will be safe for upgrade to keep guaranteed raw data integrity? But anyway, normalizing the table also takes additional resources... Today already stuck Airbyte server cuz over CPU usage and we increased resources for it...

It is easy anytime flat table from raw. But not as easy do it reverse.

animer3009 commented 1 year ago

@Hesperide What we must expect if we will drop existing normalized/final table? Airbyte will create it again from raw, including full data/data from the start?

animer3009 commented 1 year ago

@Hesperide any feedback?

This is really important for us.

animer3009 commented 1 year ago

P.s. looks lite it starts additional containers for normalization... So it need additional resources :(

animer3009 commented 1 year ago

@edgao @jbfbell @evantahler @cynthiaxyin Please guys say something :)

Hesperide commented 1 year ago

Hey @animer3009 . Sorry for the late reply, and for the ongoing issues you are experiencing. A few more questions:

New raw tables are not in the destination schema as you mentioned. The raw tables on destination-bigquery version 2.0+ should all be in airbyte_internal. There should be no raw tables in any other schema. If this isn't the case for you, where are they being stored?
There should be 2 tables created for each Airbyte stream. One raw table, and one typed + deduped table in the destination namespace. Are you seeing more than 2?
What version of the Airbyte platform are you running?

animer3009 commented 1 year ago

Hi @Hesperide,

Exactly, new raw tables are stored in airbyte_internal, normalized ones are in destination schema and old raw is in destination schema as well. Looks like a misunderstanding. I meant the same, just noted.
yeah, plus I see old raw tables in destination schema, but as you and documents describe, this is normal. As Airbyte will never touch old raw ones.
I have a version in the issue description. Platform version is 0.50.26 and Bigquery destination connector is 2.0.3 . I wish to upgrade the platform, including the destination connector. I am not sure if that is safe for me (as I wish to have new and old raw kept untouched - just get new data in new raw tables).

I got answers about issues described in this ticket, So now to avoid any misunderstanding I complain:

we have a lot of old raw tables, for first I need to be sure that new raw tables definitely include all old raw data (my checks shows that it is) and need some tool/option to start dropping them. As I mentioned we do API calls dynamically to create sources/connections. Our destination prefix includes some report variables and year. As you understand, for example 2022 year will never appear in the prefix, so airbyte will never copy data to the new raw table. This means that we need to sort them and selectively delete old raw tables. We have a really big amount of them. manually dropping is impossible. But anyway we can handle that. It is ok. Just need to be sure that new raw tables will definitely contain old raw tables data.
We use a full refresh + append. So asking that support/guarantee keep raw in all versions . As I described we definitely need to use raw. First, sometimes amazon (source) changes keys and we do not wish to miss something. Also we adapted to parse raw data. For us it is comfortable and acceptable.
We wish/ask to avoid table normalization. First normalization needs to start with additional containers, so it needs additional resources. We already added additional 1/2 resources, as airbyte started crashing after upgrade... also as I mentioned we need just raw data, it saves time and resources. Also data is not doubled in raw and normalized tables. From raw it is possible to normalize anytime. Reverse it is more hard to do, as data can be deduplicated there with non acceptable logics for us.

To help understand how important (for us) stuff I am talking about, I am providing a screen of currently running stuff in our Airbyte. lol Screenshot at Sep 13 20-16-07

Each has its own source with different variables and prefixes as well. Screenshot at Sep 13 20-43-54

animer3009 commented 1 year ago

Here you can see how upgrade effected on CPU resources: Screenshot at Sep 13 20-56-04

Hesperide commented 1 year ago

Thanks for all the feedback! Will be reviewing with the team. To some of your questions:

Both your platform and BigQuery connector versions appear fairly recent. You can upgrade your BigQuery connector to the latest version, as we've included some performance improvements in our most recent releases.
There are no more breaking changes since you updated to v2.0+. Upgrading should be safe. Raw tables in airbyte_internal should already contain all of your old raw data, as well as new data that has been synced since you upgraded to v2.0+.

animer3009 commented 1 year ago

@Hesperide Thank you for your feedback. We upgraded to platform 0.50.29 and destination 2.0.8.

Waiting for additional feedback.

Thank you.

evantahler commented 1 year ago

@animer3009 - can you break that CPU chart down for us? Are the increases of CPU consumption coming from the platform, source, or destination containers? We think that Destinations V2 shouldn't have any additional CPU load, and if does, that's a bug we should fix.

animer3009 commented 1 year ago

Hi @evantahler , After yesterday upgrade cpu usage looks the same. Just as resources increased, peaks look lower.

Screenshot at Sep 14 12-09-23 Screenshot at Sep 14 12-13-38 Screenshot at Sep 14 12-15-56 Screenshot at Sep 14 12-21-18 Screenshot at Sep 14 12-21-44 These last two are captured in a small time gap. LMK if you wish to run the specific command.

jbfbell commented 1 year ago

@animer3009 based on your screenshots it looks like the airbyte-worker container seems to be consuming the most CPU. This is a platform container and is responsible for managing the source and destination containers. Is it possible that you now have more simultaneous syncs? My theory being that if we're creating the final typed tables in v2 this is bit more work than writing to the untyped tables in the airbyte_internal schema and would take a bit longer than not creating them, and if you're using a cron to schedule syncs the timing might overlap now. In your screenshot it looks like there are 5 active read and 5 active write containers which I assume means there are 5 simultaneous syncs happening. This should be fine but would potentially explain the rise in CPU usage on the worker container.

animer3009 commented 1 year ago

Hi @jbfbell , Nope there was less syncs as usual. Nope we does not use cron. We create Source, add Connection via API call and request sync at the same time as well. After we monitor job status and delete source and connection as well. We have limit in .env file and that because:

MAX_SYNC_WORKERS=5 MAX_SPEC_WORKERS=5 MAX_CHECK_WORKERS=5 MAX_DISCOVER_WORKERS=5 MAX_NOTIFY_WORKERS=5

SYNC_JOB_MAX_ATTEMPTS=3 SYNC_JOB_MAX_TIMEOUT_DAYS=3 SYNC_JOB_INIT_RETRY_TIMEOUT_MINUTES=5 JOB_MAIN_CONTAINER_CPU_REQUEST= JOB_MAIN_CONTAINER_CPU_LIMIT=1 JOB_MAIN_CONTAINER_MEMORY_REQUEST= JOB_MAIN_CONTAINER_MEMORY_LIMIT=1g

TEMPORAL_HISTORY_RETENTION_IN_DAYS=7

Just not sure why it is not considering my config and sets limit 2GIB.

I am complaining exactly the same. You guys removed the option to not do normalization... Now it uses extra (doubled) data space as well... I do not need any normalized objects. This will save me time, space and resources (If not review other reasons)...

animer3009 commented 1 year ago

It will not make sense that default do normalization and add option to definitely not do final tables and keep as raw?

evantahler commented 1 year ago

@animer3009 thank you so much for your detailed feedback. I think this issue is confused into 2 parts:

The latest version of the platform is using more resources (CPU) than it used to. Can please make a new issue for that in the https://github.com/airbytehq/airbyte-platform repo. This repo is only about connectors, and I want to make sure that your issue doesn't get lost. Please include all the screenshots and logs from above.
You are requesting a raw-tables only sync. We can keep that converstation in this issue.

The vast majority of Airbyte users want final tables with real columns from the source(s). That said, at the moment, Typing and Deduping should be faster than it is, and we are tracking improvements in https://github.com/airbytehq/airbyte/issues/30418 and https://github.com/airbytehq/airbyte/issues/28380. Since storage is so cheap, once we get the time and CPU cost lower, we believe most Airbyte users will be satisfied. That said, T&D will always incur more warehouse cost than doing nothing, by leaving data in just the raw tables. We will bring that option back eventually, but we are focusing on the majority of users who want typed final tables first. I've added more information to the docs via https://github.com/airbytehq/airbyte/pull/30455, and describe our thinking a little more in #30454, which is where we will track adding back an option to skip making raw tables.

As you are an OSS user, you always have the option to remain on V1 destinations until you are satisfied with the options V2 destinations provide.

animer3009 commented 1 year ago

@evantahler Thank you for your response and resources. I will take on a new issue.

P.s. We have no duplicated data at all. For the beginning we build our process to avoid exactly that. For every new report we create a new source with different time ranges and options. We use a new source to create connection and sync data. We use a full refresh - append (as differently was not supported to keep historical data. Full refresh without appending deletes old data and not all syncs support incremental. Also some incremental keys were not acceptable for our logics).

We can't really go back. As we have a lot of objects. Legacy raw tables have no new data. New data is mixed in new raw tables. So we will need to identify new data and copy them... Also we wish to keep up to date our Airbyte and new version platform does not support V1. To imagine our situation I will provide some screen:

Screenshot at Sep 14 20-24-46

So https://github.com/airbytehq/airbyte/issues/30454 will really help us (I was not able to access slack. looks like it is internal).

p.s. I really do not understand, going with the described direction how you are going to handle data structure change. I mean, it happens when the source can get a key (column name) change. Or add/remove keys. Ok add is easy, you will add new. Remove looks normal as well, as you can just ignore that. What about change? It will look as new for you and you will add a new column, but you will not be able to deduplicate it. If deduplication is unique key based, it can be not acceptable in some cases. So having a raw option was the best I guess. Plus as I understand now Airbyte knows where to continue normalization. Scared that, to save space later you will start cleaning up raw tables.

evantahler commented 1 year ago

If deduplication is unique key based, it can be not acceptable in some cases.

Can you share more about this? Deduplication today is explicitly based on the primary key(s) as defined by the source. What else would you want to de-duplicate on?

Scared that, to save space later you will start cleaning up raw tables.

This is the point - we may very well choose to remove raw tables. It is no longer part of the contract Airbyte makes with our users. This is the point of the final tables - those will always remain. If you do not want de-duplication and you do want record history, disable deduplication in your sync... but still use the final tables please.

animer3009 commented 1 year ago

@evantahler Looks like stuff was changed, so I will need to ask additional questions and will make some notes based on Amazon Seller Partner and Amazon ads Source connectors...

In this case, I thought that deduplication would happen based on Cursor Field. Because it is not acceptable and in some cases does not exist, we decided not to use incremental - append stuff at all and we simulate incremental by creating source connectors, with different time ranges and variables. That is good, as for example Amazon SP source connector does not offer any PK-s. This means deduplication never happens. Right?

What about full refresh - append (we use that). Theoretically deduplication must never happen here as well. Even if PK will exist. Or it may be sync level (multiple streams). Right?

Now about normalization itself. I am not sure how the normalization process identifies columns. Does it use pre-defined schema definition if it exists? For example, in case of Amazon SP and Amazon ads source connectors predefined/hard coded schema exists. It is not acceptable for us. For the first, types do not look good. Second it is not up to date, not all columns included. You may know that amazon report schema can be slightly different based on account type as well. So if it uses predefined schema, it will be a big issue for us. We will miss some data and amazon has some limitations, how far you can go in past dates. It is different for various reports. Please have in mind that raw structure is independent from pre-defined one.

I see this is not relevant any more. Or just for destinations who do not support V2 yet. right? Screenshot at Sep 15 16-01-25

As we can't not use incremental and create source connectors/connections for each sync, we can't use this option as well. Right? Screenshot at Sep 15 16-04-39

In our case, if Amazon SP schema definition will be changed and we accidentally upgrade a source connector, how normalization will perform with option (Detect and propagate schema changes) Ignore? Pause is not an option for us. As we definitely need to get a report. Even if I will get a schema change notification, I am not sure what I need to do. Like just rename the old table OR? Propagate column changes only - will only add new columns? I mean will not remove old columns right? Propagate all changes - what does it do? It will change PK - s as well and will continue deduplication based on new PK? Or will it recreate a normalized table from the raw based on the new PK?

Ok, let's continue. Happens, when you will start to sync and process brakes. In that case you can get a little data already pulled. To not keep garbage (incomplete data) in the database (bigquery), We automatically identify those rows and delete them (from raw tables as we do not use normalized ones). Let's discuss the case when stream 1 failed but pulled 3 rows and stream 2 was successful and pulled fully 10 rows. If the schema will not have PK -s we may get duplicates in the normalized table as well. In our case we do not care, as we identify and delete those rows (please have in mind that Amazon SP has no PK-s). How will the Airbyte/BigQuery destination connector perform in this case? I mean, if we delete rows in raw, possibly it will be faster and a normalized table gets those rows. If we will wait and delete from normalized one and raw keep it, will it re copy those? I am trying to identify if we need to delete those rows from both objects... Or now if we will drop normalized tables How Airbyte/Bigquery connector will perform?

Next may I have one more question, but as for the normalized schema structure case for me it does not look clear, first I will wait for your response.

And once again, please please please implement that https://github.com/airbytehq/airbyte/issues/30454 opt-out. This will really help us and a lot of others avoid a headache.

I guess I am the only idiot who upgrades online production servers lol. So we faced these problems first I guess. I personally know several guys who use raw stuff. They are still on V1 :)

p.s.

disable deduplication in your sync

I was not able to find this option.

evantahler commented 1 year ago

I'm having a little trouble understanding your message, but maybe this information will help:

If you are not using de-duplication, Airbyte will not delete any records in either the raw or final tables
If your are using de-duplication:
- We use the primary key(s) to know which records should be de-duplicated. Some sources define this for you, and other source ask you to choose
- To know which record is the newest (e.g. the one we should keep), we use the value of the cursor. If a cursor is not available, or there is a tie, we use emitted_at
- Some sources define the cursor for you, and in other cases you are asked to choose one

evantahler commented 1 year ago

Regarding schema propagation, it seems like your use-case is schema-sensitive, so I would suggest disabling autopropagation, and periodically checking in the app if there are updates (or setting up notifications/weboohooks). We are working towards giving specific options (cc @nataliekwong) about how autopropagation handles additions and deletes distinctly, but today autopropagation:

Addtions make the new columns ASAP, but do not backfill old rows
Removals happen ASAP
Type-changes happen ASAP

animer3009 commented 1 year ago

@evantahler While I will figure out about this stuff, can you please share which lines I need to comment on to opt-out normalization? The goal is to locally build a bigquery destination connector and use dev version to avoid normalization process. After that I will delete already normalized tables to free up some space. I am really not proficient in java, cuz I am asking for that.

evantahler commented 1 year ago

If you want to disable Typing and Deduping from a new destination, you'll need to update updateTable() in BigQuerySqlGenerator.java. Or, keep your eye on https://github.com/airbytehq/airbyte/issues/30454

animer3009 commented 1 year ago

Hi @evantahler Here are my changes for my local build.

https://github.com/animer3009/airbyte/commit/943ef2e7dd07f757258ce6a3676214908729ee09

evantahler commented 1 year ago

I'm going to close this issue. I think you have what you ned to change the destination to your liking, and we are working on https://github.com/airbytehq/airbyte/issues/30454 to allow folks to opt-out of T&D

airbytehq / airbyte