animer3009 closed this issue 1 year ago.
I just noticed one more issue. The docs show that the latest version of the connector is 2.0.6 (https://docs.airbyte.com/integrations/destinations/bigquery/), but Airbyte does not show the new versions.
My previous question has been resolved.
Hi @animer3009! Thank you for reaching out and opening this issue. If I understood correctly, what you are seeing is the intended behavior. The new version of the BigQuery destination you are using includes Destinations V2, which you can learn more about here and here. You should now expect the following:
Raw tables now live in the airbyte_internal schema, as you discovered.
We now require the creation of final tables. There is no longer a mode where we only create raw tables. We strongly recommend you use the final tables over the raw tables, as we cannot guarantee the same level of stability for raw tables as for the final tables in your destination schema. You can learn more about this here.
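To make the new layout concrete, here is a rough sketch of how you could peek at both kinds of tables from Python. The project, dataset, and table names are placeholders, and the metadata columns shown are the standard V2 raw columns; adjust everything to your actual setup.

```python
# Sketch only: inspect the Destinations V2 layout in BigQuery.
# All project/dataset/table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

# Raw records: JSON payload plus Airbyte metadata, in the airbyte_internal dataset.
raw_sql = """
SELECT _airbyte_raw_id, _airbyte_extracted_at, _airbyte_data
FROM `my-project.airbyte_internal.my_schema_raw__stream_my_stream`
LIMIT 5
"""

# Final ("typed") table: real columns, in your configured destination dataset.
final_sql = """
SELECT *
FROM `my-project.my_schema.my_stream`
LIMIT 5
"""

for sql in (raw_sql, final_sql):
    for row in client.query(sql).result():
        print(dict(row))
```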
That said, let me know if you have any reservations about using the final tables instead of the raw tables. The team and I are all ears, and super interested in any additional feedback you might have to share.
Hi @Hesperide, thank you for the feedback and the links.
FYI: we use Airbyte API calls to create sources and connections. The destination stays the same; we just upgrade its version from time to time. We use Full refresh | Append to keep historical data.
1) That's right. I assume by "adjusted your downstream dashboards" you mean the consuming apps; we have already adjusted those. If you mean copying them into the new structure, that is also OK, as almost all of them have been copied. We just have a huge number of tables (we also keep data split by year), so doing this manually, i.e. dropping them, is almost impossible. 2) Yep. 3) No, we never had them (final = normalized, right?).
New raw tables are not in the destination schema, as you mentioned. I really do not understand why you removed the option to get just raw data. Why is that necessary? Honestly, we use the Amazon Seller Partner source connector, and from time to time Amazon changes key names or adds/removes keys. How will you handle that with flattened tables? Keep in mind that keeping raw + normalized data also means duplicated data.
We definitely need to keep our raw tables for a few reasons. One I already described above. Another is that raw data is convenient to parse dynamically, and we have already built our tooling around that logic.
So please keep the possibility of requesting just raw data. It gives flexibility.
Finally, how do we determine which versions are safe to upgrade to while keeping raw data integrity guaranteed? In any case, normalizing the tables also takes additional resources. Today the Airbyte server already got stuck because of high CPU usage, and we had to increase its resources.
It is easy to produce a flat table from raw data at any time, but not as easy to do it in reverse.
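For example, producing a flat view from a raw table is a single query over the JSON column. A rough sketch (the table name and the JSON paths are placeholders for our real report fields):

```python
# Sketch only: flatten a few fields out of a V2 raw table on the fly.
# The table name and JSON paths are placeholders; adjust them to the real stream.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  _airbyte_extracted_at,
  JSON_VALUE(_airbyte_data, '$.sellerSku')    AS seller_sku,    -- example field
  JSON_VALUE(_airbyte_data, '$.orderedUnits') AS ordered_units  -- example field
FROM `my-project.airbyte_internal.my_prefix_raw__stream_report`
WHERE DATE(_airbyte_extracted_at) >= '2023-09-01'
"""

for row in client.query(sql).result():
    print(dict(row))
```

Going the other way, from typed and possibly deduplicated columns back to the original payload, is not something a query can recover.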
@Hesperide What should we expect if we drop an existing normalized/final table? Will Airbyte recreate it from raw, including the full data from the start?
@Hesperide any feedback?
This is really important for us.
P.S. It looks like it starts additional containers for normalization, so it needs additional resources :(
@edgao @jbfbell @evantahler @cynthiaxyin Please guys say something :)
Hey @animer3009. Sorry for the late reply, and for the ongoing issues you are experiencing. A few more questions:
New raw tables are not in the destination schema, as you mentioned.
The raw tables on destination-bigquery version 2.0+ should all be in airbyte_internal. There should be no raw tables in any other schema. If this isn't the case for you, where are they being stored?
Hi @Hesperide,
I got answers about the issues described in this ticket, so now, to avoid any misunderstanding, here are my complaints:
We have a lot of old raw tables. First, I need to be sure that the new raw tables definitely include all of the old raw data (my checks show that they do), and we need some tool/option to start dropping the old ones. As I mentioned, we make API calls dynamically to create sources/connections, and our destination prefix includes some report variables and the year. As you can imagine, a year like 2022 will never appear in a prefix again, so Airbyte will never copy that data into a new raw table. This means we need to sort the old raw tables and delete them selectively. We have a really large number of them, so dropping them manually is impossible, but we can handle that ourselves (see the sketch after these points). It is OK; I just need to be sure the new raw tables definitely contain the old raw tables' data.
We use Full refresh + Append, so I am asking for support for, and a guarantee of, keeping raw data in all versions. As I described, we definitely need to use raw data. First, Amazon (the source) sometimes changes keys and we do not want to miss anything. Also, we have adapted to parsing raw data; for us it is convenient and acceptable.
We wish/ask to be able to avoid table normalization. First, normalization starts additional containers, so it needs additional resources; we already increased our resources by about half because Airbyte started crashing after the upgrade. Also, as I mentioned, we only need raw data, which saves time and resources, and the data is then not doubled across raw and normalized tables. From raw data it is always possible to normalize; the reverse is much harder, since the data there may already be deduplicated with logic that is not acceptable for us.
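To illustrate how we would handle that cleanup ourselves, roughly this kind of script is what we would run. It is only a sketch: the dataset name, the naming pattern, and the retired-year filter are placeholders for our real prefixes.

```python
# Sketch only: find and drop old raw tables whose prefix contains a retired year.
# The dataset (wherever the legacy raw tables live) and the naming pattern are
# placeholders for our real prefixes.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
dataset = "my-project.my_destination_schema"    # placeholder dataset
retired_year = "2022"

to_drop = [
    table.table_id
    for table in client.list_tables(dataset)
    if retired_year in table.table_id  # e.g. _airbyte_raw_my_prefix_2022_report
]

print("Would drop:", to_drop)

dry_run = True  # flip to False only after reviewing the list above

for table_id in to_drop:
    if not dry_run:
        client.delete_table(f"{dataset}.{table_id}", not_found_ok=True)
```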
To help you understand how important this is for us, I am providing a screenshot of what is currently running in our Airbyte. lol
Each one has its own source, with different variables and prefixes as well.
Here you can see how the upgrade affected CPU usage:
Thanks for all the feedback! I will review it with the team. To answer some of your questions:
airbyte_internal should already contain all of your old raw data, as well as the new data that has been synced since you upgraded to v2.0+.
@Hesperide Thank you for your feedback. We upgraded to platform 0.50.29 and destination 2.0.8.
Waiting for additional feedback.
Thank you.
@animer3009 - can you break that CPU chart down for us? Are the increases in CPU consumption coming from the platform, source, or destination containers? We think that Destinations V2 shouldn't have any additional CPU load, and if it does, that's a bug we should fix.
Hi @evantahler, after yesterday's upgrade CPU usage looks about the same; since we increased resources, the peaks just look lower.
These last two screenshots were captured within a short time of each other. Let me know if you would like me to run a specific command.
@animer3009 based on your screenshots it looks like the airbyte-worker container is consuming the most CPU. This is a platform container and is responsible for managing the source and destination containers. Is it possible that you now have more simultaneous syncs? My theory is that creating the final typed tables in v2 is a bit more work than only writing to the untyped tables in the airbyte_internal schema, so syncs take a bit longer, and if you're using a cron to schedule syncs the timing might overlap now. In your screenshot it looks like there are 5 active read and 5 active write containers, which I assume means there are 5 simultaneous syncs happening. This should be fine, but it would potentially explain the rise in CPU usage on the worker container.
Hi @jbfbell, no, there were fewer syncs than usual, and no, we do not use cron. We create a source, add a connection via API calls, and request the sync at the same time. Afterwards we monitor the job status and then delete the source and the connection. We have limits in the .env file, which is why:
MAX_SYNC_WORKERS=5
MAX_SPEC_WORKERS=5
MAX_CHECK_WORKERS=5
MAX_DISCOVER_WORKERS=5
MAX_NOTIFY_WORKERS=5
SYNC_JOB_MAX_ATTEMPTS=3
SYNC_JOB_MAX_TIMEOUT_DAYS=3
SYNC_JOB_INIT_RETRY_TIMEOUT_MINUTES=5
JOB_MAIN_CONTAINER_CPU_REQUEST=
JOB_MAIN_CONTAINER_CPU_LIMIT=1
JOB_MAIN_CONTAINER_MEMORY_REQUEST=
JOB_MAIN_CONTAINER_MEMORY_LIMIT=1g
TEMPORAL_HISTORY_RETENTION_IN_DAYS=7
I am just not sure why it is not honoring my config and sets the limit to 2GiB instead.
My complaint is exactly the same: you removed the option to skip normalization. Now it also uses extra (doubled) storage. I do not need any normalized objects; skipping them would save me time, space, and resources (leaving aside the other reasons).
Wouldn't it make sense to keep normalization as the default but add an option to skip the final tables entirely and keep only raw data?
@animer3009 thank you so much for your detailed feedback. I think this issue really comes down to two parts:
The vast majority of Airbyte users want final tables with real columns from the source(s). That said, Typing and Deduping should be faster than it currently is, and we are tracking improvements in https://github.com/airbytehq/airbyte/issues/30418 and https://github.com/airbytehq/airbyte/issues/28380. Since storage is so cheap, once we get the time and CPU cost down, we believe most Airbyte users will be satisfied. That said, T&D will always incur more warehouse cost than doing nothing and leaving the data in just the raw tables. We will bring that option back eventually, but we are focusing on the majority of users who want typed final tables first. I've added more information to the docs via https://github.com/airbytehq/airbyte/pull/30455, and describe our thinking a little more in #30454, which is where we will track adding back an option to skip typing and deduping.
As you are an OSS user, you always have the option to remain on V1 destinations until you are satisfied with the options V2 destinations provide.
@evantahler Thank you for your response and the resources. I will follow the new issue.
P.S. We have no duplicated data at all; from the beginning we built our process to avoid exactly that. For every new report we create a new source with different time ranges and options, use that source to create a connection, and sync the data. We use Full refresh | Append (there was no other way to keep historical data: Full refresh without append deletes old data, not all streams support incremental, and some incremental keys were not acceptable for our logic).
We can't really go back, as we have a lot of objects. The legacy raw tables have no new data; the new data is mixed into the new raw tables, so we would have to identify the new data and copy it over. Also, we want to keep our Airbyte up to date, and the new platform version does not support V1. To give you an idea of our situation, here is a screenshot:
So https://github.com/airbytehq/airbyte/issues/30454 would really help us (I was not able to access the Slack link; it looks like it is internal).
P.S. Going in the direction you describe, I really do not understand how you are going to handle data structure changes. It happens that the source renames a key (column name), or adds/removes keys. OK, adding is easy: you add a new column. Removing looks fine as well, since you can just ignore it. But what about a rename? It will look like a new key to you, so you will add a new column, but you will not be able to deduplicate it. If deduplication is based on a unique key, that may not be acceptable in some cases. So having a raw-only option was the best, I guess. Plus, as I understand it, Airbyte now knows where to continue normalization from. I am scared that, to save space, you will later start cleaning up the raw tables.
If deduplication is based on a unique key, that may not be acceptable in some cases.
Can you share more about this? Deduplication today is explicitly based on the primary key(s) as defined by the source. What else would you want to de-duplicate on?
I am scared that, to save space, you will later start cleaning up the raw tables.
This is the point - we may very well choose to remove raw tables. It is no longer part of the contract Airbyte makes with our users. This is the point of the final tables - those will always remain. If you do not want de-duplication and you do want record history, disable deduplication in your sync... but still use the final tables please.
@evantahler It looks like things have changed, so I will need to ask additional questions, and I will make some notes based on the Amazon Seller Partner and Amazon Ads source connectors.
In this case, I thought that deduplication would happen based on the cursor field. Because that is not acceptable for us and in some cases the cursor does not exist, we decided not to use Incremental | Append at all; instead we simulate incremental syncs by creating source connectors with different time ranges and variables.
That is good because, for example, the Amazon SP source connector does not define any PKs.
This means deduplication never happens. Right?
What about Full refresh | Append (which we use)? Theoretically deduplication should never happen here either, even if a PK exists. Or is it at the sync level (across multiple streams)? Right?
Now about normalization itself. I am not sure how the normalization process identifies columns. Does it use a predefined schema definition if one exists? For example, the Amazon SP and Amazon Ads source connectors have predefined/hard-coded schemas, which is not acceptable for us: first, the types do not look right; second, the schemas are not up to date and not all columns are included. As you may know, the Amazon report schema can also differ slightly depending on account type. So if normalization uses the predefined schema, it will be a big issue for us: we will miss some data, and Amazon limits how far back in time you can go (differently for various reports). Please keep in mind that the raw structure is independent of the predefined one.
I see this is not relevant any more, or is it only for destinations that do not support V2 yet? Right?
Since we can't use incremental and we create source connectors/connections for each sync, we can't use this option either. Right?
In our case, if the Amazon SP schema definition changes and we accidentally upgrade the source connector, how will normalization behave with the "Detect and propagate schema changes" option set to Ignore?
Pause is not an option for us, as we definitely need to get the report. And even if I get a schema change notification, I am not sure what I would need to do; just rename the old table, or?
Propagate column changes only: will this only add new columns? I mean, it will not remove old columns, right?
Propagate all changes: what does it do? Will it change the PKs as well and continue deduplication based on the new PK, or will it recreate the normalized table from raw based on the new PK?
Ok, let's continue.
It happens that a sync starts and the process breaks. In that case some data may already have been pulled.
To avoid keeping garbage (incomplete data) in the database (BigQuery), we automatically identify those rows and delete them (from the raw tables, as we do not use the normalized ones).
Let's discuss the case where stream 1 failed but pulled 3 rows, and stream 2 succeeded and pulled all 10 of its rows.
If the schema has no PKs, we may get duplicates in the normalized table as well.
In our case we do not care, as we identify and delete those rows (please keep in mind that Amazon SP has no PKs). How will the Airbyte/BigQuery destination connector behave in this case?
I mean, if we delete the rows from raw, normalization might be faster than us and the normalized table may already have received those rows.
And if we instead wait and delete them from the normalized table while the raw table keeps them, will they be copied over again?
I am trying to figure out whether we need to delete those rows from both objects.
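For clarity, this is roughly the kind of cleanup I am talking about if we do end up having to touch both objects. It is only a sketch: the table names are placeholders, and the extraction-time window stands in for our real logic that identifies rows pulled by a failed sync.

```python
# Sketch only: delete an incomplete batch from both the raw and the final table.
# Table names are placeholders; the extraction-time window stands in for our
# real detection logic for rows pulled by a failed sync.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

window = (
    "_airbyte_extracted_at BETWEEN "
    "TIMESTAMP '2023-09-20 10:00:00+00' AND TIMESTAMP '2023-09-20 10:30:00+00'"
)

raw_table = "`my-project.airbyte_internal.my_prefix_raw__stream_report`"
final_table = "`my-project.my_schema.my_prefix_report`"

for table in (raw_table, final_table):
    job = client.query(f"DELETE FROM {table} WHERE {window}")
    job.result()  # wait for the DML statement to finish
    print(table, "rows deleted:", job.num_dml_affected_rows)
```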
Or, if we now drop the normalized tables, how will the Airbyte/BigQuery connector behave?
I have one more question, but since the normalized schema structure is not yet clear to me, I will first wait for your response.
And once again, please, please, please implement the opt-out from https://github.com/airbytehq/airbyte/issues/30454. It will really help us, and a lot of other people, avoid a headache.
I guess I am the only idiot who upgrades live production servers, lol, so we hit these problems first. I personally know several people who use the raw data; they are still on V1 :)
P.S.
disable deduplication in your sync
I was not able to find this option.
I'm having a little trouble understanding your message, but maybe this information will help:
Regarding schema propagation, it seems like your use case is schema-sensitive, so I would suggest disabling autopropagation and periodically checking in the app for updates (or setting up notifications/webhooks). We are working towards giving specific options (cc @nataliekwong) for how autopropagation handles additions and deletions distinctly, but today autopropagation:
@evantahler While I figure out the rest of this, can you please share which lines I need to comment out to opt out of normalization? The goal is to build the BigQuery destination connector locally and use a dev version to avoid the normalization process. After that, I will delete the already-normalized tables to free up some space. I am asking because I am really not proficient in Java.
If you want to disable Typing and Deduping from a new destination, you'll need to update updateTable() in BigQuerySqlGenerator.java. Or, keep your eye on https://github.com/airbytehq/airbyte/issues/30454
Hi @evantahler Here are my changes for my local build.
https://github.com/animer3009/airbyte/commit/943ef2e7dd07f757258ce6a3676214908729ee09
I'm going to close this issue. I think you have what you need to change the destination to your liking, and we are working on https://github.com/airbytehq/airbyte/issues/30454 to allow folks to opt out of T&D.
Connector Name
destination-bigquery
Connector Version
2.0.3
What step the error happened?
During the sync
Relevant information
We used Airbyte 0.50.6 with several sources and one BigQuery destination, version 1.6.0 ...
After upgrading to Airbyte 0.50.26 with destination-bigquery 2.0.3 ...
We noticed that the destination's logic changed: it started moving existing objects' data to the airbyte_internal schema. That is OK! At first it was scary, but after realizing that it keeps the old objects and moves the data from the old to the new structure, we were happy. However, there is an issue: we request raw data as usual, but at the same time the BigQuery destination started normalizing objects in the old schema. So we now have the old schema with the old objects (and old data) plus new normalized objects (including new data), and at the same time we have the raw data in the new airbyte_internal schema.
So we do not request normalization, but it creates it anyway.
Relevant log output
No response
Contribute