killthekitten opened 1 week ago
Downgrading to 2.7.1 results in the same error.
@airbytehq/destinations can someone take a look into this issue? My guess is that during the upgrade, when discover schema ran, a table was deleted or lost permissions, but the state still exists. @killthekitten can you check the state message (Settings -> Advanced) and confirm that all streams/tables in there match the ones in the Replication tab?
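If the UI doesn't expose it, the per-stream state can also be read straight from the config database. A minimal sketch, assuming the standard public.state table (column names can differ between platform versions):
SELECT stream_name, namespace, state
FROM public.state
where connection_id = '<CONNECTION_ID>';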
tl;dr this sounds like a source bug, but try downgrading destination-bigquery to 2.6.3 as a workaround?
WorkerException: A stream status has been detected for a stream not present in the catalog
this sounds like the platform detecting that the source emitted an invalid stream status message (which is not the same as a state message). (relevant platform code here)
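(For context, a stream status is an AirbyteTraceMessage of type STREAM_STATUS that the source emits when a stream starts and finishes. Roughly this shape on the wire - a sketch based on the Airbyte protocol spec; exact fields may vary by protocol version:)
{
  "type": "TRACE",
  "trace": {
    "type": "STREAM_STATUS",
    "stream_status": {
      "stream_descriptor": { "name": "admin_notes", "namespace": "public" },
      "status": "COMPLETE"
    }
  }
}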
And I think the reason this works on the pre-DV2 destination-bigquery version is that the platform only triggers this validation when the destination supports refreshes (code). Destination BigQuery added support for refreshes in 2.7.0, so downgrading to an older version will mask the bug.
so @marcosmarxm I think it's worth tagging either the DB sources team to investigate source-postgres, or the platform team to check the platform code - not sure which one is more likely though :/
@edgao downgrade to 2.6.3 worked!
@marcosmarxm I can't find the state information anywhere (I looked under general settings, connection and connector settings). Could it be that the latest Airbyte doesn't show this info anymore? I'd already been looking for it earlier today.
@edgao it could very well be a source issue. Not sure how relevant this is, but we've been seeing a lot of these errors in logs:
https://github.com/airbytehq/airbyte/issues/17319
Is there anything else I could look up to help with investigation?
that sounds like an unrelated thing (platform logs out some record validation stuff, but afaik it's purely informational and doesn't actually affect sync success)
pinged the db sources team internally to take a look at this, stream statuses are a relatively new thing so it's possible that there's some rough edges in that region
The table in question is "public"."admin_notes". The error is happening in the context of an initial sync completing. We need more logging and context in order to reconstruct what went wrong here.
This looks like user cursor incremental sync (i.e no CDC or xmin)
I tried recreating a similar scenario with all the information we have so far. Not seeing the problem
@rodireich admin_notes showing up in logs before the error could be a coincidence, as this stream is first in the list when you sort alphabetically.
Is there any secure channel where I can send you the full logs? I could send both the successful and the failing sync logs. Also, I don't have access to the link you shared, so I can't confirm whether the setup matches ours.
This looks like user cursor incremental sync (i.e no CDC or xmin)
Correct, we use the cursor mode. That said, we perform a full rewrite on all streams
@killthekitten are you in our community slack? We could do the log transfer there
@evantahler messaged you just now
Thank you! Internal reference to the Slack thread for Airbyte folks looking for the logs
I am having the same issue with LinkedIn, PostHog and Hubspot after the upgrade to 2.8.0. Downgrading to 2.6.3 solved the problem. I had the same error in the logs about the stream status. Let me know if I can supply any additional information :)
io.airbyte.workers.exception.WorkerException: A stream status has been detected for a stream not present in the catalog
Hello @killthekitten and @PieterBos94, we have created a new release which changes the logged error to include the name of the missing stream. Would it be possible to upgrade your platform and re-run with version 2.8.0 of the BigQuery destination?
Thanks,
Would it be possible to send me the catalog associated with the connection?
The query to get it is: SELECT "catalog" FROM public."connection" where id = <CONNECTION_ID>;
Could you also let me know if there is a stream prefix set? It can be found in the Settings tab of the connection.
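(The prefix should also be visible in the same connection row - a sketch, assuming it is stored in the row's prefix column:)
SELECT prefix
FROM public."connection"
where id = <CONNECTION_ID>;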
@benmoriceau
Would it be possible to upgrade your platform and re-run with version 2.8.0 of the BigQuery destination?
I won't be able to test the new release on our current instance, unfortunately, at least not this week (please ping me again if needed in the future).
That said, I took the time to re-create our Airbyte installation in a new GCP project, and to my surprise, the sync didn't fail on 2.8.0. The only differences I can think of:
Would it be possible to send me the catalog associated with the connection?
Sure, I've exported the catalog data and passed it to @evantahler.
Could you also let me know if there is a stream prefix set? It can be found in the Settings tab of the connection.
There's no prefix
Thank you @killthekitten. Looking at the catalog, the schema looks OK. I would like to make sure that it is the catalog coming from the old deployment and not the new one.
What is happening is that version 2.8.0 of the BigQuery destination adds a new functionality named refreshes. This requires stream statuses to be forwarded to the destination, and when this functionality is activated, we validate that the stream statuses are valid. We have deployed this to our internal Airbyte deployment and we didn't run into any issues.
I am going to compare the format of the catalog that you sent with an example of our internal catalogs to see if I can spot anything that could explain the issue.
Thanks,
@benmoriceau confirmed, this is the catalog from the old deployment, with the destination currently downgraded to bigquery 2.6.3.
I can also make a dump of the fresh deployment's catalog (2.8.0) if that is helpful.
Could you briefly describe what a status is?
@killthekitten, I believe that I am starting to understand the issue. It seems to have the same root cause as https://github.com/airbytehq/airbyte/issues/39900. We have introduced new metadata in the connector definition. In order to propagate the metadata properly, the platform needs to be upgraded before the connector, so that the platform can read the metadata correctly.
To confirm this, could you let me know the result of the following query:
SELECT supports_refreshes
FROM public.actor_definition_version
where docker_repository = 'airbyte/destination-bigquery' and docker_image_tag = '2.8.0';
I would like to confirm my assumption, because I don't understand why it wasn't the same error as in the other tickets.
I am thinking about a way to fix this for old deployments that requires as few manual actions as possible.
@benmoriceau here you go:
airbyte=> SELECT supports_refreshes
FROM public.actor_definition_version
where docker_repository = 'airbyte/destination-bigquery' and docker_image_tag = '2.8.0';
supports_refreshes
--------------------
t
(1 row)
In order to propagate the metadata properly, the platform needs to be upgraded before the connector, so that the platform can read the metadata correctly.
FYI I saw the message in the BigQuery changelog and made sure the platform was running on 0.63.0 before I updated to BigQuery 2.8.0.
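(For anyone double-checking the upgrade ordering: the platform version is also recorded in the database - a sketch, assuming the standard airbyte_metadata table the jobs persistence writes to; on Docker deployments it usually lives in the same airbyte database:)
SELECT value
FROM public.airbyte_metadata
where key = 'airbyte_version';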
Thanks @killthekitten. It seems that the sequence of updates was right, which means that the functionalities supported by the destination were properly imported.
I am still looking for the root cause. In the meantime, I will make a fix so that the platform does not fail the sync.
Could you confirm that only the attempts were failing - if a new job is created, is it still failing? I wonder if we were using the old image tag in the config even though the connector had been updated. That was fixed here, but the fix was released in a platform version you are already running (release 0.63.1).
@benmoriceau sorry, not sure I understood the question! IIRC the moment we switched the connector version, all attempts started failing and/or crashing the instance; only one out of a dozen succeeded. Is that what you asked?
@killthekitten I am wondering whether the jobs keep failing after all X attempts have failed. The platform should then create a new job, and I am wondering what the status of that job is.
@benmoriceau our logs are a mess, as a bit of panic was involved and I don't remember how many times I cancelled jobs while switching the platform version, so I can't say which of the jobs were created after using up all attempts. Here is how it looked (from earliest to latest):
Thanks @killthekitten,
Is the error reported in the ticket from the first step?
Could you send me the result of the following query so I can see the failure reasons of all the attempts and make sure that we are going to address the right error:
SELECT j.id, j.config, j."status", a."status", a.failure_summary, a."output"
FROM public.jobs j
join attempts a on a.job_id = j.id
where j."scope" = '<YOUR_CONNECTION_ID>'
order by j.id desc
limit 50;
I am also on the community slack under the name of Benoit Moriceau (Airbyte)
Hello @killthekitten, would you be able to upgrade the platform, bump the version of the BigQuery destination to 2.8.0 or later, and re-run the sync?
I am not seeing anything in the query result which would point me to a root cause. Upgrading to the latest version of the platform will add the stream name to the failure message, which will help me check whether that stream is in the job's catalog.
Thanks,
I could try tomorrow afternoon!
Thanks!
On a side note, we have a related issue: every time the Airbyte instance goes down, any failed syncs are retried, but the retried attempt leaves the dataset in an inconsistent state. I believe this started when I rolled back 2.8.0.
Could this be related to the refresh functionality that I reverted? Should I file a new issue?
@killthekitten yes, it seems related to the refreshes functionality on the destination side. For some reason the destination only has the latest attempt's data. @stephane-airbyte FYI as destination OC.
@benmoriceau
Would you be able to upgrade the platform, bump the version of the BigQuery destination to 2.8.0 or later, and re-run the sync?
I upgraded the platform to 0.63.4 and the destination to 2.8.1, and the sync completed successfully. Here are the differences compared to the time it failed:
Could the issue have been caused by the fact that the sync was cancelled, and then 2.8.0 attempted to resume it?
@benmoriceau update:
A couple of syncs are not stable anymore (maybe a coincidence), so I'm reverting back to 2.6.3 once again. Would appreciate any hints about resolving the inconsistent state issues on 2.6.3.
If that is still relevant, I've found the "stream status has been detected" message in one of the logs:
io.airbyte.workers.exception.WorkerException: A stream status (public.transaction) has been detected for a stream not present in the catalog
at io.airbyte.workers.helper.StreamStatusCompletionTracker.track(StreamStatusCompletionTracker.kt:36) ~[io.airbyte-airbyte-commons-worker-0.63.4.jar:?]
at io.airbyte.workers.general.BufferedReplicationWorker.readFromSource(BufferedReplicationWorker.java:361) ~[io.airbyte-airbyte-commons-worker-0.63.4.jar:?]
at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsyncWithHeartbeatCheck$3(BufferedReplicationWorker.java:242) ~[io.airbyte-airbyte-commons-worker-0.63.4.jar:?]
at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
These are present only in some cases. The table naming is confusing in the logs: in some places I see public.{{stream_name}}, in others public_{{stream_name}}, and there's also a bunch of public:{{stream_name}}. I'd assume these are just formatting differences.
Another failure of this kind, but in a Stripe 3.17.4 <> BigQuery 1.2.20 sync:
io.airbyte.workers.exception.WorkerException: A stream status (public.transaction) has been detected for a stream not present in the catalog
at io.airbyte.workers.helper.StreamStatusCompletionTracker.track(StreamStatusCompletionTracker.kt:36) ~[io.airbyte-airbyte-commons-worker-0.63.4.jar:?]
at io.airbyte.workers.general.BufferedReplicationWorker.readFromSource(BufferedReplicationWorker.java:361) ~[io.airbyte-airbyte-commons-worker-0.63.4.jar:?]
at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsyncWithHeartbeatCheck$3(BufferedReplicationWorker.java:242) ~[io.airbyte-airbyte-commons-worker-0.63.4.jar:?]
at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
@killthekitten Thanks for the logs,
The different ways of writing the namespace and name are just formatting in the logs.
For this connection, is the stream transaction selected? It would also be helpful to get the catalog used in the job that fails with the error A stream status (public.transaction) has been detected for a stream not present in the catalog:
select config
from jobs
where id = <JOB_ID>;
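(If the full config blob is too large to share, the catalog part can be pulled out on its own - a sketch, assuming the sync job config is jsonb and nests the catalog under sync -> configuredAirbyteCatalog; verify the key names against your row first:)
SELECT config #> '{sync,configuredAirbyteCatalog}'
from jobs
where id = <JOB_ID>;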
The transaction stream is selected in this case, yes. What I've noticed is that it could be any of the streams - I've seen different streams failing in different jobs. I've shared the config in a DM.
@killthekitten for the thing about only having 54 records after two attempts - is that still happening? (if yes, can you post the logs? I'm on the airbyte slack, feel free to DM me)
@edgao I think that I have a fix for the missing stream in the catalog. I wonder if the missing data in the destination is a side effect of that, because we are not sending the COMPLETED status to the destination since the orchestrator failed.
@edgao yes, it is still an issue! logs sent
@killthekitten sorry if I missed this before, but I would like to confirm that you are running with Docker and without the orchestrator activated.
@benmoriceau correct, we run with Docker. What is an orchestrator? Pretty sure we don't run anything like that.
The orchestrator is a component which runs in its own pod and transfers data between the source and the destination. It is disabled by default on Docker but enabled by default on Kubernetes. There might be some differences in how the catalog is loaded in the different modes. That being said, I didn't manage to reproduce the issue locally with the same configuration as yours.
That's a pity. We did major version bumps and changed the stream configuration too many times in the last year, so it must be some broken state, either in the config DB or in airbyte_internal, that is hard to reproduce.
Connector Name
destination-bigquery
Connector Version
2.8.0
What step the error happened?
During the sync
Relevant information
Airbyte 0.63.2, 0.63.1; Postgres source 3.4.19; BigQuery destination 2.8.0
We upgraded Airbyte from 0.60.0 to 0.63.2 and switched to the latest versions of the BigQuery and Postgres connectors, and immediately afterwards one of our Postgres -> BigQuery connections started to fail.
There were a couple of times when the Airbyte instance became unavailable and logs were lost, but in all other cases, this is the error that appears in every log:
In the other Postgres -> BigQuery connections we don't see the same issue, but in all of them the destination is pinned to BigQuery v1.2.20 (i.e. predating Destinations V2). The failing one was updated from another V2 version that was a couple of months old.
The connection fails after the first attempt to read a stream:
Out of ~15-20 sync attempts, only one finished successfully.
Relevant log output
Sorry for providing very limited logs - I wouldn't want to expose these logs in public, but would be happy to share them outside of GitHub.