airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
15.96k stars 4.1k forks source link

🐛 Bug: Manual Per-Stream Resets are not working #27040

Closed alex-gron closed 11 months ago

alex-gron commented 1 year ago

I am using a Salesforce connection to BigQuery on Airbyte Cloud

When I manually trigger the reset from the Status page

When I trigger the reset modal by turning off the stream on the Replication tab:

AnilEMaharjan commented 1 year ago

Still seeing this issue in this job

gosusnp commented 1 year ago

When I trigger the reset modal by turning off the stream on the Replication tab:

  • I still have raw data in salesforce.airbyte_raw_Opportunity
    • This is surprising to me. Is this a bug?

I believe this was a choice to not delete data when disabling a stream. It left the user an option to keep the data while no longer syncing.

AnilEMaharjan commented 1 year ago

Part of the issue that I am running into is that I need to actually reset this Stream and get a full resync and when I click the Reset Stream there are 0 records processed

AnilEMaharjan commented 1 year ago

Still seeing this issue in this job

When I ran the above reset job, I got the following outcome:

image
davinchia commented 1 year ago

Sorry for the slow reply - Move is finally back on Platform work in Q4 so we can take a closer look here.

Reading through the thread,

alex-gron commented 1 year ago

@davinchia There is an internal thread in #dev-platform-compose where @malikdiarra and @nataliekwong are discussing, and have the latest information on when this does/doesn't work. I'll tag you in it.

davinchia commented 1 year ago

Relevant thread: https://airbytehq-team.slack.com/archives/C03A8CSANPQ/p1696872453718639?thread_ts=1696007933.907689&cid=C03A8CSANPQ

davinchia commented 12 months ago

Update: this turned out to be a destination issue. See https://airbytehq-team.slack.com/archives/C03C4AVJWG4/p1697823971406669

nataliekwong commented 12 months ago

It looks like the destination bug is now resolved. However, I tested this again with this connection and saw that the state was not cleared and the tables persisted after the reset succeeded. Here's the Loom (no audio) https://www.loom.com/share/08e91d5ac64d497e9b0a83130ab5b5db?sid=e388ce59-fd9b-4ce7-a47e-d03d72fdb5bc

Steps to reproduce:

Observed behavior:

davinchia commented 11 months ago

fyi: @jdpgrailsdev

jdpgrailsdev commented 11 months ago

@davinchia @nataliekwong Has the customer given us permission to run the reset? I would like to run one to see exactly what happens in the logs, as the most recent reset on that connection shows that a sync was executed. That would explain why the state appears to not have been reset. I do not see any reference in the logs to the empty source, which we should see if it actually ran the reset. Instead, it looks like we are just running a sync.

nataliekwong commented 11 months ago

@jdpgrailsdev I think if you're referring to this one, free to do anything on this workspace! https://cloud.airbyte.com/workspaces/56812376-bba8-4eb6-a6a1-df64ea889a5d/connections (it's an internal one).

There's an external customer (Preply) who has not given us permission yet, but I can ask if you were referencing that one.

jdpgrailsdev commented 11 months ago

https://cloud.airbyte.com/workspaces/56812376-bba8-4eb6-a6a1-df64ea889a5d/connections

@nataliekwong I was specifically talking about the one in your Loom recording, which I now realize is in our internal space. Thanks!

jdpgrailsdev commented 11 months ago

@davinchia @nataliekwong Something appears to be broken with resets. I did a local test using the same source configuration from @nataliekwong 's internal example above (Stripe source) sending to the /dev/null source. I can see the reset API being called and the code identifying the streams to reset. However, the worker never runs the reset. It runs a check, followed by a real replication, which is why the state never appears to be reset. My suspicion is something got removed/changed in the Temporal workflow code. I don't see how this is isolated to just our internal workspace or these sources/destinations, so I suspect this is a much broader issue. I am going to add some more debugging locally to see if I can determine what is actually happening.

nataliekwong commented 11 months ago

Thanks @jdpgrailsdev!! And just as an FYI: we're working on backfilling streams on the Compose side as a new feature this cycle. The backfill is done by doing a per-stream reset. We were doing testing, and found similarly that Postgres succeeded with the per-stream reset, whereas Hubspot kept failing. We were a bit puzzled as to why, and finally connected this morning that it likely is dependent on the same issue we see here. We're going to wait until we resolve this issue before we pick that project back up.

cc @flutra-osmani

jdpgrailsdev commented 11 months ago

@nataliekwong That would confirm my suspicion that it is related to the backfill logic added to the ReplicationActivityImpl, which is the bit in the sync workflow that actually kicks off the orchestrator.

jdpgrailsdev commented 11 months ago

Update: after testing both in the internal workspace and locally using the same source connector and configuration, I can confirm that the rest is happening. The reset removes the stream from the state table. However, the UI does not show the updated state unless you refresh the connection page. At that point, the UI displays an error message stating that the state JSON is invalid (because it is null/not present). The UI should probably be modified to handle the "no state" case better and/or note that you may need to refresh the page to see the most recent state after a reset or sync.

jdpgrailsdev commented 11 months ago

Another update: I believe that I have tracked this down. There is a feature flag in the ReplicationActivityImpl chooses between calling the replicate and replicateV2 method. Only the replicateV2 method updates the configured catalog to have the correct destinationSyncMode in the configured catalog used by the destination as part of the reset. For whatever reason, a full reset appears to have the correct destinationSyncMode despite not being modified by the ReplicationActivityImpl. The "Reset this stream" option does not, which is why the staging table in the BigQuery destination does not get cleared out. I cannot tell if this has always existed or whatever introduced the replicateV2 caused this (or something else). In any event, there appears to be a lot of variance between where and how the configured catalog is modified (or not modified) as part of a reset. The feature flag in question is platform.remove-large-sync-inputs and checks on the workspace ID for inclusion (it's actual an inverted check, so if off it uses replicate, if on, it uses replicateV2). Git blame seems to indicate a change made by @Michael Siega back in September, but again, I am unable to determine if this is when this started or if it pre-existed. The fact that the full reset works even though it also goes through the replicate method that does not modified the configured catalog seems to indicate a bigger, pre-existing issue.

alex-gron commented 11 months ago

@jdpgrailsdev I created this original issue in June, so I agree that it doesn't seem like this is new behavior.

flutra-osmani commented 11 months ago

@jdpgrailsdev "replicatev2" path is not yet enabled. I just enabled it for 1% now. It was also enabled for 2 workspaces Natalie was testing on the last couple of days. This new path is tied to the "large sync input" flag. What it does is fetch the catalog and state when needed, instead of relying on the sync input from the upstream workflow.

jdpgrailsdev commented 11 months ago

After a lot of additional digging, I have found the bug. The UI sets the namespace on the individual stream to reset to a blank string when the namespace of the stream is null:

        "resetConfig": {
            "streamsToReset": [
                {
                    "name": "checkout_sessions",
                    "namespace": ""
                }
            ]
        }

When the configured catalog is loaded by the backend, missing namespaces appear as Java null objects, not empty strings. The code that updates the configured catalog to prepare it for reset uses Java equality on the StreamDescriptor representation of the stream:

    public boolean equals(Object other) {
        if (other == this) {
            return true;
        } else if (!(other instanceof StreamDescriptor)) {
            return false;
        } else {
            StreamDescriptor rhs = (StreamDescriptor)other;
            return (this.name == rhs.name || this.name != null && this.name.equals(rhs.name)) && (this.namespace == rhs.namespace || this.namespace != null && this.namespace.equals(rhs.namespace)) && (this.additionalProperties == rhs.additionalProperties || this.additionalProperties != null && this.additionalProperties.equals(rhs.additionalProperties));
        }
    }

Because the UI passes in an empty string, the equality check fails (null != ""), and therefore the catalog is not updated to have the correct destinationSyncMode. Streams that have a non-null/non-blank namespace would not have this problem. Full resets do not have this problem because all streams are fetched on the backend when the reset request is made, therefore having the correct namespace for the Java equality check.

jdpgrailsdev commented 11 months ago

After discussing this with @chandlerprall, the approach will be to mark the namespace as optional in the ConnectionStream domain object in the API spec and stop passing the namespace parameter from the UI if it is blank.

chandlerprall commented 11 months ago

PR resolving the stream namespace issue has been merged. I'm leaving this issue open until it can be confirmed as fixed.

nataliekwong commented 11 months ago

I've tested and verified with two API connections, and this WORKS! 🙌 🙌

What I saw:

jdpgrailsdev commented 11 months ago

@nataliekwong May I close this ticket now that we have confirmed the fix?

nataliekwong commented 11 months ago

Yes! Please go ahead.

Ishankoradia commented 1 month ago

Hi @jdpgrailsdev @chandlerprall i am facing this issue in my oss version 0.50.44. Is this suppose to be fixed ? Can you please let me know if i should be upgrading the version?