airbytehq / airbyte


Source Mailchimp: `email activity` stream missing data #14673

Status: Closed (marcosmarxm closed 1 year ago)

marcosmarxm commented 2 years ago

This GitHub issue is synchronized with Zendesk:

Ticket ID: #1553 Priority: normal Group: User Success Engineer Assignee: Nataly Merezhuk

Original ticket description:

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu
  • Memory / Disk: you can use something like 4Gb / 1 Tb
  • Deployment: Kubernetes
  • Airbyte Version: 0.39.25
  • Source name/version: Mailchimp
  • Destination name/version: Json destination
  • Step: Run the Mailchimp source with the Json destination with more than 700 email events
  • Description:
    When I run the Mailchimp source to the JSON destination and compare the result with what the Mailchimp API returns, the output JSON is missing a lot of data. I spot-checked one campaign and found 61 events in the JSON output, whereas the Mailchimp email_activity endpoint returns around 707.
marcosmarxm commented 2 years ago

Comment made from Zendesk by Nataly Merezhuk on 2022-07-12 at 22:36:

Hello, @murph! Could you please show me the Airbyte/server logs so I can see if you are getting any errors?
marcosmarxm commented 2 years ago

Comment made from Zendesk by Marcos Marx on 2022-07-13 at 00:44:

We aren’t getting any errors, but we were able to debug this a bit. It looks like every time you paginate the email activity endpoint you’re also advancing the since param (airbyte/streams.py at bfa54aca50115770530ca6fdff24d4125541d23b · airbytehq/airbyte · GitHub). This happens via the cursor_field (airbyte/streams.py at bfa54aca50115770530ca6fdff24d4125541d23b · airbytehq/airbyte · GitHub), which is the timestamp of the newest record (airbyte/streams.py at bfa54aca50115770530ca6fdff24d4125541d23b · airbytehq/airbyte · GitHub).

That means that when we do an incremental sync we lose a lot of records. I don’t think this is the intended behavior?
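The record loss described above can be reproduced with a small simulation. This is only a sketch under assumptions, not the connector's code: `EVENTS`, `fetch_page`, and `sync_advancing_since` are hypothetical names, timestamps are plain integers, and the stand-in endpoint is assumed to return only records strictly newer than `since`, in an order that is not sorted by timestamp.

```python
from typing import Dict, List, Optional

Record = Dict[str, int]

# Hypothetical events in the order the server returns them (not sorted by timestamp).
EVENTS: List[Record] = [{"timestamp": t} for t in (5, 1, 9, 2, 7, 3, 8, 4, 6)]


def fetch_page(since: Optional[int], offset: int, count: int) -> List[Record]:
    """Stand-in for the endpoint: filter by `since`, keep server order, slice one page."""
    matching = [e for e in EVENTS if since is None or e["timestamp"] > since]
    return matching[offset:offset + count]


def sync_advancing_since(count: int = 3) -> List[Record]:
    """Pages with `offset` AND moves `since` to the newest timestamp seen so far."""
    since: Optional[int] = None
    offset = 0
    collected: List[Record] = []
    while True:
        page = fetch_page(since, offset, count)
        if not page:
            return collected
        collected.extend(page)
        # Assumed behavior under discussion: the cursor moves forward mid-pagination.
        since = max(e["timestamp"] for e in collected)
        offset += count


synced = sync_advancing_since()
print(f"synced {len(synced)} of {len(EVENTS)} events")
```

In this toy run the first page already contains the newest timestamp, so the second request filters out every remaining record and the sync stops after 3 of 9 events.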

marcosmarxm commented 2 years ago

Comment made from Zendesk by Marcos Marx on 2022-07-13 at 00:57:

Looks like you cannot sort what’s returned from the email activity endpoint, so this kind of checkpointing won’t work: https://mailchimp.com/developer/marketing/api/email-activity-reports/list-email-activity/

marcosmarxm commented 2 years ago

Comment made from Zendesk by Nataly Merezhuk on 2022-07-13 at 13:56:

Thanks for digging into this. You are right, this is definitely not the intended behavior. I've opened an issue on GitHub, and I or another team member will start work on this soon!
murphpdx commented 2 years ago

Hi, thanks for filing this issue for us. I just wanted to check in to see if we know when this issue will be prioritized?

marcosmarxm commented 2 years ago

Comment made from Zendesk by Marcos Marx on 2022-07-19 at 20:23:

Thank you for creating that issue! I just wanted to check in to see when the issue will be prioritized?

marcosmarxm commented 2 years ago

Comment made from Zendesk by Nataly Merezhuk on 2022-07-21 at 10:46:

@murph sorry for the wait, we have a few team members out this week. I asked one of my colleagues to set aside some time for this issue, so you'll be hearing something soon!
marcosmarxm commented 2 years ago

Comment made from Zendesk by Marcos Marx on 2022-08-25 at 05:13:

Hi just checking in on this, has there been any movement?

marcosmarxm commented 2 years ago

Comment made from Zendesk by Nataly Merezhuk on 2022-09-01 at 21:14:

Hi, Amanda! Thank you for your patience. No movement on this yet but I have a few debugging ideas.

Could you possibly update Airbyte to the latest version and try the sync once more? I have tried to replicate the issue on my end, but from what I can see the connector is working correctly: all records emitted by Mailchimp are being committed to JSON. 
marcosmarxm commented 2 years ago

Comment made from Zendesk by Marcos Marx on 2022-09-13 at 19:20:

Did you use an incremental sync?

marcosmarxm commented 2 years ago

Comment made from Zendesk by Marcos Marx on 2022-09-13 at 19:21:

I’m not sure why you need to debug more. The linked code treats the data as if it were sorted, but the API does not return sorted results. You’re also paginating in multiple ways at the same time.

davydov-d commented 1 year ago

Hey @marcosmarxm, could you please verify with the affected users whether this is still an issue after upgrading the connector to the latest version?

davydov-d commented 1 year ago

The problem described in https://discuss.airbyte.io/t/missing-mailchimp-email-activity-data/1830 must have been fixed in https://github.com/airbytehq/airbyte/pull/20765

murphpdx commented 1 year ago

> The problem described in https://discuss.airbyte.io/t/missing-mailchimp-email-activity-data/1830 must have been fixed in #20765

I'm not sure why this was closed. It looks like you still have the bug. It seems like you're still setting the since field to the timestamp: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-mailchimp/source_mailchimp/streams.py#L154

It looks like cursor_field is set to timestamp. I know you have a sort_field of create_time, but Mailchimp does not allow you to change the sorting, which means you're going to lose a lot of records. The since param should stay the same across pages; the offset is what should be used to paginate, advancing as offset = offset + pagesize. See https://mailchimp.com/developer/marketing/docs/methods-parameters/#pagination

As you can see from the list-email-activity docs, there is no sort_field, so I believe your sort variable is just getting ignored: https://mailchimp.com/developer/marketing/api/email-activity-reports/list-email-activity/
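The pagination suggested here (a fixed since, with only the offset advancing) could look roughly like the sketch below. It is an illustration under assumptions rather than a patch for the connector: `read_all`, `fake_fetch_page`, and `EVENTS` are hypothetical names, timestamps are plain integers, and the stand-in endpoint is assumed to return records strictly newer than `since` in unsorted order.

```python
from typing import Callable, Dict, Iterator, List, Optional

Record = Dict[str, int]
# Hypothetical signature: (since, offset, count) -> one page of records.
FetchPage = Callable[[Optional[int], int, int], List[Record]]


def read_all(fetch_page: FetchPage, since: Optional[int], count: int) -> Iterator[Record]:
    """Walk every page with a fixed `since`, advancing only `offset`."""
    offset = 0
    while True:
        page = fetch_page(since, offset, count)
        yield from page
        if len(page) < count:  # a short (or empty) page means the end was reached
            return
        offset += count  # offset = offset + pagesize


# Stand-in server data, deliberately not sorted by timestamp.
EVENTS: List[Record] = [{"timestamp": t} for t in (5, 1, 9, 2, 7, 3, 8, 4, 6)]


def fake_fetch_page(since: Optional[int], offset: int, count: int) -> List[Record]:
    """Stand-in for the endpoint: filter by `since`, keep server order, slice one page."""
    matching = [e for e in EVENTS if since is None or e["timestamp"] > since]
    return matching[offset:offset + count]


events = list(read_all(fake_fetch_page, since=None, count=3))
# The cursor for the next incremental sync is computed only after the full read.
next_since = max(e["timestamp"] for e in events)
print(f"read {len(events)} of {len(EVENTS)} events; next since = {next_since}")
```

Because since never changes while paging, every record is visited exactly once regardless of the order the server returns them in, and the cursor only moves once the whole read has finished.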

marcosmarxm commented 1 year ago

Comment made from Zendesk by Marcos Marx on 2023-04-03 at 23:11:

Closed due to no response from requester.