airbytehq / airbyte


[source-posthog] does not pull all `persons` for some projects #45128

Open absorbb opened 3 weeks ago

absorbb commented 3 weeks ago

Connector Name

source-posthog

Connector Version

1.1.9

What step the error happened?

During the sync

Relevant information

I believe a bug in the PostHog API (https://github.com/PostHog/posthog/issues/24787) makes it return fewer person records than the requested limit, even when more person records are available in the project.

For example, a request to https://app.posthog.com/api/projects/PROJECT_ID/persons?limit=100 might return only 98 records for a project with millions of persons.

It looks like DefaultPaginator stops loading further data in that case. I'm not sure what the proper workaround would be. Maybe always make an extra call with offset += count until no records are returned.
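
To illustrate the idea, here is a minimal Python sketch of that stop condition. It is not the connector's code and does not use the Airbyte CDK: `fetch_page` is a hypothetical callable standing in for the HTTP request, and `iter_all_records` is a name made up for this example. It assumes that advancing the offset by the number of records actually returned is correct for this endpoint, which I have not verified.

```python
from typing import Callable, Iterator, List


def iter_all_records(
    fetch_page: Callable[[int, int], List[dict]],
    limit: int = 100,
) -> Iterator[dict]:
    """Yield records page by page, stopping only on an empty page.

    fetch_page(offset, limit) is a hypothetical stand-in for the API call.
    A page shorter than `limit` is NOT treated as the last page.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            # Only an empty page ends the sync.
            break
        yield from page
        # Assumption: the next page starts right after the records received.
        offset += len(page)
```

The cost is one extra request at the end of every sync, in exchange for not stopping early on a short page.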

Relevant log output

No response

marcosmarxm commented 3 weeks ago

Thanks for reporting this issue. @natikgadzhi maybe this can be a good issue for community devs.

Twixes commented 3 weeks ago

Hi folks! It looks like the Airbyte<>PostHog integration uses the GET /projects/<project_id>/persons/ endpoint, which has the issue described in https://github.com/PostHog/posthog/issues/24787. The better way of doing this is the newer POST /projects/<project_id>/query/ with a body of:

{
  "kind": "ActorsQuery",
  "select": [
    "id",
    "created_at",
    "properties",
    // possibly other columns
  ],
  "limit": 100,
  "offset": n
}

That guarantees the number of entries returned is indeed 100 if there are more than 100 available. How can we get this change into Airbyte @marcosmarxm?
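
For anyone picking this up, a rough Python sketch of paging with that body could look like the following. The helper names (`build_actors_query`, `iter_actor_rows`) and the `send_query` callable are hypothetical: `send_query` stands in for whatever performs the POST and extracts the result rows, so the exact request wrapping, authentication, and response shape are left out and would need checking against the PostHog docs.

```python
from typing import Callable, Dict, Iterator, List


def build_actors_query(offset: int, limit: int = 100) -> Dict[str, object]:
    """Build the ActorsQuery body shown above for one page."""
    return {
        "kind": "ActorsQuery",
        "select": ["id", "created_at", "properties"],
        "limit": limit,
        "offset": offset,
    }


def iter_actor_rows(
    send_query: Callable[[Dict[str, object]], List[list]],
    limit: int = 100,
) -> Iterator[list]:
    """Yield rows page by page.

    send_query(body) is a hypothetical callable that posts the body and
    returns the rows. Assumes full pages are returned whenever more rows
    exist, so a short page marks the end.
    """
    offset = 0
    while True:
        rows = send_query(build_actors_query(offset, limit))
        yield from rows
        if len(rows) < limit:
            break
        offset += limit
```

Because the page size is reliable here, a short page can safely end the loop, which is the stop condition the current paginator already relies on.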

marcosmarxm commented 3 weeks ago

@Twixes, you're welcome to contribute and make the changes. If you have questions about how to contribute, please contact me on Slack.