at-gmbh / personio-py

a lightweight Personio API client
https://at-gmbh.github.io/personio-py
Apache License 2.0
26 stars 17 forks source link

requesting pages with pagination doesn't work #18

Closed syphar closed 3 years ago

syphar commented 3 years ago

we are replacing our internal implementation with this library, and stumbled onto a problem we have.

We are building a sync, so we have to request all records for a certain table, in this case the absence/time-off table.

So, to compare: we have around 27k records in our timeoff-table.

Now, the official way to do this would be:

p = Personio(
    client_id=xxx,
    client_secret=yyy,
   )
e = p.get_employees()
print(len(p.get_absences(e)))

This prints: 4922 records. (far too little)

When I drop down to the internal method request_paginated, what I get is:

p = Personio(
    client_id=xxx,
    client_secret=yyy,
   )
print(len(p.request_paginated("company/time-offs")["data"]))

Which gives me 200 records (only one page).

IMHO the pagination doesn't work right now, only the impact for the current users of this library is lower, because you always request only the records for a subset of the users, likely with a smaller db.

our workaround

I used the following implementation, using offset as if it was a page parameter (starting with 1, incrementing by 1):

def _request_paginated(
    api: Personio,
    path: str,
    limit=200,
) -> Iterator[Dict[str, Any]]:
    offset = 1

    while True:
        response = api.request_json(
            path,
            params={
                "limit": limit,
                "offset": offset,
            },
            method="GET",
            auth_rotation=True,
        )

        yield from response["data"]

        if response["metadata"]["current_page"] == response["metadata"]["total_pages"]:
            break
        else:
            offset += 1 

def get_absences(api: Personio) -> Iterator[Absence]:
    response = _request_paginated(api, "company/time-offs")

    for d in response:
        a = Absence.from_dict(d, api)
        yield a

With this code I get exactly the same records in the output like in total_elements (27918)

I'm happy to do a PR with the fixes, if you also can reproduce this. Or talk to personio support, if that's a bug on their side (it is at least in the documentation).

syphar commented 3 years ago

I decided I'll ask their support right now. Still I assume that the problem exists also for you

syphar commented 3 years ago

the documentation is really highly confusing on this API.

Pagination attribute to identify which page you are requesting, by the form of telling an offset from the first record that would be returned.

Which could mean that the argument should be which page you are requesting, but also it could mean an offset from the first record...

I wrote an email to them, let's see.

klamann commented 3 years ago

Hi @syphar, thanks for reporting this issue. I looked into this and I'm not sure if I never really tested this or if they've changed the API, but right now it looks like the offset does refer to the page and not to an offset from the first element. Also, the pages use a 1-based index, so we shouldn't start the offset at 0, because page 0 and page 1 return the exact same result (weird :roll_eyes:).

Furthermore, about your use case: If you want to get the absences of all employees and not just for a specific list of employees, then we can query the API directly without the employees filter, which should result in shorter request times. The same should work for the attendances endpoint. We should make the employees parameter optional for get_absences and get_attendances.

Also, I like that you're using generators in your implementation. Would you like to submit a PR?

edit: now that I had a closer look at it, I remember why I didn't use generators; my goal was to preserve the json structure of the response and "fake" the paginated response so that the structure remains the same. Rewriting this function so that it uses generators would also require a few architectural changes in the application. Not sure if it's worth the effort, but if we want it, we should handle this in a separate PR.

syphar commented 3 years ago

Hi @syphar, thanks for reporting this issue. I looked into this and I'm not sure if I never really tested this or if they've changed the API, but right now it looks like the offset does refer to the page and not to an offset from the first element. Also, the pages use a 1-based index, so we shouldn't start the offset at 0, because page 0 and page 1 return the exact same result (weird ๐Ÿ™„).

Let's see what they answer, in the past we had plenty of bugs on the API side, let's see if this is a documentation- or implementation bug. Then I'll open a PR.

Furthermore, about your use case: If you want to get the absences of all employees and not just for a specific list of employees, then we can query the API directly without the employees filter, which should result in shorter request times. The same should work for the attendances endpoint. We should make the employees parameter optional for get_absences and get_attendances.

That's what I do right now, making the parameter optional would be another PR, when I finished our implementation here.

Also, I like that you're using generators in your implementation. Would you like to submit a PR?

edit: now that I had a closer look at it, I remember why I didn't use generators; my goal was to preserve the json structure of the response and "fake" the paginated response so that the structure remains the same. Rewriting this function so that it uses generators would also require a few architectural changes in the application. Not sure if it's worth the effort, but if we want it, we should handle this in a separate PR.

IMHO for the amount of records we have in most personio systems, the memory usage is not a problem, so using generators here is more a style question. (same style question as probably using @dataclass or pydantic, which I use for most client libraries ๐Ÿ˜„ )

But switching to it in the return types would be a backwards incompatible API change, so one could argue it's not that important :)

klamann commented 3 years ago

Yeah, I haven't been using @dataclass so far because it used to be this fancy new thing in Python 3.7, but since we're at 3.9 now and personio-py actually requires Python 3.7 or later, it might be worth a shot :smile:. pydantic looks pretty awesome btw, wasn't aware of its existence.

Thanks for taking the time to contribute!

syphar commented 3 years ago

I just got an answer from personio: the offset parameter is actually a page parameter, starting from 1, increased by 1.

(totally confusing naming... )

Tomorrow or so I'll prepare a PR to fix this.