fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License
10 stars 3 forks source link

Optimize getting aggregated group/users data in API #1742

Open tcompa opened 1 week ago

tcompa commented 1 week ago

Given the LinkUserGroup table, we sometimes have to aggregate its full content based on one of its two columns (either group_id or user_id). We currently do it "by hand", as in

https://github.com/fractal-analytics-platform/fractal-server/blob/26a37067a5cbcb79d45f0540135ae17c8ef833aa/fractal_server/app/routes/auth/group.py#L29-L58

Can we implement this with groupby? See e.g. https://docs.sqlalchemy.org/en/20/core/selectable.html#sqlalchemy.sql.expression.Select.group_by

tcompa commented 1 week ago

We reviewed this with @mfranzon - see also #1745.

The bottom line is that we found some viable solutions, but we will only run full benchmarks when this becomes relevant in terms of performance. Notice that this is an admin-only endpoint, and that it will be some time before we hit large numbers of users and/or groups.


Option 1: itertools.groupby

See https://docs.python.org/3/library/itertools.html#itertools.groupby

Code like


    # Get all links, sorted by `group_id`
    stm_links = select(LinkUserGroup).order_by("group_id")
    res = await db.execute(stm_links)
    links = res.scalars().all()

    # Enrich group objects with `user_ids` attribute
    for ind, (group_id, group_elements_iterator) in enumerate(
        itertools.groupby(links, key=lambda _link: _link.group_id)
    ):
        if group_id != groups[ind].id:
            raise HTTPException(
                status_code=500,
                detail=(
                    f"Error while creating `user_ids` for {group_id=}, "
                    f"with {ind=} and {groups[ind]=}."
                ),
            )
        groups[ind] = dict(
            groups[ind].model_dump(),
            user_ids=[link.user_id for link in group_elements_iterator],
        )

Option 2: sql query

GROUP BY should be used together with an aggregation function. Possible options are e.g. the ones in https://docs.sqlalchemy.org/en/20/core/functions.html#selected-known-functions:

        from sqlalchemy import func
        from sqlalchemy.orm import join

        SEPARATOR = ","

        stm = (
            select(
                UserGroup,
                func.aggregate_strings(LinkUserGroup.user_id, SEPARATOR),
            )
            .select_from(
                join(
                    LinkUserGroup,
                    UserGroup,
                    LinkUserGroup.group_id == UserGroup.id,
                )
            )
            .group_by(LinkUserGroup.group_id)
            .order_by(UserGroup.id)
        )
        res = await db.execute(stm)
        enriched_groups = []
        for row in res.all(): # loop over groups
            group, user_ids_string = row[:]
            user_ids = [int(_id) for _id in user_ids_string.split(SEPARATOR)]
            enriched_groups.append(dict(group.model_dump(), user_ids=user_ids))
        return enriched_groups