fasten-project / fasten

Analyse package dependency networks at the call graph level
https://www.fasten-project.eu
Apache License 2.0
92 stars 28 forks source link

REST API pagination #88

Closed marcomicera closed 4 years ago

marcomicera commented 4 years ago

This is how I see it:

Method Pros Cons
Limit-offset
  • Simple
  • Universal
  • Result inconsistency
  • Inefficacy for high offset values
Cursors
  • Universal
  • Result consistency
  • Not scalable
    • Memory
    • Locks
  • Stateful
Keyset pagination
  • Result consistency
  • Offset efficiency
  • Requires indexed columns :heavy_check_mark:
  • No random access (no offsets)

If we don't mind about inconsistency and performance for high offset values, I'd keep it simple and I'd go for the classic limit & offset approach. @gousiosg any take on this?

gousiosg commented 4 years ago

As our queries will be mostly reading from non-updating data, we don't need cursors as they will be complicating our lives in terms of session management (which we also don't need). If we can restrict our browsing on indexed columns, I 'd recommend we do keyset pagination and always return ordered results (which actually makes sense).

marcomicera commented 4 years ago

Alright, doable with JOOQ's seek() method.

Since we won't be having fixed-size pages, the limit and offset query params don't make much sense anymore. I see two ways:

  1. We get rid of them and use a unique query parameter, e.g., page. I can then emulate pagination by storing page boundaries (i.e., last page values of columns we order by) in a dedicated materialized view that we should then somehow keep updated (immediately with a trigger, deferred with a temporal trigger, or on-demand with a stored procedure).\ Example from that blog post:

    select
      x.value, 
      x.id,
      row_number() 
        over(order by x.value, x.id) + 1 page_number
    from (
      select
        t.id, 
        t.value,
        case row_number() 
             over(order by t.value, t.id) % 5 
          when 0 then 1 
          else 0 
        end page_boundary
      from t
      order by t.value, t.id
    ) x
    where x.page_boundary = 1

    This will then yield:

    | VALUE |  ID | PAGE_NUMBER |
    |-------|-----|-------------|
    |     0 | 786 |           2 |
    |     1 | 250 |           3 |
    |     1 | 959 |           4 |
    |     2 | 229 |           5 |
    |     2 | 533 |           6 | <-- Before page 6
    |     3 |  37 |           7 |
    |     3 | 768 |           8 |
  2. We keep them and use the aforementioned table to determine the page number given a certain offset value. Then we limit the result size with the LIMIT SQL clause.

Both of these solutions require these materialized views that contain page boundaries. Am I missing a simpler solution, maybe?

marcomicera commented 4 years ago
  1. We keep them and use the aforementioned table to determine the page number given a certain offset value. Then we limit the result size with the LIMIT SQL clause.

I'm doing this, but with page and limit as query parameters.

marcomicera commented 4 years ago

I was right in the middle of it when I realized that keyset pagination isn't really suited for our case as we're already filtering those entries for package names and versions all the time.

So for instance such a table:

create materialized view packages_page_limits as
select package_name,
       id,
       row_number()
       over (order by package_name, id) + 1 page_number
from (
         select id,
                package_name,
                case row_number()
                     over (order by package_name, id) % 100
                    when 0 then 1
                    else 0
                    end page_boundary
         from packages
         order by package_name
     ) as x
where x.page_boundary = 1;
select * from packages_page_limits limit 10;
                       package_name                       |    id    | page_number
----------------------------------------------------------+----------+-------------
 aero.m-click:mcpdf                                       |   526850 |           2
 ai.h2o:xgboost4j                                         |   592988 |           3
 ai.tock:tock-bot-engine-jackson                          |  8812137 |           4
 android.compatibility:android-support-v7-appcompat       | 20842521 |           5
 ant:ant-junit                                            |   298684 |           6
 app.myoss.cloud.boot:myoss-starter-web                   |  8548776 |           7
 at.bestsolution.eclipse:org.eclipse.fx.ui.animation      |  7573640 |           8
 at.lux:imageanalysis                                     |  7487539 |           9
 au.com.codeka:carrot                                     |    88521 |          10
 au.com.redboxresearchdata.fascinator:plugin-sso-rapidaaf |  5939684 |          11

would only help me when returning all packages, if I'm not wrong.

I'm switching back to the classic (yet inconsistent & inefficient) offset-limit.

gousiosg commented 4 years ago

Materialized views are not a great solution, as they will need to be updated every time we add a new package. Also, for large tables, like callables and edges they will just replicate the whole (humongous) table with just an extra column. In sort, let's just do page and limit without pre-computing stuff. This is what most APIs do, so it will be more familiar with devs as well.

gousiosg commented 4 years ago

The biggest package in the current database is com.groupdocs:groupdocs-signature, version 18.4 (371k callables). Postgres-based offset/limit pagination works nicely:

fasten_java=> \timing
fasten_java=> select c.fasten_uri
from package_versions pv join packages p on pv.package_id = p.id
  join modules m on m.package_version_id = pv.id
  join callables c on c.module_id = m.id
where
  c.is_internal_call is true
  and p.package_name = 'com.groupdocs:groupdocs-signature'
          AND pv.version = '18.4' limit 100 offset 45000;
Time: 365.154 ms

fasten_java=> select c.fasten_uri
from package_versions pv join packages p on pv.package_id = p.id
  join modules m on m.package_version_id = pv.id
  join callables c on c.module_id = m.id
where
  c.is_internal_call is true
  and p.package_name = 'com.groupdocs:groupdocs-signature'
          AND pv.version = '18.4' limit 100 offset 12000;
Time: 303.800 ms

I would avoid anything more sophisticated until there is a need for it.

gousiosg commented 4 years ago

It is indeed a bit slower if we order by callable.id, but not prohibitively so:

fasten_java=> select c.fasten_uri
from package_versions pv join packages p on pv.package_id = p.id
  join modules m on m.package_version_id = pv.id
  join callables c on c.module_id = m.id
where
  c.is_internal_call is true
  and p.package_name = 'com.groupdocs:groupdocs-signature'
          AND pv.version = '18.4'
order by c.id       
limit 100 offset 12000;
Time: 1477.363 ms (00:01.477)
marcomicera commented 4 years ago

I would avoid anything more sophisticated until there is a need for it.

Yeah, so we agree on this. I dropped the keyset pagination for the offset-limit one https://github.com/fasten-project/fasten/commit/d1b7acba0581acc0efe261bb4149ddbd8f49ef49. If you're ok with this, we can close this issue already.

@romatallinn you now have pagination! Same offset and limit query params as in the good old endpoints table. Should something change in the future, you'll be promptly informed!

marcomicera commented 4 years ago

@gousiosg can we close this?