PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License

perf: Avoid copying the request body more than once #2261

Open · steve-chavez opened this issue 2 years ago

steve-chavez commented 2 years ago

When running the memory tests:

$ nix-shell --arg memory true
$ postgrest-test-memory

The following profiling report is produced:

    Fri Apr 22 16:16 2022 Time and Allocation Profiling Report  (Final)

       postgrest +RTS -N -I2 -p -h -RTS

    total time  =        0.01 secs   (32 ticks @ 1000 us, 16 processors)
    total alloc =  22,870,568 bytes  (excludes profiling overheads)

COST CENTRE          MODULE                           SRC                                                      %time %alloc

MAIN                 MAIN                             <built-in>                                                37.5    1.2
text_strict          PostgreSQL.Binary.Decoding       library/PostgreSQL/Binary/Decoding.hs:(252,1)-(260,49)     9.4    6.2
pack                 PostgreSQL.Binary.Integral       library/PostgreSQL/Binary/Integral.hs:(11,1)-(12,52)       6.2    6.9
unknown              Hasql.Private.Encoders           library/Hasql/Private/Encoders.hs:294:1-76                 6.2    0.0
receive              Network.Wai.Handler.Warp.Recv    Network/Wai/Handler/Warp/Recv.hs:(103,1)-(117,69)          6.2    0.3
sendQueryPrepared    Database.PostgreSQL.LibPQ        src/Database/PostgreSQL/LibPQ.hsc:(1729,1)-(1750,40)       3.1   15.0
getvalue'            Database.PostgreSQL.LibPQ        src/Database/PostgreSQL/LibPQ.hsc:(1359,1)-(1367,54)       3.1    3.6
connectdb            Database.PostgreSQL.LibPQ        src/Database/PostgreSQL/LibPQ.hsc:(291,1)-(299,51)         3.1    0.0
decoder              Hasql.Private.Decoders.Value     library/Hasql/Private/Decoders/Value.hs:20:3-7             3.1    2.9
field                Hasql.Private.Decoders           library/Hasql/Private/Decoders.hs:(469,1)-(471,69)         3.1    8.6
dimension            Hasql.Private.Decoders           library/Hasql/Private/Decoders.hs:444:1-73                 3.1    1.6
createPool           Data.Pool                        Data/Pool.hs:(154,1)-(177,10)                              3.1    0.1
cors                 Network.Wai.Middleware.Cors      src/Network/Wai/Middleware/Cors.hs:(291,1)-(409,9)         3.1    0.0
recvRequest          Network.Wai.Handler.Warp.Request Network/Wai/Handler/Warp/Request.hs:(64,1)-(100,50)        3.1    0.0
normalizedBody       PostgREST.Query.SqlFragment      src/PostgREST/Query/SqlFragment.hs:(112,1)-(123,102)       3.1   14.8
logWithZTime         PostgREST.AppState               src/PostgREST/AppState.hs:(143,1)-(145,87)                 3.1    0.3
nullableValue        Hasql.Private.Encoders.Params    library/Hasql/Private/Encoders/Params.hs:(22,1)-(31,47)    0.0   14.9
value                Hasql.Private.Decoders.Row       library/Hasql/Private/Decoders/Row.hs:(42,3)-(55,33)       0.0    5.2
decode               Hasql.Private.Decoders.Row       library/Hasql/Private/Decoders/Row.hs:54:36-92             0.0    2.3
element              Hasql.Private.Decoders           library/Hasql/Private/Decoders.hs:(451,1)-(453,61)         0.0    1.8
composite            Hasql.Private.Decoders           library/Hasql/Private/Decoders.hs:411:1-69                 0.0    1.2
array                Hasql.Private.Decoders           library/Hasql/Private/Decoders.hs:372:1-57                 0.0    1.6
readPGRSTEnvironment PostgREST.Config                 src/PostgREST/Config.hs:(439,1)-(440,83)                   0.0    5.1

It seems the request body is copied three times: in sendQueryPrepared, normalizedBody and nullableValue. Since we don't do any processing on the body, ideally it would be copied just once and then sent to PostgreSQL.

steve-chavez commented 2 years ago

For normalizedBody, the clear culprit is toStrict:

https://github.com/PostgREST/postgrest/blob/902e4e8921779346adc19ff6690d03820be2b797/src/PostgREST/Query/SqlFragment.hs#L111-L123

We need toStrict there because Hasql requires a strict ByteString for its unknown encoder. Not sure why nullableValue and sendQueryPrepared copy the value again, though.
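As a minimal sketch (not PostgREST code) of why toStrict shows up in the allocation profile: converting a chunked lazy ByteString to a strict one necessarily allocates one fresh buffer of the full size and copies every chunk into it.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: LBS.toStrict is O(n); it allocates a single strict buffer and
-- copies all chunks of the lazy ByteString into it.
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS

main :: IO ()
main = do
  -- a request body arriving as several chunks, as Warp delivers it
  let body = LBS.fromChunks ["{\"a\":", "1}"]
  -- conceptually, toStrict concatenates the chunks into one new allocation
  let strict = BS.concat (LBS.toChunks body)
  print (strict == LBS.toStrict body)
```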

steve-chavez commented 2 years ago

nullableValue

Code path: nullableValue -> encodingBytes -> builderBytes -> unsafeCreate

unsafeCreate :: Int -> (Ptr Word8 -> IO ()) -> ByteString

A way of creating ByteStrings outside the IO monad. The Int argument gives the final size of the ByteString. Unlike createAndTrim, the ByteString is not reallocated if the final size is less than the estimated size.

So clearly unsafeCreate is doing the allocation here.
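To illustrate where that allocation happens, here is a tiny sketch using bytestring's internal unsafeCreate directly (the two poked bytes stand in for whatever the encoder's Builder renders):

```haskell
-- Sketch: unsafeCreate n f allocates a fresh n-byte buffer and runs f to
-- fill it; the encoder's rendered bytes end up copied into this allocation.
import qualified Data.ByteString.Char8 as BC
import Data.ByteString.Internal (unsafeCreate)
import Foreign.Storable (pokeByteOff)
import Data.Word (Word8)

main :: IO ()
main = do
  let bs = unsafeCreate 2 $ \p -> do
        pokeByteOff p 0 (104 :: Word8)  -- 'h'
        pokeByteOff p 1 (105 :: Word8)  -- 'i'
  BC.putStrLn bs
```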

sendQueryPrepared

sendQueryPrepared -> useAsCString

useAsCString (O(n) construction): Use a ByteString with a function requiring a null-terminated CString. The CString is a copy and will be freed automatically; it must not be stored or used after the subcomputation finishes.

So useAsCString allocates yet another copy of the body here.

This one is a bit of a mess. I think LinearTypes would help libraries guarantee this kind of extra copy doesn't happen.
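The copying behavior of useAsCString can be demonstrated directly (sketch, not PostgREST code): mutating the temporary null-terminated CString leaves the original ByteString intact, which proves the callback is handed a separate copy.

```haskell
-- Sketch: useAsCString copies the ByteString into a fresh, NUL-terminated
-- buffer; writes to that buffer do not affect the original.
import qualified Data.ByteString.Char8 as BC
import Data.ByteString (useAsCString)
import Foreign.Ptr (castPtr)
import Foreign.Storable (pokeByteOff)
import Data.Word (Word8)

main :: IO ()
main = do
  let original = BC.pack "abc"
  useAsCString original $ \cstr ->
    pokeByteOff (castPtr cstr) 0 (122 :: Word8)  -- overwrite first byte with 'z'
  BC.putStrLn original  -- still "abc": the CString was a separate copy
```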

robx commented 2 years ago

There are some options for making progress on this. The current best state is https://github.com/PostgREST/postgrest/pull/2349.

The postgresql-libpq change seems like a reasonable change to get merged upstream. Regarding hasql, it's a bit less clear how to fit our needs into the API nicely -- currently exploring this.

For the postgresql-libpq change in particular we could move ahead and build against a patched fork since the API doesn't change.

robx commented 2 years ago

Also let's note some options that go beyond these "easy" wins:

wolfgangwalther commented 2 years ago

Does "streaming the body" mean the database connection is taken from the pool and blocked once a http request starts and is still ongoing?

I think this could, while improving memory usage, decrease performance overall, because it would keep the database connections blocked for longer, saturating the pool faster.

I think there is value in making sure to receive a full request first and then sending it to the database at once. And that includes making a single copy of the request body, I assume.

robx commented 2 years ago

Does "streaming the body" mean the database connection is taken from the pool and blocked once a http request starts and is still ongoing?

Indeed. It's a good point, thanks, easy to get carried away when optimizing a single parameter :).

steve-chavez commented 2 years ago

For the postgresql-libpq change in particular we could move ahead and build against a patched fork since the API doesn't change.

Sounds good, we can use the fork for now.

We might be able to read the body into a strict bytestring directly. If we get a content-length header, we should be able to allocate the body buffer directly. That could tie in with the changes above, and avoid the copy from lazy bytestring to bytestring. Using libpq (the C library) is what requires us to copy the full body to memory at all.

Just to clarify. The above means that even with the optimizations in the current libraries, we'll always need one extra copy of the request body right?


Regarding the Content-Length header, it should work well if we can fall back to doing one copy when the header is not present.

The postgresql-pure idea looks even better. If it takes a while to do, perhaps we can do the Content-Length one (if it's simple) and work on postgresql-pure as a next step?
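A hypothetical sketch of the Content-Length idea (all names below are mine, not PostgREST's): given the declared length and a chunk producer in the style of Wai's getRequestBodyChunk, the final strict buffer can be allocated once up front and each incoming chunk copied straight into it, avoiding the lazy-ByteString intermediate and its toStrict copy. A real implementation would also have to validate the declared length and fall back to the existing path when the header is absent or wrong.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hypothetical sketch: read a request body of known size into one
-- preallocated strict ByteString, copying each chunk into place.
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Unsafe as BU
import Data.ByteString.Internal (create)
import Data.IORef
import Foreign.Ptr (plusPtr, castPtr)
import Foreign.Marshal.Utils (copyBytes)

-- getChunk stands in for Wai's getRequestBodyChunk: it yields "" at end of input.
readBodyStrict :: Int -> IO BS.ByteString -> IO BS.ByteString
readBodyStrict len getChunk =
  create len $ \dst -> do
    offRef <- newIORef 0
    let loop = do
          chunk <- getChunk
          if BS.null chunk
            then pure ()
            else do
              off <- readIORef offRef
              BU.unsafeUseAsCStringLen chunk $ \(src, n) ->
                copyBytes (dst `plusPtr` off) (castPtr src) n
              writeIORef offRef (off + BS.length chunk)
              loop
    loop

main :: IO ()
main = do
  -- simulate a body arriving in two chunks, followed by end-of-input
  chunksRef <- newIORef ["{\"a\":", "1}", ""]
  let getChunk = atomicModifyIORef' chunksRef $ \cs ->
        case cs of (c:rest) -> (rest, c); [] -> ([], "")
  body <- readBodyStrict 7 getChunk  -- 7 = declared Content-Length
  BC.putStrLn body
```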

robx commented 2 years ago

Just to clarify. The above means that even with the optimizations in the current libraries, we'll always need one extra copy of the request body right?

Yes, with the current changes, we always read the body to memory in chunks (a lazy bytestring), and need to copy that to a single chunk to pass on to libpq. (Which is why I'm confused by the numbers in the memory tests -- I'd expect that we need to keep both the lazy and the strict bytestring in memory at the same time, so a peak memory use of 2x body size.)

steve-chavez commented 1 year ago

Just tested v10 against the latest pre-release (v10.0.0.20221011, which includes https://github.com/PostgREST/postgrest/pull/2349) using this benchmark setup with this k6 script (a bulk POST), varying the number of virtual users (10, 50, 100).

v10

    data_received..............: 6.9 MB 216 kB/s
    data_sent..................: 309 MB 9.7 MB/s
  ✓ failed requests............: 0.00%  ✓ 0    ✗ 43073
    http_req_blocked...........: avg=6.98µs  min=1.23µs  med=2.27µs  max=2.71ms p(90)=3.65µs  p(95)=5.29µs 
    http_req_connecting........: avg=3.11µs  min=0s      med=0s      max=2.29ms p(90)=0s      p(95)=0s     
  ✓ http_req_duration..........: avg=6.08ms  min=2.45ms  med=5.84ms  max=1.77s  p(90)=8.23ms  p(95)=9.03ms 
    http_req_receiving.........: avg=58.51µs min=14.5µs  med=44µs    max=3.79ms p(90)=61.58µs p(95)=84.56µs
    http_req_sending...........: avg=55.05µs min=25.81µs med=51.35µs max=2.28ms p(90)=68.14µs p(95)=83.53µs
    http_req_tls_handshaking...: avg=0s      min=0s      med=0s      max=0s     p(90)=0s      p(95)=0s     
    http_req_waiting...........: avg=5.97ms  min=2.3ms   med=5.73ms  max=1.77s  p(90)=8.11ms  p(95)=8.92ms 
    http_reqs..................: 43074  1353.110185/s
    iteration_duration.........: avg=6.99ms  min=3.24ms  med=6.73ms  max=1.77s  p(90)=9.25ms  p(95)=10.1ms 
    iterations.................: 43073  1353.078771/s
    vus........................: 0      min=0  max=10 
    vus_max....................: 10     min=10 max=10 

    data_received..............: 8.1 MB 245 kB/s
    data_sent..................: 350 MB 11 MB/s
  ✓ failed requests............: 1.52%  ✓ 751  ✗ 48508
    http_req_blocked...........: avg=6.73µs  min=1.22µs  med=2.21µs  max=2.03ms   p(90)=3.61µs  p(95)=5.28µs 
    http_req_connecting........: avg=2.98µs  min=0s      med=0s      max=680.92µs p(90)=0s      p(95)=0s     
  ✓ http_req_duration..........: avg=29.6ms  min=3ms     med=26.27ms max=2.95s    p(90)=39.85ms p(95)=49.13ms
    http_req_receiving.........: avg=54.57µs min=18.04µs med=44.07µs max=5.79ms   p(90)=60.14µs p(95)=79.9µs 
    http_req_sending...........: avg=54.43µs min=28.01µs med=51.28µs max=4.87ms   p(90)=66.9µs  p(95)=80.96µs
    http_req_tls_handshaking...: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s     
    http_req_waiting...........: avg=29.49ms min=2.9ms   med=26.15ms max=2.95s    p(90)=39.74ms p(95)=49.02ms
    http_reqs..................: 49260  1491.34639/s
    iteration_duration.........: avg=30.51ms min=3.83ms  med=27.15ms max=2.95s    p(90)=40.79ms p(95)=50.16ms
    iterations.................: 49259  1491.316115/s
    vus........................: 0      min=0  max=50 
    vus_max....................: 50     min=50 max=50 

    data_received..............: 8.5 MB 226 kB/s
    data_sent..................: 367 MB 9.7 MB/s
  ✓ failed requests............: 2.15%  ✓ 1114  ✗ 50648
    http_req_blocked...........: avg=13.88µs min=1.22µs  med=2.2µs   max=9.44ms p(90)=3.54µs  p(95)=5.02µs 
    http_req_connecting........: avg=10.12µs min=0s      med=0s      max=9.28ms p(90)=0s      p(95)=0s     
  ✓ http_req_duration..........: avg=57.24ms min=3.65ms  med=42.55ms max=7.59s  p(90)=76.13ms p(95)=88.87ms
    http_req_receiving.........: avg=54.14µs min=15.75µs med=44.81µs max=8.56ms p(90)=60.7µs  p(95)=75.48µs
    http_req_sending...........: avg=52.55µs min=26.55µs med=50.99µs max=2.41ms p(90)=66.26µs p(95)=80.64µs
    http_req_tls_handshaking...: avg=0s      min=0s      med=0s      max=0s     p(90)=0s      p(95)=0s     
    http_req_waiting...........: avg=57.14ms min=3.56ms  med=42.44ms max=7.59s  p(90)=76.02ms p(95)=88.75ms
    http_reqs..................: 51763  1373.36774/s
    iteration_duration.........: avg=58.13ms min=4.5ms   med=43.46ms max=7.59s  p(90)=77.01ms p(95)=89.73ms
    iterations.................: 51762  1373.341208/s
    vus........................: 0      min=0   max=100
    vus_max....................: 100    min=100 max=100

v10.0.0.20221011

    data_received..............: 7.0 MB 217 kB/s
    data_sent..................: 313 MB 9.7 MB/s
  ✓ failed requests............: 0.00%  ✓ 0    ✗ 43681
    http_req_blocked...........: avg=7.12µs  min=1.23µs  med=2.27µs  max=3.24ms p(90)=3.66µs  p(95)=5.49µs 
    http_req_connecting........: avg=3.2µs   min=0s      med=0s      max=3.2ms  p(90)=0s      p(95)=0s     
  ✓ http_req_duration..........: avg=6ms     min=2.5ms   med=5.78ms  max=2.12s  p(90)=7.91ms  p(95)=8.67ms 
    http_req_receiving.........: avg=57.65µs min=16.92µs med=43.72µs max=4.34ms p(90)=60.55µs p(95)=83.78µs
    http_req_sending...........: avg=55.41µs min=24.83µs med=51.44µs max=1.8ms  p(90)=68.45µs p(95)=82.51µs
    http_req_tls_handshaking...: avg=0s      min=0s      med=0s      max=0s     p(90)=0s      p(95)=0s     
    http_req_waiting...........: avg=5.89ms  min=2.32ms  med=5.67ms  max=2.12s  p(90)=7.8ms   p(95)=8.55ms 
    http_reqs..................: 43682  1356.964725/s
    iteration_duration.........: avg=6.91ms  min=3.25ms  med=6.67ms  max=2.12s  p(90)=8.92ms  p(95)=9.79ms 
    iterations.................: 43681  1356.933661/s
    vus........................: 0      min=0  max=10 
    vus_max....................: 10     min=10 max=10 

    data_received..............: 8.1 MB 240 kB/s
    data_sent..................: 356 MB 11 MB/s
  ✓ failed requests............: 0.71%  ✓ 357  ✗ 49690
    http_req_blocked...........: avg=7.47µs  min=1.13µs  med=2.21µs  max=4.56ms p(90)=3.63µs  p(95)=5.61µs 
    http_req_connecting........: avg=3.17µs  min=0s      med=0s      max=3.03ms p(90)=0s      p(95)=0s     
  ✓ http_req_duration..........: avg=29.11ms min=4.96ms  med=26.57ms max=3.61s  p(90)=38.4ms  p(95)=46.64ms
    http_req_receiving.........: avg=55.68µs min=16.28µs med=43.85µs max=9.68ms p(90)=59.98µs p(95)=81.03µs
    http_req_sending...........: avg=54.21µs min=27.98µs med=51.28µs max=6.65ms p(90)=66.91µs p(95)=82.28µs
    http_req_tls_handshaking...: avg=0s      min=0s      med=0s      max=0s     p(90)=0s      p(95)=0s     
    http_req_waiting...........: avg=29ms    min=4.86ms  med=26.46ms max=3.61s  p(90)=38.28ms p(95)=46.52ms
    http_reqs..................: 50048  1485.597503/s
    iteration_duration.........: avg=30.04ms min=5.73ms  med=27.47ms max=3.61s  p(90)=39.35ms p(95)=47.57ms
    iterations.................: 50047  1485.567819/s
    vus........................: 0      min=0  max=50 
    vus_max....................: 50     min=50 max=50 

    data_received..............: 8.1 MB 233 kB/s
    data_sent..................: 353 MB 10 MB/s
  ✓ failed requests............: 1.24%  ✓ 620   ✗ 49134
    http_req_blocked...........: avg=12.85µs min=1.2µs   med=2.2µs   max=8.09ms  p(90)=3.61µs  p(95)=5.19µs 
    http_req_connecting........: avg=9.09µs  min=0s      med=0s      max=7.78ms  p(90)=0s      p(95)=0s     
  ✓ http_req_duration..........: avg=59.51ms min=3.16ms  med=44.94ms max=4.74s   p(90)=74.99ms p(95)=90.09ms
    http_req_receiving.........: avg=55.63µs min=19.82µs med=44.88µs max=29.96ms p(90)=60.25µs p(95)=77.31µs
    http_req_sending...........: avg=53.33µs min=29.16µs med=51.37µs max=7.62ms  p(90)=66.36µs p(95)=79.67µs
    http_req_tls_handshaking...: avg=0s      min=0s      med=0s      max=0s      p(90)=0s      p(95)=0s     
    http_req_waiting...........: avg=59.4ms  min=2.76ms  med=44.83ms max=4.74s   p(90)=74.89ms p(95)=89.94ms
    http_reqs..................: 49755  1427.870379/s
    iteration_duration.........: avg=60.43ms min=6.29ms  med=45.83ms max=4.74s   p(90)=75.89ms p(95)=91.07ms
    iterations.................: 49754  1427.84168/s
    vus........................: 0      min=0   max=100
    vus_max....................: 100    min=100 max=100

Looks like there's a slight improvement in the percentage of failed requests and in throughput.

Also saw lower %MEM in top while running the load tests: v10 reached 11.8% while v10.0.0.20221011 reached 9.8%.

steve-chavez commented 1 year ago

Using libpq (the C library) is what requires us to copy the full body to memory at all.

@robx Say if:

Would that allow us to have a single copy?