[Bug] Out of bounds memory access

rhuanbarreto commented 2 months ago

When running tests using pglite, i'm getting this error many times:

bun test v1.1.29 (6d43b366)

index.test.ts:

# Unhandled error between tests
-------------------------------
1 | import{b as Ge,d as se,e as Ce,f as We,g as X,h as G,i as ie,j as re}from"./chunk-SJVDOE3S.js";import{A as Q,B as Ae,C as $,D as Me,E as Y,F as Ve,G as Re,a as ze,b as je,c as ae,d as le,e as ce,f as de,g as ue,h as pe,i as fe,j as me,k as he,l as ye,m as ge,n as we,o as xe,p as be,q as C,r as ve,s as Ee,t as Pe,u as _e,v as Te,w as Fe,x as Se,y as W,z as De}from"./chunk-7M5G4I52.js";import"./chunk-TAZBUB4Z.js";import{a as ne,b as oe,c as Qe,d as $e}from"./chunk-3BCYDNSE.js";import{b as Le,c as qe,d as i,e as p,f,g as S,i as h}from"./chunk-Y3AVQXKT.js";h();h();h();var et=new Error("timeout while waiting for mutex to become available"),tt=new Error("mutex already locked"),Ke=new Error("request for lock canceled"),Xe=function(u,t,e,s){function d(r){return r instanceof e?r:new e(function(n){n(r)})}return new(e||(e=Promise))(function(r,n){function o(c){try{l(s.next(c))}catch(m){n(m)}}function a(c){try{l(s.throw(c))}catch(m){n(m)}}function l(c){c.done?r(c.value):d(c.value...

RuntimeError: Out of bounds memory access (evaluating 'this.mod._pg_initdb()')
      at postgres.wasm.wasm-function[sort_checkpoint_bufferids_med3]
      at postgres.wasm.wasm-function[sort_checkpoint_bufferids]
      at postgres.wasm.wasm-function[CheckPointBuffers]
      at postgres.wasm.wasm-function[CheckPointGuts]
      at postgres.wasm.wasm-function[CreateCheckPoint]
      at postgres.wasm.wasm-function[ShutdownXLOG]
      at postgres.wasm.wasm-function[pg_proc_exit]
      at postgres.wasm.wasm-function[pg_initdb]
      at /<redacted>/node_modules/@electric-sql/pglite/dist/index.js:1:10594
-------------------------------

I'm using PGlite v0.2.6 and Bun v1.1.29

This is flaky. Sometimes it works, sometimes doesn't. Any help on fixing this?

samwillis commented 1 month ago

Hey @rhuanbarreto

Thanks for the report. If you could narrow down the cause and produce a reproduction it would be super helpful.

If you want to dig in yourself, setting the debug level to 1 with PGlite.create({ debug: 1 }) will output a trace of what Postgres is doing internally. It usually gives a good indication of where these errors can occur.

rhuanbarreto commented 1 month ago

Upgraded to v0.2.8 and enabled debug. Got a new related error (still happens at initdb):

pglite: no db
# WARNING: program "postgres" is needed by initdb but was not found in the same directory as "/tmp/pglite/bin/initdb"
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

# 3442
The database cluster will be initialized with this locale configuration:
  provider:    libc
  LC_COLLATE:  C
  LC_CTYPE:    C.UTF-8
  LC_MESSAGES: C
  LC_MONETARY: C
  LC_NUMERIC:  C
  LC_TIME:     C
# 2651
# 2705
# 2706
# 3444
The default text search configuration will be set to "english".
# 3446

Data page checksums are disabled.

# 3458
creating directory /tmp/pglite/base ... ok
creating subdirectories ... ok
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
# 3461: TODO: fsync_pgdata ?
# FIXING: int shmget (key_t __key=738, size_t __size=40, int __shmflg=1920) pagesize default=65536
# FIXING: void *shmat (int __shmid=666, const void *__shmaddr=0, int __shmflg=0)
# 360: InitPostgres(boot): /home/runner/work/pglite/pglite/postgresql-16.4/src/backend/bootstrap/bootstrap.c
bun test v1.1.29 (6d43b366)

tests/facilities.test.ts:
popen failure: Function not implemented
# popen["/tmp/pglite/bin/postgres" --boot -X 1048576  -F -c log_checkpoints=false  ] (BOOT)
# pg_pclose(/tmp/initdb.boot.txt) 129:../../src/include/pg_config_os.h
# popen["/tmp/pglite/bin/postgres" --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true -c log_checkpoints=false  template1 >/dev/null] (SINGLE)
# pg_pclose(/tmp/initdb.single.txt) 129:../../src/include/pg_config_os.h
2024-09-30 10:33:13.246 GMT [42] DEBUG:  invoking IpcMemoryCreate(size=144228352)
2024-09-30 10:33:13.311 GMT [42] DEBUG:  dynamic shared memory system will support 224 segments
2024-09-30 10:33:13.341 GMT [42] DEBUG:  created dynamic shared memory control segment 3475055906 (5388 bytes)
2024-09-30 10:33:13.342 GMT [42] DEBUG:  transaction ID wrap limit is 2147483650, limited by database with OID 1
2024-09-30 10:33:13.342 GMT [42] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2024-09-30 10:33:13.342 GMT [42] DEBUG:  creating and filling new WAL file
2024-09-30 10:33:13.346 GMT [42] DEBUG:  done creating and filling new WAL file
2024-09-30 10:33:13.347 GMT [42] DEBUG:  InitPostgres
2024-09-30 10:33:13.347 GMT [42] NOTICE:  database system was shut down at 2024-09-30 10:33:13 GMT
2024-09-30 10:33:13.348 GMT [42] DEBUG:  checkpoint record is at 0/100028
2024-09-30 10:33:13.348 GMT [42] DEBUG:  redo record is at 0/100028; shutdown true
2024-09-30 10:33:13.348 GMT [42] DEBUG:  next transaction ID: 3; next OID: 10000
2024-09-30 10:33:13.348 GMT [42] DEBUG:  next MultiXactId: 1; next MultiXactOffset: 0
2024-09-30 10:33:13.348 GMT [42] DEBUG:  oldest unfrozen transaction ID: 3, in database 1
2024-09-30 10:33:13.348 GMT [42] DEBUG:  oldest MultiXactId: 1, in database 1
2024-09-30 10:33:13.348 GMT [42] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2024-09-30 10:33:13.348 GMT [42] DEBUG:  transaction ID wrap limit is 2147483650, limited by database with OID 1
2024-09-30 10:33:13.348 GMT [42] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2024-09-30 10:33:13.348 GMT [42] DEBUG:  starting up replication slots
2024-09-30 10:33:13.348 GMT [42] DEBUG:  xmin required by slots: data 0, catalog 0
2024-09-30 10:33:13.350 GMT [42] DEBUG:  starting up replication origin progress state
2024-09-30 10:33:13.351 GMT [42] DEBUG:  reading stats file "pg_stat/pgstat.stat"
2024-09-30 10:33:13.351 GMT [42] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2024-09-30 10:33:13.351 GMT [42] DEBUG:  MultiXact member stop limit is now 4294914944 based on MultiXact 1

# Unhandled error between tests
-------------------------------
1 | import{b as Ge,d as se,e as Ce,f as We,g as X,h as G,i as ie,j as re}from"./chunk-SJVDOE3S.js";import{A as Q,B as Ae,C as $,D as Me,E as Y,F as Ve,G as Re,a as ze,b as je,c as ae,d as le,e as ce,f as de,g as ue,h as pe,i as fe,j as me,k as he,l as ye,m as ge,n as we,o as xe,p as be,q as C,r as ve,s as Ee,t as Pe,u as _e,v as Te,w as Fe,x as Se,y as W,z as De}from"./chunk-7M5G4I52.js";import"./chunk-TAZBUB4Z.js";import{a as ne,b as oe,c as Qe,d as $e}from"./chunk-3BCYDNSE.js";import{b as Le,c as qe,d as r,e as p,f,g as S,i as h}from"./chunk-Y3AVQXKT.js";h();h();h();var et=new Error("timeout while waiting for mutex to become available"),tt=new Error("mutex already locked"),Ke=new Error("request for lock canceled"),Xe=function(d,t,e,s){function c(n){return n instanceof e?n:new e(function(a){a(n)})}return new(e||(e=Promise))(function(n,a){function u(o){try{l(s.next(o))}catch(m){a(m)}}function i(o){try{l(s.throw(o))}catch(m){a(m)}}function l(o){o.done?n(o.value):c(o.value).then(u,i)}l((s=s.apply(d,t||[])).next())}

RuntimeError: call_indirect to a null table entry (evaluating 'this.mod._pg_initdb()')
      at postgres.wasm.wasm-function[InputFunctionCall]
      at postgres.wasm.wasm-function[OidInputFunctionCall]
      at postgres.wasm.wasm-function[InsertOneValue]
      at postgres.wasm.wasm-function[boot_yyparse]
      at postgres.wasm.wasm-function[BootstrapModeMain]
      at postgres.wasm.wasm-function[pg_initdb]
      at /home/vsts/work/r1/a/_typescript/node_modules/@electric-sql/pglite/dist/index.js:1:10623
-------------------------------

pmp-p commented 1 month ago

Why does "bun test v1.1.29 (6d43b366)" appears in the initdb flow ? is something trying to run concurrently ?

i don't get where this block is coming from

bun test v1.1.29 (6d43b366)

tests/facilities.test.ts:
popen failure: Function not implemented

jamesgpearce commented 1 month ago

I just started getting this in my tests after upgrading from 0.2.7 to 0.2.10, using Jest.

Looks like 0.2.8 is where it starts failing. I can try to isolate more tomorrow...

    RuntimeError: memory access out of bounds

      at _emscripten_memcpy_bulkmem (wasm:/wasm/postgres.wasm-028403ca:1:7169865)
      at __memcpy (wasm:/wasm/postgres.wasm-028403ca:1:7169850)
      at __fwritex (wasm:/wasm/postgres.wasm-028403ca:1:7191391)
      at fwrite (wasm:/wasm/postgres.wasm-028403ca:1:7191480)
      at internal_putbytes (wasm:/wasm/postgres.wasm-028403ca:1:6002810)
      at socket_putmessage (wasm:/wasm/postgres.wasm-028403ca:1:6003857)
      at pq_endmessage (wasm:/wasm/postgres.wasm-028403ca:1:5951676)
      at EmitErrorReport (wasm:/wasm/postgres.wasm-028403ca:1:3898103)
      at errfinish (wasm:/wasm/postgres.wasm-028403ca:1:3890102)
      at ShutdownXLOG (wasm:/wasm/postgres.wasm-028403ca:1:1501253)
      at pg_proc_exit (wasm:/wasm/postgres.wasm-028403ca:1:1942531)
      at pg_shutdown (wasm:/wasm/postgres.wasm-028403ca:1:476636)
      at Object.Module._pg_shutdown (node_modules/@electric-sql/pglite/dist/postgres.js:9554:118)
      at H._pg_shutdown [as close] (node_modules/@electric-sql/pglite/src/pglite.ts:462:7)

UPDATE: I was closing the database too quickly and it hadn't completed unlistening to a channel. In 0.2.8 and above this seems important!

samwillis commented 1 month ago

@jamesgpearce

I was closing the database too quickly and it hadn't completed unlistening to a channel. In 0.2.8 and above this seems important!

A few versions before 0.2.8 had a bug that was resulting in a .close() not checkpointing the WAL. Postgres then always had to replay the WAL on start. We fixed that, but it may be something odd is happen as a result (we may have messed something)

Is the error happing on restart or after the close call?

I'm keen to reproduce this so we can sure it's prevented. Do you have a rough idea of the sequence that triggers it?

rhuanbarreto commented 1 month ago

I'm working on checking if it's a race condition. We are using pglite in the test suite but we are not initializing the DB in the beforeAll phase. So maybe this is generating a race condition

jamesgpearce commented 1 month ago

I'm keen to reproduce this so we can sure it's prevented. Do you have a rough idea of the sequence that triggers it?

I'm tempted to say I'm in the wrong! I didn't await the unlisten (the async handle that is returned from listen) before immediately closing. Awaiting that (or putting in a tiny sleep) was enough to prevent the OOB error.

electric-sql / pglite

[Bug] Out of bounds memory access #339