codex-storage / nim-codex

Decentralized Durability Engine
Apache License 2.0
63 stars 23 forks source link

[BUG] Sqlite race condition? #486

Closed elcritch closed 1 year ago

elcritch commented 1 year ago

Describe the bug

I saw this in a failed coverage test and it looks like a possible race condition?

(https://github.com/codex-storage/nim-codex/actions/runs/5583954068/jobs/10227140646#step:4:473)9) sqlite3_bind_text
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(84622) bindText
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(84572) vdbeUnbind
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(26039) sqlite3_mutex_enter
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(26584) pthreadMutexEnter
/home/runner/work/nim-codex/nim-codex/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(631) signalHandler

It seemed good to note.

To Reproduce

Run lots of tests. :) If it's a race condition of some sort it'll be hard to find?

Expected behavior

Not sigfaulting!

Environment:

CI, code coverage test.

Additional context

Full stacktrace:

testCodex: tpp.c:83: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Traceback (most recent call last, using override)
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(510) main
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(500) NimMain
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(491) PreMain
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(114) PreMainInner
/home/runner/work/nim-codex/nim-codex/vendor/asynctest/asynctest/templates.nim(34) atmcodexatssalesatstestsalesdotnim_Init000
/home/runner/work/nim-codex/nim-codex/vendor/asynctest/asynctest/templates.nim(25) colonanonymous
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncloop.nim(1203) waitFor
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncloop.nim(295) poll
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(398) internalContinue
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(372) futureContinue
/home/runner/work/nim-codex/nim-codex/codex/utils/asyncstatemachine.nim(76) scheduler
/home/runner/work/nim-codex/nim-codex/codex/utils/asyncstatemachine.nim(55) run
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(369) futureContinue
/home/runner/work/nim-codex/nim-codex/codex/utils/asyncstatemachine.nim(57) run
/home/runner/work/nim-codex/nim-codex/codex/utils/asyncstatemachine.nim(45) run
/home/runner/work/nim-codex/nim-codex/codex/sales/states/downloading.nim(34) run
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(369) futureContinue
/home/runner/work/nim-codex/nim-codex/codex/sales/states/downloading.nim(51) run
/home/runner/work/nim-codex/nim-codex/codex/sales/reservations.nim(272) markUsed
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(369) futureContinue
/home/runner/work/nim-codex/nim-codex/codex/sales/reservations.nim(274) markUsed
/home/runner/work/nim-codex/nim-codex/codex/sales/reservations.nim(140) get
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(369) futureContinue
/home/runner/work/nim-codex/nim-codex/codex/sales/reservations.nim(142) get
/home/runner/work/nim-codex/nim-codex/codex/sales/reservations.nim(130) exists
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(369) futureContinue
/home/runner/work/nim-codex/nim-codex/codex/sales/reservations.nim(135) exists
/home/runner/work/nim-codex/nim-codex/vendor/nim-datastore/datastore/datastore.nim(43) contains
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(369) futureContinue
/home/runner/work/nim-codex/nim-codex/vendor/nim-datastore/datastore/datastore.nim(44) contains
/home/runner/work/nim-codex/nim-codex/vendor/nim-datastore/datastore/datastore.nim(16) has
/home/runner/work/nim-codex/nim-codex/vendor/nim-datastore/datastore/sql/sqliteds.nim(32) has
/home/runner/work/nim-codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(369) futureContinue
/home/runner/work/nim-codex/nim-codex/vendor/nim-datastore/datastore/sql/sqliteds.nim(39) has
/home/runner/work/nim-codex/nim-codex/vendor/nim-datastore/datastore/sql/sqliteutils.nim(85) query
/home/runner/work/nim-codex/nim-codex/vendor/nim-datastore/datastore/sql/sqliteutils.nim(69) bindParam
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(8[472](https://github.com/codex-storage/nim-codex/actions/runs/5583954068/jobs/10227140646#step:4:473)9) sqlite3_bind_text
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(84622) bindText
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(84572) vdbeUnbind
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(26039) sqlite3_mutex_enter
/home/runner/work/nim-codex/nim-codex/vendor/nim-sqlite3-abi/sqlite3.c(26584) pthreadMutexEnter
/home/runner/work/nim-codex/nim-codex/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(631) signalHandler
SIGABRT: Abnormal termination.
Aborted (core dumped)
stack trace: (most recent call last)
/home/runner/work/nim-codex/nim-codex/vendor/nimbus-build-system/vendor/Nim/lib/system/nimscript.nim(429, 18)
/home/runner/work/nim-codex/nim-codex/codex.nimble(73, 16) testTask
/home/runner/work/nim-codex/nim-codex/codex.nimble(63, 8) testCodexTask
/home/runner/work/nim-codex/nim-codex/codex.nimble(57, 8) test
/home/runner/work/nim-codex/nim-codex/vendor/nimbus-build-system/vendor/Nim/lib/system/nimscript.nim(273, 7) exec
/home/runner/work/nim-codex/nim-codex/vendor/nimbus-build-system/vendor/Nim/lib/system/nimscript.nim(273, 7) Error: unhandled exception: FAILED: build/testCodex [OSError]
make[1]: Leaving directory '/home/runner/work/nim-codex/nim-codex'
make[1]: *** [Makefile:85: test] Error 1
make: *** [Makefile:122: coverage] Error 2
Error: Process completed with exit code 2.
emizzle commented 1 year ago

I was just looking into this, and any help would be greatly appreciated. For context, from the stacktrace, it seems this is triggered when reading a value from an in-memory SQLiteDatastore (.contains): https://github.com/codex-storage/nim-codex/blob/7d9a969d310b8c8733b0ffeb7794cdb8ba14e5df/tests/codex/sales/testsales.nim#L197. This only fails during coverage and not during the normal test run.

elcritch commented 1 year ago

Really? Yah I could look into it in a bit. I've not used sqlite3 with Nim before, but maybe a fresh pair of eyes.

This only fails during coverage and not during the normal test run

That's annoying! Have you been able to reproduce it fairly reliably?

emizzle commented 1 year ago

Thank you, much appreciated! I've only seen it in CI. Running coverage locally (make -j{ncpu} coverage) does not fail for me (arm64 macos).

dryajov commented 1 year ago

Are we somehow using the same instance from concurrently, from two different processes?

elcritch commented 1 year ago

Ugh, I wonder if it's something related to the codecoverage? You have to actually have to have GCC compile extra stuff. Have you tried googling just gcc, fprofile-arcs or ftest-coverage, and sqlite3 issues?

elcritch commented 1 year ago

Are we somehow using the same instance from concurrently, from two different processes?

I don't believe so. The CI coverage runs on it's own instance. It appears to run just coverage in the ci.yml and makefile.

emizzle commented 1 year ago

Are we somehow using the same instance from concurrently, from two different processes?

It's an interesting thought, specifically looking at the top of the stacktrace shows testasyncstatemachine, but coverage fails while testing testsales.

/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(510) main
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(500) NimMain
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(491) PreMain
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(114) PreMainInner
/home/runner/work/nim-codex/nim-codex/vendor/asynctest/asynctest/templates.nim(34) atmcodexatssalesatstestsalesdotnim_Init000 # <= /codex/sales/testsales.nim
elcritch commented 1 year ago

It's an interesting thought, specifically looking at the top of the stacktrace shows testasyncstatemachine, but coverage fails while testing testsales.

It looks like testasyncstatemachine.nim is a utility module. It's probably just being called / used from testsales.nim somewhere?

emizzle commented 1 year ago

It looks like testasyncstatemachine.nim is a utility module. It's probably just being called / used from testsales.nim somewhere?

testasyncstatemachine.nim is its own test suite. AFAIK, it's not being imported or called from testsales. I really don't know how or why it's appearing in the same stack trace. The line numbers don't make sense either as testasyncstatemachine.nim only has 135 lines.

emizzle commented 1 year ago

Does make -j{ncpu} coverage use multiple threads somehow to split up execution? I'm wondering if i changed manually to make coverage in CI, if it would solve anything...

elcritch commented 1 year ago

Does make -j{ncpu} coverage use multiple threads somehow to split up execution? I'm wondering if i changed manually to make coverage in CI, if it would solve anything...

As far as I can tell, the make -j${ncpu} coverage doesn't do anything for the Nim pieces, only maybe for dependencies. The Nim compiler is single threaded. I ran and checked it in Activity Monitor to confirm. There's two Nim processes but one is the config.nims test script calling the compiler.

This is the line where Make runs the test: https://github.com/codex-storage/nim-codex/blob/7227a4a38dbff101ca85dd1b3231741ecdf36b1d/Makefile#L122

Our tests unit tests appear to be single threaded too. Or rather there's two threads which I'm guessing is 1 async and 1 network or sqlite thread?

elcritch commented 1 year ago

Hmmm, I searched and don't see any places where another thread would be started in codex explicitly. It looks like sqlite wrapper wrap the query commands and don't do async until they're done.

Also a bit of searching didn't pop any issues with gcc coverage and sqlite3, which is good.

The only other thing I see is this bit in testsales.nim:

https://github.com/codex-storage/nim-codex/blob/7227a4a38dbff101ca85dd1b3231741ecdf36b1d/tests/codex/sales/testsales.nim#L75C1-L75C1

  teardown:
    await repo.stop()
    await sales.stop()

Could it be stopping the store and then trying to run more queries while sqlite/datastore is in some indeterminate state?

elcritch commented 1 year ago

@emizzle did it work?! Either your change or the repo-stop one. https://github.com/codex-storage/nim-codex/actions/runs/5594186633/jobs/10228762718

Or does it need to run a few times to be sure?

emizzle commented 1 year ago

Could it be stopping the store and then trying to run more queries while sqlite/datastore is in some indeterminate state?

This could be the case, but I think we would have the issue pop up prior to the changes in this branch.

The other change I made was to not wait for cancellations in slotqueue.stop. I still don't quite know why this was causing issues (had seen SIGSEGVs as well), but am investigating

elcritch commented 1 year ago

This could be the case, but I think we would have the issue pop up prior to the changes in this branch.

It's possible issues would not have occurred before because all the test sales used to finish before the repo would close?

It looks like the PR's related to queuing market requests. I could see that changing the order to something like:

asyncQueue:
- sales market txn req
- sales market txn req
- close repo
- sales market check req # new based on market queue triggering it? 

It could be similar situation with slotqueue.stop. I'm not sure and would have to dive into the code more but it's my hunch.

Hopefully it's working and you can figure out why. ;)

emizzle commented 1 year ago

It's a great point. I think at the end of the day, you're right, and the change should be made regardless.

elcritch commented 1 year ago

I'll be curious what you find out. Feel free to close this issue, but I'll follow up on it later too if it looks good.

emizzle commented 1 year ago

Let's keep it open and I'll post my findings.

emizzle commented 1 year ago

Thanks for the help!

emizzle commented 1 year ago

I made three changes, in addition to the one you made:

  1. reinstate cancelAndWaits in the slotqueue module, which is what was causing the issues to begin with, and we want
  2. stop repo after stopping sales (your change)
  3. add cancellation callbacks to the then util
  4. tracked and cancelled marketplace callback handlers in the sales module

So far, https://github.com/codex-storage/nim-codex/actions/runs/5594986782/jobs/10230418662 is passing all tests and the coverage is not failing (although uploading of coverage data seems to be, which is not related and we've seen this in the past).

I will try backing out #4 to see if it's necessary, however I'm a bit hesitant because it feels like its one of those things that doesn't hurt to have.

emizzle commented 1 year ago

Removing #4 passed (https://github.com/codex-storage/nim-codex/actions/runs/5595158876/jobs/10230761515) Removing #3 passed (https://github.com/codex-storage/nim-codex/actions/runs/5595263076/jobs/10230971062) Removing #2 failed πŸŽ‰ (https://github.com/codex-storage/nim-codex/actions/runs/5595413677/jobs/10231256862)

So at the end of the day, the only change that was truely needed was the one you made! #3 and #4 can't hurt, but do add additional changes. Will contemplate keeping them or not.

Thanks again @elcritch πŸ™