Closed elcritch closed 1 year ago
I was just looking into this, and any help would be greatly appreciated. For context, from the stacktrace, it seems this is triggered when reading a value from an in-memory SQLiteDatastore (.contains): https://github.com/codex-storage/nim-codex/blob/7d9a969d310b8c8733b0ffeb7794cdb8ba14e5df/tests/codex/sales/testsales.nim#L197. This only fails during coverage and not during the normal test run.
Really? Yeah, I could look into it in a bit. I've not used sqlite3 with Nim before, but maybe a fresh pair of eyes will help.
This only fails during coverage and not during the normal test run
That's annoying! Have you been able to reproduce it fairly reliably?
Thank you, much appreciated! I've only seen it in CI. Running coverage locally (make -j{ncpu} coverage) does not fail for me (arm64 macOS).
Are we somehow using the same instance concurrently, from two different processes?
Ugh, I wonder if it's something related to the code coverage? You actually have to have GCC compile extra stuff. Have you tried googling just gcc, fprofile-arcs or ftest-coverage, and sqlite3 issues?
Are we somehow using the same instance concurrently, from two different processes?
I don't believe so. The CI coverage runs on its own instance. It appears to run just coverage in the ci.yml and Makefile.
Are we somehow using the same instance concurrently, from two different processes?
It's an interesting thought, specifically looking at the top of the stacktrace shows testasyncstatemachine, but coverage fails while testing testsales.
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(510) main
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(500) NimMain
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(491) PreMain
/home/runner/work/nim-codex/nim-codex/tests/codex/utils/testasyncstatemachine.nim(114) PreMainInner
/home/runner/work/nim-codex/nim-codex/vendor/asynctest/asynctest/templates.nim(34) atmcodexatssalesatstestsalesdotnim_Init000 # <= /codex/sales/testsales.nim
It's an interesting thought, specifically looking at the top of the stacktrace shows testasyncstatemachine, but coverage fails while testing testsales.
It looks like testasyncstatemachine.nim is a utility module. It's probably just being called / used from testsales.nim somewhere?
It looks like testasyncstatemachine.nim is a utility module. It's probably just being called / used from testsales.nim somewhere?
testasyncstatemachine.nim is its own test suite. AFAIK, it's not being imported or called from testsales. I really don't know how or why it's appearing in the same stack trace. The line numbers don't make sense either, as testasyncstatemachine.nim only has 135 lines.
Does make -j{ncpu} coverage use multiple threads somehow to split up execution? I'm wondering whether manually changing it to make coverage in CI would solve anything...
Does make -j{ncpu} coverage use multiple threads somehow to split up execution? I'm wondering whether manually changing it to make coverage in CI would solve anything...
As far as I can tell, make -j${ncpu} coverage doesn't do anything for the Nim pieces, only maybe for dependencies. The Nim compiler is single-threaded. I ran it and checked in Activity Monitor to confirm. There are two Nim processes, but one is the config.nims test script calling the compiler.
This is the line where Make runs the test: https://github.com/codex-storage/nim-codex/blob/7227a4a38dbff101ca85dd1b3231741ecdf36b1d/Makefile#L122
Our unit tests appear to be single-threaded too. Or rather, there are two threads, which I'm guessing are one async thread and one network or sqlite thread?
Hmmm, I searched and don't see any places where another thread would be started in codex explicitly. It looks like the sqlite wrapper wraps the query commands and doesn't do async until they're done.
Also, a bit of searching didn't turn up any issues with gcc coverage and sqlite3, which is good.
The only other thing I see is this bit in testsales.nim:
teardown:
  await repo.stop()
  await sales.stop()
Could it be stopping the store and then trying to run more queries while sqlite/datastore is in some indeterminate state?
@emizzle did it work?! Either your change or the repo-stop one. https://github.com/codex-storage/nim-codex/actions/runs/5594186633/jobs/10228762718
Or does it need to run a few times to be sure?
Could it be stopping the store and then trying to run more queries while sqlite/datastore is in some indeterminate state?
This could be the case, but I think we would have the issue pop up prior to the changes in this branch.
The other change I made was to not wait for cancellations in slotqueue.stop. I still don't quite know why this was causing issues (had seen SIGSEGVs as well), but am investigating.
This could be the case, but I think we would have the issue pop up prior to the changes in this branch.
It's possible issues would not have occurred before because all the test sales used to finish before the repo would close?
It looks like the PR's related to queuing market requests. I could see that changing the order to something like:
asyncQueue:
- sales market txn req
- sales market txn req
- close repo
- sales market check req # new based on market queue triggering it?
It could be a similar situation with slotqueue.stop. I'm not sure and would have to dive into the code more, but it's my hunch.
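To make the hunch concrete, here is a rough sketch of that ordering. It is written in Python asyncio rather than Nim/chronos so it runs on its own, and Repo, worker, and run are made-up names, not nim-codex code. It only assumes what is described above: a queue keeps feeding requests that touch the repo, and one of them arrives after the repo has been stopped. In the sketch the late query raises an error; in the real code the equivalent may be what shows up as the crash, but that part is a guess.

```python
import asyncio


class Repo:
    """Hypothetical stand-in for the datastore-backed repo (not the nim-codex type)."""

    def __init__(self):
        self.open = True

    async def contains(self, key):
        await asyncio.sleep(0)  # yield to the event loop, as an async query would
        if not self.open:
            raise RuntimeError(f"query for {key!r} after repo.stop()")
        return True

    async def stop(self):
        self.open = False


async def worker(repo, queue, results):
    # Drain queued requests forever; every request touches the repo.
    while True:
        key = await queue.get()
        try:
            results.append((key, await repo.contains(key)))
        except RuntimeError as exc:
            results.append((key, str(exc)))
        finally:
            queue.task_done()


async def run(stop_repo_first):
    repo, queue, results = Repo(), asyncio.Queue(), []
    task = asyncio.create_task(worker(repo, queue, results))
    for key in ("req-1", "req-2"):
        queue.put_nowait(key)
    await queue.join()               # the first two requests complete normally
    queue.put_nowait("late-check")   # a new request triggered by the queue
    if stop_repo_first:
        await repo.stop()            # repo closes while work is still queued
        await queue.join()
        task.cancel()
    else:
        await queue.join()           # let queued work finish first
        task.cancel()                # stop the consumer
        await repo.stop()            # only then close the repo
    await asyncio.gather(task, return_exceptions=True)
    return results
```

Calling asyncio.run(run(True)) records an error for late-check, while asyncio.run(run(False)) completes all three requests, which is why stopping the consumers before the repo seems like the safer teardown order regardless.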
Hopefully it's working and you can figure out why. ;)
It's a great point. I think at the end of the day, you're right, and the change should be made regardless.
I'll be curious what you find out. Feel free to close this issue, but I'll follow up on it later too if it looks good.
Let's keep it open and I'll post my findings.
Thanks for the help!
I made three changes, in addition to the one you made:
cancelAndWaits
in the slotqueue
module, which is what was causing the issues to begin with, and we wantthen
utilsales
module
So far, https://github.com/codex-storage/nim-codex/actions/runs/5594986782/jobs/10230418662 is passing all tests and the coverage is not failing (although uploading of coverage data seems to be, which is not related and we've seen this in the past).
I will try backing out #4 to see if it's necessary, however I'm a bit hesitant because it feels like it's one of those things that doesn't hurt to have.
Removing #4 passed (https://github.com/codex-storage/nim-codex/actions/runs/5595158876/jobs/10230761515)
Removing #3 passed (https://github.com/codex-storage/nim-codex/actions/runs/5595263076/jobs/10230971062)
Removing #2 failed (https://github.com/codex-storage/nim-codex/actions/runs/5595413677/jobs/10231256862)
So at the end of the day, the only change that was truly needed was the one you made! #3 and #4 can't hurt, but do add additional changes. Will contemplate keeping them or not.
Thanks again @elcritch
Describe the bug
I saw this in a failed coverage test and it looks like a possible race condition?
It seemed good to note.
To Reproduce
Run lots of tests. :) If it's a race condition of some sort it'll be hard to find?
Expected behavior
Not segfaulting!
Environment:
CI, code coverage test.
Additional context
Full stacktrace: