Updating js-db to 5.x.x (integrating RocksDB and DB-native concurrency control)

emmacasolin commented 1 year ago

Description

Updating CI/CD including:

Continuous benchmarking
Staging branch
Advanced type checking

[x] 1. Create new staging branch
[x] 2. Synchronise changes from js-logger
[X] 3. update @matrixai/db to 5.0.1,
[X] 4. fix API changes for DB
[x] 3. Fix failing tests
[x] 4. Integrate fast-check and update any relevant tests.

Final checklist

[x] Domain specific tests
[x] Full tests
[x] Updated inline-comment documentation
[x] Lint fixed
[x] Squash and rebased
[x] Sanity check the final build

emmacasolin commented 1 year ago

The problem with updating js-logger here (which we need to do in order to update it in PK) is that it also means we need to update the DB, but this also means incorporating all of the new (breaking) changes from 5.0.0 of js-db. I think I've made the correct changes to the source code, but there are several failing tests. This is to be expected though, since js-db has undergone some major changes. Most of the failures are to do with concurrency, which makes sense given the focus of most of the recent changes to the db.

CMCDragonkai commented 1 year ago

Hold on this PR, the js-db update is more pressing first. And we don't need to update the EFS yet. The js-db 5.0.0 is first being updated in PK. EFS can stay on the original js-db.

CMCDragonkai commented 1 year ago

What you can do is to merge in the non-js-db and non-js-logger changes in first. There's just a bunch of CI/CD stuff.

Can you split up the commit into a separate PR for just the staging and ci/cd and other chores.

emmacasolin commented 1 year ago

Hold on this PR, the js-db update is more pressing first. And we don't need to update the EFS yet. The js-db 5.0.0 is first being updated in PK. EFS can stay on the original js-db.

Ok. EFS won't work inside PK if it's running the old DB though, since the old DB uses the old logger, which conflicts with the new logger used by the new DB in PK, so it will need to be updated here eventually.

CMCDragonkai commented 1 year ago

Actually you can do it. You just won't be able to pass a "child" logger into the DB that EFS is using. So you can just remove that parameter entirely.

emmacasolin commented 1 year ago

Now that I've updated the logger for all of the other libraries though we can no longer pass a logger through to any of the other libraries from EFS. I think this is only an issue for js-workers though, which is used only in the benchmarks and in one of the tests (efsWorker.test.ts), so we could just not pass a logger into those.

emmacasolin commented 1 year ago

https://github.com/MatrixAI/js-encryptedfs/pull/67/commits/ee0b9e2171a9cf89f4905cbff7058e1ba8924f39 now contains all of the CI/CD changes but doesn't update the logger or DB. As such, it doesn't pass a logger through to any of the workers since js-workers uses a different version of js-logger. All of the tests should be passing on that commit, but they don't pass on the most recent commit due to concurrency changes from the latest DB.

CMCDragonkai commented 1 year ago

Once #68 is merged, rename this PR to focus on js-db 5.x upgrade. Logger changes should be eliminated here. Since that will be put into #68. Although once rebased, you can also remove the comment.

CMCDragonkai commented 1 year ago

@emmacasolin please update to using 1.3.6 of js-workers. It is now compatible with swc.

The quickfix is to swap from using new X(); to new this(); wherever we have static async creators.

This needs to be fixed in EncryptedFS.ts. Change to:

    const efs = new this({

To make it usable. Then swc should be enabled.

CMCDragonkai commented 1 year ago

The same fix has to be applied to PK as well under staging branch. Can you do it at the same time as you do you js-logger updates there.

tegefaulkes commented 1 year ago

0084570 fixes #69. There still are failing tests in this PR but it passed when tested on top of staging.

CMCDragonkai commented 1 year ago

Creation of temporary directory with os.tmpdir() should still be using path.join as normal.

CMCDragonkai commented 1 year ago

@emmacasolin this PR needs to be split up. The js-db changes should be moved into a new PR.

This PR should only focus on synchronising changes, updating the js-logger, and the fix that @tegefaulkes just added.

CMCDragonkai commented 1 year ago

I suspect that the test failures are coming from the js-db changes. The js-db update should be postponed till later. @tegefaulkes

CMCDragonkai commented 1 year ago

Actually I was confused, the recent fix should go to #68 and #67. Will rename this PR to suit.

@tegefaulkes can you reset the commit you just pushed?

CMCDragonkai commented 1 year ago

When this is being worked on all:

      // @ts-ignore - version of js-logger is incompatible (remove when js-db updates to 5.* here)

Should be removed from the source, tests and benchmarks as js-logger will be compatible and we'd have completed the upgrade.

Unlike in Polykey where it has a commit that we can revert, this PR won't have that convenience.

tegefaulkes commented 1 year ago

Removed the fix commit from here and pushed it to staging.

tegefaulkes commented 1 year ago

Fast-check can be used to combine a number of concurrency tests together. For example, concurrent file writes using a number of methods can be the same test. for these tests the the result matters less than avoiding transaction conflicts. we still want to check that something has been written however.

tegefaulkes commented 1 year ago

The INodeManager is using a LockBox along side the the database transactions. I think we will want to make use of the transaction locks now.

FileDescriptor has its own Lock for locking on cursor updates. This one should be left alone.

tegefaulkes commented 1 year ago

looking at the EncyptedFs > write > can copy files test failure. it has to create the destination file and the destination doesn't exist after creation causing the issue here. Seems that copyFile uses this._open to open a FD AND create the file. but this is not composed with the wider transaction thus causing the problem.

I'm sure that _open is used inside of other methods such as appendFile and open. I think the problem only happens when _open has to create the file when opening it. In any case we need to be able to compose it with larger transactions.

CMCDragonkai commented 1 year ago

Yes swap to use transaction locks instead of LockBox.
Where INodeManager is using tran as the first parameter, swap it to the last parameter in preparation for the decorator usage
When an fd is opened, it has to be opened to a created inode.

ghost commented 1 year ago

👇 Click on the image for a new way to code review

- Make big changes easier — review code in small groups of related files - Know where to start — see the whole change at a glance - Take a code tour — explore the change with an interactive tour - Make comments and review — all fully sync’ed with github [Try it now!](https://app.codesee.io/r/reviews?pr=67&src=https%3A%2F%2Fgithub.com%2FMatrixAI%2Fjs-encryptedfs)

Legend

CodeSee Map Legend

tegefaulkes commented 1 year ago

The INodeManager is now using the transaction locks. So far as I can tell, the tran parameter is already the last parameter where it's used.

I'm not sure what you mean by point 3.

CMCDragonkai commented 1 year ago

For point 3, I just mean when fds are opened. They are always opened to a existing inode. If a fd is alive, inode is alive. You can delete names, but you cannot delete inodes until all links are gone, that includes fds and other names (aka hardlinks).

So yes when opening a fd, it has to check if it needs to create an inode at the same time. So race conditions can occur here due to concurrent creation. To avoid it, observe how normal fs deals with concurrent opens to the same (non-existent inode), and proceed to mirror the behaviour.

Same applies to directories, since everything on the filesystem is an inode.

tegefaulkes commented 1 year ago

The problem here was that copyFile uses _open to create a fd pointing to the destination. It's using the create flag so that the file is created if it didn't exist. At the time this was done in a separate transaction so the copyFile's overall transaction didn't see the file get created resulting in an error.

tegefaulkes commented 1 year ago

For the concurrency tests, we should only really check for if 1 method succeeds and others fail. We shouldn't really care what the reason is besides that it was specifically a ErrorEncryptedFSError and not a transaction conflict. This way I can use the fast-check scheduler to simplify and combine tests while ending up with more robust tests.

An example of this is the concurrent inode creation test. I can combine the 4 existing tests into a single tests that will check random permutations of open, mknod and mkdir. This should cover permutations that we don't already test.

tegefaulkes commented 1 year ago

Is it really worth checking the result of a concurrent write? The expectation is that the writes should complete without error. Should not throw a transaction conflict. What is actually written to the file is technically undefined behaviour so checking what is written feels like unnecessary complexity added to the test. We only really need to check that the number of bytes read is within the expected range.

tegefaulkes commented 1 year ago

re-based on staging.

CMCDragonkai commented 1 year ago

Is it really worth checking the result of a concurrent write? The expectation is that the writes should complete without error. Should not throw a transaction conflict. What is actually written to the file is technically undefined behaviour so checking what is written feels like unnecessary complexity added to the test. We only really need to check that the number of bytes read is within the expected range.

Yes in terms of checking, you can apply a predicate check to the patterns of the resulting concurrent write. This way during fast check, on every iteration they do, which they are changing up the concurrent operation order, we get a good concurrent coverage. If our predicates fail for a certain order, these tests will tell us.

The predicate can be a regular expression. That's how I wrote it already, but I lacked some sort of async scheduler to actually exhaustively check concurrent coverage. Fast check should be capable of exhausting the variations.

CMCDragonkai commented 1 year ago

Also we don't need to change over all the tests. Just fix any of the concurrent tests and add any additional ones to ensure we have consistent behaviour especially relating to transaction conflicts.

The EFS is just getting more bug free, but it will be more important to work together on integrating the tracing and queueing to address testnet tests starting tomorrow.

tegefaulkes commented 1 year ago

This should be good to merge now. Pending review if desired.

CMCDragonkai commented 1 year ago

Can you rerun the benchmarks and show the comparison here. Since the db has changed, I'm expecting it should be alot faster.

CMCDragonkai commented 1 year ago

Oh yea you need to provide the type of the lock. So don't worry.

CMCDragonkai commented 1 year ago

Ok so for all the concurrent tests where you have replaced the test, change to using a regular expression to match the original predicates on the output. I've given some examples above.

For the new tests, this can also be done if you see an easy way to write the regular expression. Prefer this over simply checks like 3.

Then proceed to merge to staging.

On the staging branch, regenerate docs and redo the benchmarks.

CMCDragonkai commented 1 year ago

There's a documented way of doing delaying the actual async functions for the fast check scheduler: https://github.com/dubzzz/fast-check/issues/564. See the relevant commits. I also asked a question upstream to see how it should be done.

The docs also got moved around recently and the relevant commits point to: https://github.com/dubzzz/fast-check/blob/main/packages/fast-check/documentation/RaceConditions.md

There's no inbuilt helper for this.

The docs suggest:

const scheduleCall = <T>(s: Scheduler, f: () => Promise<T>) => {
  s.schedule(Promise.resolve("USE THIS AS A LABEL...")).then(() => f());
}
// Calling doStuff will be part of the task scheduled in s
scheduleCall(s, () => doStuff())

It's very similar to what you already did with the gateFactory.

Using the above method you also don't need a separate factory.

The last interesting thing is fc.scheduledModelRun. But there is no documentation how to use it. It seems to provide a way of model based testing which requires a sort of state machine model first then you're essentially doing transitions in lock step between the abstract and the real machine. This can be used for EFS, but it would only be done at a very high level considering the entire FS state as a model. This would be a completely different set of tests. Best used for more "consistent" systems (like distributed databases). Not really the EFS in this case. We could apply this to certain stateful domains inside PK though. For example the OCL concepts and ACL.

tegefaulkes commented 1 year ago

RIght, just need to squash and checks. Then this will be ready to merge.

CMCDragonkai commented 1 year ago

Did you check the rest of the tests you replaced to ensure that no predicates are missing? I didn't go through all of them.

tegefaulkes commented 1 year ago

Yes, I went through all of the ones I updated.

tegefaulkes commented 1 year ago

Cleaned up the commit history.

tegefaulkes commented 1 year ago

Just waiting on CI now.

CMCDragonkai commented 1 year ago

I've asked about the integrating timeouts between fast check and jest: https://github.com/dubzzz/fast-check/issues/3084

MatrixAI / js-encryptedfs