WorldBrain / Memex

Browser extension to curate, annotate, and discuss the most valuable content and ideas on the web. As individuals, teams and communities.
https://worldbrain.io

Backup change log doesn't log every change #681

Open · poltak opened this issue 5 years ago

poltak commented 5 years ago

If reporting a bug:

  1. Can you describe the problem and bug in more detail?

Originally explained in this slack thread: https://worldbrain.slack.com/archives/CBFJ8V490/p1546572809002800

It effectively means that our backups may not be an accurate representation of the data stored in the DB.

  2. How can we replicate the issue?

Adding tags method (in the background script console):

  1. backup.recordingChanges = true
  2. addTags = (url, tags) => storageMan.createObjects('tags', tags.map(name => ({ name, url })))
  3. addTags(MY_URL, ['a','b','c','d','e'])
  4. check backupChanges collection contents

Replace MY_URL in step 3 with the URL of a page that you have indexed.

In step 4, you should see 5 entries with {..., operation: 'create', collection: 'tags'} for each tag you added in step 3. However, a lot of the time only 2, 3, or 4 entries are created. This is the issue.
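
Put together, the steps above look roughly like this in the background script console (a sketch; `backup` and `storageMan` are assumed to be exposed as globals there, and the URL is a placeholder for a page you've indexed):

```ts
// Run in the background script console.
backup.recordingChanges = true

const addTags = (url, tags) =>
    storageMan.createObjects('tags', tags.map(name => ({ name, url })))

// Replace the URL with that of a page you have indexed.
await addTags('https://example.com/some-indexed-page', ['a', 'b', 'c', 'd', 'e'])

// Now inspect the backupChanges collection: 5 entries of
// { ..., operation: 'create', collection: 'tags' } are expected,
// but often only 2, 3, or 4 show up.
```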

  3. Expected behavior (i.e. solution)

All entries get added for each change.

  4. Error stack (from extension crash page)

  5. Other comments

Other ops that modify multiple docs, like a multi-delete, produce similar results: not all changes get recorded.

I have spent a bit of time looking into this, but have yet to find a reason 🤕

One very weird thing I've noticed is that if you add a console.log call somewhere in the triggered code (like here: https://github.com/WorldBrain/Memex/blob/master/src/backup/background/storage.ts#L40-L45), then all the changes get recorded :S

If, instead of console.log-ing, you create some class props in BackupStorage and mutate them each time registerChange is called and successfully runs its putObject call (e.g. count the number of calls, or keep track of each call's args), they get set as expected. It's just the data in the DB that ends up in an unexpected state. I originally thought that maybe the Dexie table hooks weren't always getting triggered, but this indicates to me that they are.
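
Concretely, the instrumentation looked something like this (a sketch; the registerChange signature and the table handle are assumptions, not the actual storage.ts code):

```ts
import Dexie from 'dexie'

// Sketch of the debug instrumentation described above.
class BackupStorage {
    // Debug props: these end up with the expected values (e.g. 5 calls for
    // 5 tags) even when the backupChanges table itself is missing entries.
    callCount = 0
    callArgs: Array<{ collection: string; pk: any; operation: string }> = []

    constructor(private backupChangesTable: Dexie.Table<any, number>) {}

    async registerChange(collection: string, pk: any, operation: string) {
        await this.backupChangesTable.put({
            timestamp: Date.now(),
            collection,
            objectPk: pk,
            operation,
        })
        // Only mutated once the put call has successfully resolved.
        this.callCount += 1
        this.callArgs.push({ collection, pk, operation })
    }
}
```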

ShishKabab commented 5 years ago

Seems to be a timing issue or something like that. I've checked the Dexie docs to see whether the hook can return a promise, but that doesn't seem to be the case. One thing I've noticed, though, is that the hooks get a transaction passed in: http://memex.link/BEf1lhEHQ/dexie.org/docs/Table/Table.hook(%27updating%27)

Maybe if we make a Dexie-specific option to the CRUD methods in the Storex backend called dexie.transaction and we use that one to write to the change log, that changes something?
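
Something like this is the shape I mean (purely hypothetical: neither this option nor necessarily this exact createObject signature exists in Storex today):

```ts
// Hypothetical: pass the transaction Dexie hands the hook subscriber through
// to the Storex backend, so the change-log write joins the same transaction.
async function logChangeInSameTransaction(
    storageManager: any,
    change: object,
    transaction: unknown, // the transaction Dexie passes to hook subscribers
) {
    await storageManager.backend.createObject('backupChanges', change, {
        dexie: { transaction },
    })
}
```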

The weird thing, apparently, is that delaying the write (the console.log introduced a significant delay) solves the issue. This may mean we need to queue the backup changes internally. Maybe that would also remove the need to add the backupChanges table to every transaction. However, it'd be awesome if we could include the backup change write in the transaction, so we're absolutely sure every modification also gets logged.

Maybe one thing that messes things up is the async stuff. If Dexie continues with its normal life once the synchronous execution of the hook has completed (meaning anything that returns a promise finishes later), that could introduce the timing issue. That may be the source of the problem.
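
To illustrate (a sketch; registerChange stands in for our change-log write): Dexie calls hook subscribers synchronously and ignores any promise they return, so the async write kicked off inside the hook is fire-and-forget.

```ts
import Dexie from 'dexie'

declare const db: Dexie
declare const backupStorage: {
    registerChange(collection: string, pk: any, operation: string): Promise<void>
}

// The subscriber returns immediately; Dexie can move on (and commit the
// transaction) without waiting for the dropped promise, which would explain
// the lost change-log entries.
db.table('tags').hook('creating', (pk, obj, transaction) => {
    backupStorage.registerChange('tags', pk, 'create')
})
```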

poltak commented 5 years ago

Thanks for looking into this. You got further than me.

> Maybe if we make a Dexie-specific option to the CRUD methods in the Storex backend called dexie.transaction and we use that one to write to the change log

What would that option change in the Storex backend code?

It definitely feels like a timing issue. I saw that Promises returned from the hook subscriber aren't handled by Dexie, although I don't really understand how that would mean some of them don't get DB access (from what I saw debugging: all subscribers are triggered, and all the DB ops run by the subscribers resolve rather than rejecting).

I tried playing around with the transaction arg in the subscriber, but there doesn't seem to be much we can actually do with it apart from lock, unlock, and abort. Locking it, running our additional op, then unlocking doesn't seem to change anything. Do you know what the docs mean by "A subscriber may use the given transaction object to do additional operations on the database"?
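
For reference, that attempt looked roughly like this (a sketch; lock()/unlock() per the transaction methods mentioned above, and registerChange again stands in for our change-log write):

```ts
db.table('tags').hook('creating', (pk, obj, transaction) => {
    // Try to hold the transaction open until our extra write settles,
    // then release it either way.
    transaction.lock()
    backupStorage
        .registerChange('tags', pk, 'create')
        .then(() => transaction.unlock(), () => transaction.unlock())
})
```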

> This may mean we need to queue the backup changes internally. Maybe that would also remove the need to add the backupChanges table to every transaction.

I thought about this too before, although I tried not to, in favour of finding a simpler solution that doesn't involve adding new stuff. Though thinking about it more now, it could be pretty easy to implement. Our old prio queue that we used for all DB ops a while back was just a minor extension of this async queue package: https://www.npmjs.com/package/queue. That would work if we still can't find a way to get past this.
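
A sketch of what that could look like with that queue package (callback-style jobs; the names here are made up):

```ts
import queue from 'queue'

declare const backupStorage: {
    registerChange(collection: string, pk: any, operation: string): Promise<void>
}

// Serialize all change-log writes through a single concurrency-1 queue,
// decoupled from the originating Dexie transaction.
const changeLogQueue = queue({ autostart: true, concurrency: 1 })

function enqueueBackupChange(collection: string, pk: any, operation: string) {
    changeLogQueue.push(cb => {
        backupStorage
            .registerChange(collection, pk, operation)
            .then(() => cb(), cb)
    })
}
```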

ShishKabab commented 5 years ago

I thought one would be able to pass the transaction object to different modification operations, but it seems I'm mistaken (I probably confused it with Sequelize). That's what I meant by passing in the transaction as an option.

It kinda sucks that the API is this limited: async ops within transactions are something we've been doing for at least 20 years in the SQL world. The only things I can think of right now:

  1. Making sure things happen in order, avoiding awaits. In theory the synchronous part of a promise executes immediately, so it should be possible to have the write start before we return from the hook. This means placing a console.log statement at the beginning and the end of the hook, and inside the createObject function, and checking whether they indeed execute in the right order (see the sketch below).
  2. Queuing these writes up and doing them outside of the transaction, in the hope that no data gets lost.
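
A sketch of the ordering check in option 1 (registerChange again stands in for the actual change-log write; table and db handles are assumed):

```ts
import Dexie from 'dexie'

declare const db: Dexie
declare const backupChangesTable: Dexie.Table<any, number>

async function registerChange(collection: string, pk: any, operation: string) {
    console.log('write start', pk) // runs synchronously when called
    await backupChangesTable.put({ timestamp: Date.now(), collection, objectPk: pk, operation })
    console.log('write done', pk) // runs once the IndexedDB put resolves
}

db.table('tags').hook('creating', (pk, obj, transaction) => {
    console.log('hook start', pk)
    registerChange('tags', pk, 'create')
    console.log('hook end', pk)
})

// Expected interleaving per tag: "hook start", "write start", "hook end",
// then "write done" some ticks later. If "write done" never shows up for
// some tags, the write is being dropped after the transaction moves on.
```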