Open kentonv opened 6 years ago
Note that I think the choice for Sandstorm to use dump.json was technically mine: it was the default behavior, and it "just worked" when I experimented with packaging EtherCalc. My opinion at the time was that a JSON file was probably a reasonably fast and simple storage format for granular use. (And since it's been a few years and Sandstorm users losing EtherCalc files seems rare, I suspect this failure is largely a fluke.)
EtherCalc does support other storage solutions, particularly Redis, so with a migration, that might be a decent choice. The dump.json code looks to be in here: https://github.com/audreyt/ethercalc/blob/master/db.js
Using dump.json is perfectly reasonable, as long as updates are atomic.
I saw that file before, but it doesn't seem to write dump.json. Ah, looking closer, it appears the file format changed recently, and now many different files are written rather than a single dump.json. Here's the source before these changes:
https://github.com/audreyt/ethercalc/blob/49a314172e8b18ba3eba54e1a306f3e499d61233/db.js#L104-L107
Indeed, we see that dump.json looks like it is written non-atomically -- the code overwrites the existing file directly, so there's some point in time where the file is in a half-updated state.
The new code writes several different files, but unfortunately still writes each file non-atomically:
https://github.com/audreyt/ethercalc/blob/4b5a993a75b6dfff9b779f067c9c3de42c8a7207/db.js#L157-L159
This code needs to be updated to the three-step process I described before; otherwise, data loss is possible (whether or not running on Sandstorm).
Note that db.js is auto-generated from https://github.com/audreyt/ethercalc/blob/4b5a993a75b6dfff9b779f067c9c3de42c8a7207/src/db.ls
There is an online LiveScript-to-JavaScript converter at http://livescript.net/, handy for testing.
I was getting this problem a lot until I upgraded WebSockets. A crash was being caused by zappajs@0.5.0, which uses ws@0.4.31; the bug is in ws@0.4.31. See https://github.com/audreyt/ethercalc/issues/554
Thanks for the proposed fix -- it looks worth adding. @kentonv, are you able to add it?
> Ah, looking closer, it appears the file format changed recently, and now many different files are written rather than writing a single dump.json.
That's right. It was changed in PR #597 to address issue #414. The new version should auto-migrate a previous installation using dump.json to the new layout on start-up:
https://github.com/audreyt/ethercalc/blob/fc0c1e96881a88ad6f8f2eed0b8af492f71aaee7/src/db.ls#L91-L94
Besides just a change in format (each sheet's data is saved in loose files, in the most natural format), it also has the advantage that saves are optimized to only write the files of sheets that are being edited. With the one monster dump.json strategy, the entire file with all sheet data had to be re-written on a change to any one sheet, so the risk of corruption in the event of a failure was high for all data. With the new layout, corruption via partial writes (e.g. on power loss) is still possible but would be limited to one file/sheet, which is probably acceptable for most deployments!
We could use write-file-atomic, but I would make it optional (via its fsync Boolean option), off by default and with a note in the README on how and when to turn it on. File/disk write caching by the OS is there for a good reason, and forcing a flush to disk on every write could affect throughput and performance, especially since all disk I/O is done synchronously. By making it optional, each installer can make a conscious decision about the balance between performance and reliability that they'd like to strike.
Sandstorm doesn't currently run a version of EtherCalc new enough to have the new file format. Also, note that Sandstorm instances of EtherCalc can only ever contain a single sheet, so the risk hasn't changed in that regard for Sandstorm. (Actually, this is an example of how Sandstorm's granular nature was protecting users from the increased potential for corruption in EtherCalc's old way of saving data.)
I'm a pretty big fan of having options, personally, so that sounds like a solid idea. :) If this gets implemented, we'd just need to make sure the Sandstorm branch had that option enabled, and then get an updated package out.
@atifaziz Would any EtherCalc user really choose to accept a risk of catastrophic data loss in order to get better performance? Keep in mind that the failure mode here is not just losing recent writes, but losing the entire document. It seems unlikely to me that anyone would want that behavior, so it seems like the wrong default. If people feel the performance is not good enough, the right answer is probably to switch to a real database, which can use journaling or other techniques to get better performance without risking data loss.
So I'd argue that fsync should be on by default. But, if not, we'll just have to turn it on for the Sandstorm package...
@atifaziz Are you sure this would impact caching? Does a rename or delete always trigger a flush? In either case, this could be tested to see if it has any speed impact. Every cell update triggers a write, so updating many cells would be a reasonable test. I suggest testing on a non-SSD machine.
@kentonv thoughts on your design... atomic write design, expanded:

```
// write dump file
Write the entire file to a temporary location (e.g. dump.json.next).
Perform an fsync() after all data is written.
rename() the temporary to replace the original file.

// read dump file
IF dump.json is missing and dump.json.next exists
THEN rename() the temporary to replace the original file.
read dump file
```
> IF dump.json is missing and dump.json.next exists THEN rename() the temporary to replace the original file.

This isn't needed. Replacing a file with rename() is atomic, meaning that (on a proper journaling filesystem) either the operation completes in its entirety or it doesn't happen at all. So, there's no case where dump.json will have been deleted but dump.json.next won't have been renamed to replace it.
(If you do see that dump.json.next exists but dump.json does not, I would argue the best thing to do is to error out, so that the admin can go inspect and repair. Since we really don't know what could cause this situation, we can't really say how to recover safely.)
@kentonv you are right -- I did not think that all the way through.
@kentonv would you not need to delete dump.json before the rename, making it two operations, and so the following would be needed?

```
// read dump file
IF dump.json is missing and dump.json.next exists
THEN rename() the temporary to replace the original file.
read dump file
```
@kentonv ignore that -- Node.js's rename() overwrites existing files:
https://stackoverflow.com/questions/21219018/node-js-does-fs-rename-overwrite-file-if-already-exists
also here: https://nodejs.org/api/fs.html#fs_fs_rename_oldpath_newpath_callback
fs.rename() calls the operating system's rename() system call. POSIX requires that this replaces the file atomically. So, there should be no point in time where the old file is deleted but the new one hasn't been renamed yet.

If you chose to delete the old file first before renaming the new one, then I agree you'd need the extra check on read to see if dump.json is missing but dump.json.next exists. But because of POSIX's rules for rename(), you shouldn't need that.
@kentonv I'm happy to merge such a change if you write the code. Is this possible at all?
Sorry but I don't have time to learn how to hack on Ethercalc. :(
I don't know how EtherCalc was coded, so what's the safest way to use it now? I'm currently using it on a local network. Is my data safe from loss?
@enemali I've used EtherCalc for years and have not lost data; the chances of encountering this issue are relatively low. You should, however, regularly back up your EtherCalc data, wherever you currently store it on your machine.
@enemali Use Redis and back up the Redis DB, or back up the data files: dump.json or dump/*.
Dropbox would be an option if you use the data files: just set OPENSHIFT_DATA_DIR to a Dropbox folder and move your dump file(s) there.
This would be better if it were added to the wiki.
A Sandstorm Oasis user reported that an Ethercalc grain spontaneously lost all its data, becoming blank. I was able to restore the user's data from daily backups, but the fact that this happened suggests that there is a bug in Ethercalc.
I note that Ethercalc on Sandstorm writes the entire spreadsheet as a single file, dump.json. I was unable to find the code which writes this file. In order to be safe from data loss, it needs to perform an atomic write, as follows:

1. Write the entire file to a temporary location (e.g. dump.json.next).
2. Perform an fsync() after all data is written.
3. rename() the temporary to replace the original file.

This sequence will ensure that even in case of a power failure, in the worst case, the old data will still be present. Note that when Oasis' grain scheduler needs to kill a grain container, the effect is very similar to a power failure. So, being safe against power failures is necessary to avoid data loss on Oasis.
Additionally, Ethercalc should make sure that if it reads data from disk that it is unable to parse, it reports the error to the user and does not overwrite the data. The user can then download the data (using the "download backup" button in Sandstorm) in order to manually inspect and possibly recover it. In the case we observed, when the Sandstorm user downloaded a backup of their data, dump.json was already empty, indicating Ethercalc had overwritten the data at some point.