Discussion: database recovery & prevention of loss of data

Fuco1 commented 9 years ago

I first used tagsistant a year or so ago and am still intrigued by the idea. However, my greatest fear is that I will lose/corrupt the database and then all the files are pretty much unusable.

Are there some mechanisms in place for this? If I do say daily backups, how would I recover files which I added during the day and which aren't in the backup?

Basically, for this reason alone I did not fully adapt tagsistant... during the time I tested it I managed to corrupt the db twice in some way. I'm still playing with the idea of building an indexer on top of the original hierarchy, so in case of failure I at least have the normal FS structure.

This got me thinking, would it be possible to store only links (symbolic/hard?) in the .tagsistant folder and keep the data on the disc where they were. I tested adding links briefly and it seems to work, how about an automatic mode of operation that would do it by default?

I promote your software where I can because I am strongly convinced tags are the future (I even wrote an emacs extension to work with it), but most people come back with these exact issues/fears. I think for a widespread adoption it must be addressed reliably.

StrumentiResistenti commented 9 years ago

Recovering information after the last backup

No, currently there's no mechanism to prevent data loss in case of database corruption. Are you suggesting the implementation of a write-ahead log to restore the DB contents? This would be pretty simple to do. A companion tool could then be able to scan the log and re-insert the information lost after the last backup, being provided an id to start with or a date-time to start after.

Saving symlinks inside .tagsistant

You actually can already create symlinks in tagsistant, they work the same way as in a normal filesystem. Your files will stay outside tagsistant but you'll still benefit from the tagging/autotagging/reasoning functions provided by tagsistant. Using symlink is also very fast because deduplication happens on the pointed file path, not on its content.

What can't be done is an 'automatic' symlink mode, because the operation semantics mean what they mean. If you expect to create a symlink to /some/file by doing:

$ cp /some/file ~/tagsistant/store/tag/@

remember that tagsistant will receive an open() or a mknod(), then several write()s and one release(). None of them bears the path of the source file, making impossible to replace this sequence with one symlink() call.

Fuco1 commented 9 years ago

Re recovering: exactly. After I left I immediately got the idea of having a journal and some simple recovery utility to just re-run the operations on the old database.

Re symlinks: right. So that means instead of cp I simply use ln everywhere and it's fine? That would be awesome :)

I would very much like to maybe help out with some tasks, if you have a personal log of things to be done, consider opening issues. I might pick something reasonably simple for start. It's been some time since I've done C, but it's just like bicycling :D

StrumentiResistenti commented 9 years ago

I've built a basic WAL support into sql.c. Basically the file wal is created inside the repository. Each line is formatted as:

TTTTTTTTTT-YYYY-MM-DD-hh-mm-ss: "SQL statement"

where the first field is the timestamp since the epoch, then follows the date and after the colon is logged the SQL statement to be re-issued to update the DB. I haven't written a tool that does this yet, but it is quite easy to provide one that:

split each line by ':' and the first token by '-' to extract the timestamp and the SQL query
discard the line if previous to the provided timestamp
issues the SQL statement

This is so easy to write that could even be implemented as a shell script. I'll try to do it in the next days.

Fuco1 commented 9 years ago

Awesome!

I would prefer something like python over bash as it is more readable (string handling in bash is a bit painful). If you want, I can write the script/program myself.

It can also have an additional feature of splitting the "old" portion of the log and gzipping it for archiving, so that the wal itself doesn't grow to be massive. Although this would probably be better suited to be included in tagsistant itself.

StrumentiResistenti commented 9 years ago

OK, I couldn't resist... this is a basic (read: ugly) importer for an SQLite backed repository:

#!/usr/bin/perl

use strict;
use warnings;

my $repo = shift() || die "No repository provided!\n";
my $tstamp = shift() || 0;
my $wal = "$repo/wal";
my $db = "$repo/tags.sql";

unless (-f $wal) {
    die "Can't open $wal\n";
}

unless (open(WAL, "$wal")) {
    die "Can't open $wal\n";
}

while (<WAL>) {
    chomp;
    my ($stamps, $statement) = split /: /, $_;
    my ($stamp, $year, $month, $day, $hour, $minute, $second) = split /-/, $stamps;

    if ($stamp > $tstamp) {
        print "issuing statement: $statement\n";
        system qq(echo $statement | /usr/bin/sqlite3 $db);
    }
}

close(WAL);

I'm fluent in Perl5, so here it is. If you want to provide your own in Python, feel free. We need MySQL import too, so a DBI based solution would probably be the best one. Moreover, the log can be filtered by timestamp only, while a date to timestamp conversion function would be comfortable. May be integrating all this into Tagsistant as you suggest would be better?

Fuco1 commented 9 years ago

You could maybe store the timestamp of last operation in the database directly, and then if I restore DB from backup, tagsistant would on start check the last timestamp in the log against the one in DB. If DB has older TS, it would ask the user if they want to re-run the log.

StrumentiResistenti commented 9 years ago

I've created the status table with two columns: 'state' (the key) and 'value'. The key 'wal_timestamp' is saved on exit to store the last timstamp included in the database.

I've splitted the WAL into separate files, named after the first contained timestamp. Files are saved inside the wal subdir of the repository. I've also removed the epoch based timestamp at the beginning of each line, because it is useless in this new setup.

I've not yet implemented the WAL restore logic at the beginning of the session (before Tagsistant is mounted), but I plan to do it in the next days.

Fuco1 commented 9 years ago

Hm, what you mean by "on exit". What if the program doesn't close in an expected way? I might run tagistant for 100 days without a restart, then that would mean re-running 100 days of operations? I think we really need to store the very last successful operation, or make snapshots in a known state.

What if I try to re-run operations from WAL which are already present in the database? I don't understand how you can get rid of the operation timestamps, unless you always restore in those batches and for each one you have a snapshot on which to apply the changes.

In the original idea it was the user who made the backups according to their own schedule, but now this might create incompatible combinations (e.g. I have a backup with one WAL file partially applied)

StrumentiResistenti commented 9 years ago

OK, I see your point. Seems like I've modelled the timestamp feature on my use habit. I usually unmount Tagsistant when I shutdown my computer. But what if Tagsistant is used on a network server? The same applies for the timestamp: in my use case when Tagsistant exists no more queries can be saved in the WAL and so the date (up to the second) is enough to pick the queries to be run again. But what if Tagsistant is continuously operating, like on a server, when the backup is done?

I'll restore the timestamps and I'll save the 'last used timestamp' right after a query is saved in the WAL.

StrumentiResistenti commented 9 years ago

I've recoded the WAL with timestamps and implemented WAL syncing at startup. Could you please review it?

Fuco1 commented 9 years ago

I looked at the code now and it looks good. One thing I don't understand is this: it seems that you create a new wal file for each timestamp? That seems overly exessive. I assumed it would be one file per day or something similar, so you can quickly scan the directory and discard all the irrelevant logs. Having new file for each query is an overkill imo, and just imagine how many files there will be after a while---unix can't handle more than 5-10k in one directory very well (but I guess you know that, seeing the store structure of .tagsistant).

Another feature, but this can wait for later, would be to not store the entire query but encode it somehow, because there's going to be a lot of duplication. But maybe just a simple gzip is enough, no need to invent some mad complicated schemes.

StrumentiResistenti commented 9 years ago

Actually it's one file per mount. The timestamp you see is the one of the very first query after Tagsistant has been mounted. So if you don't periodically unmount Tagsistant, you'll end up with just one big file. The idea of rotating WAL files once a day looks good. I'll implement it in the next days.

As a consequence, after rotating yesterday's WAL, Tagsistant could compress it too. And it could compress on unmount the last opened WAL.

Fuco1 commented 9 years ago

Hm, but it calls the g_date_time_new_now_local every time in tagsistant_get_timestamp (which is called from tagsistant_wal) Doesn't that return the current time each time? Or is the time cached somehow? I don't know gnome libs very well.

StrumentiResistenti commented 9 years ago

tagsistant_get_timestamp returns the current timestamp, right. But its result is used to open a new WAL file only if the fd filedescriptor is -1. Otherwise the previously opened filedescriptor is reused (it's declared static at the beginning of tagsistant_wal). The timestamp is then used to prefix each query inside the file. So if you create tag t1 at 2015-07-25-13-40-10-1437824410 and then create tag t2 at 2015-07-25-13-40-43-1437824443, Tagsistant will record a file called .tagsistant/WAL/2015-07-25-13-40-10-1437824410 with the following contents:

2015-07-25-13-40-10-1437824410: insert into tags(tagname, `key`, value) values ('t1', '', '')
2015-07-25-13-40-43-1437824443: insert into tags(tagname, `key`, value) values ('t2', '', '')

Fuco1 commented 9 years ago

Gosh of course! I totally missed that :/. Ok, great :)

StrumentiResistenti / Tagsistant

Discussion: database recovery & prevention of loss of data #7