elves / elvish

Powerful scripting language & versatile interactive shell
https://elv.sh/
BSD 2-Clause "Simplified" License
5.52k stars · 296 forks

Remove the daemon by using a pure-Go port of SQLite #1222

Open xiaq opened 3 years ago

xiaq commented 3 years ago

In #1216 @krader1961 suggested using this sqlite package instead of boltdb so that the database can be accessed from multiple processes without needing a daemon.

The package is quite mysterious, though: the SQLite project is implemented in C, and that package appears to contain Go sources translated from SQLite's original C source, but I couldn't find that translation process documented anywhere.

xiaq commented 3 years ago

I filed an issue: https://gitlab.com/cznic/sqlite/-/issues/41

It's also unclear how mature the project is. In an issue from 4 months ago (https://gitlab.com/cznic/sqlite/-/issues/27) the author said that the priority is "completeness and correctness", meaning that there are likely bugs and missing features.

krader1961 commented 3 years ago

The project update on 2020-07-26 (v1.4.0-beta1) reports:

630 errors out of 200177 tests on Linux 64-bit little-endian
Unfreed memory: 698816 bytes in 322 allocations

There are two questions:

1) What is the current state of those test failures and memory leaks five months later?

2) Are the unresolved issues likely to be problematic for Elvish or are they in niche areas of the code that Elvish wouldn't exercise?

Of the open issues the only one that would be a problem for Elvish is "install issue on windows 10 with go 1.15" and it looks like there is someone actively working on making that package work on Windows 10.

Note: I do not care if a Go port of SQLite3 is used. I'd be equally happy to see a key/value DB written from scratch in idiomatic Go, provided it supported access to the DB by multiple processes so the daemon could be eliminated. I say that because I have also experienced problems starting the daemon, even with Elvish built from the master branch. Specifically, I initialize a bunch of tmux sessions in parallel to different machines, which frequently results in daemon initialization failures. I had to resort to adding a one-second sleep after launching the first Elvish shell on each machine to eliminate those daemon failures.

Sadly, the searches I've performed did not yield any promising alternatives to BoltDB. For example, it appears that all the key-value stores documented at https://github.com/gostor/awesome-go-storage only allow a single process to access the database. Which is why I'm still inclined to propose using a pair of text files (for commands and location-mode data respectively) and using the algorithm from bash, fish, or another shell for accessing and mutating the contents of each file.

krader1961 commented 3 years ago

See also issue #596 about the BoltDB database being corrupted. Switching to SQLite3, or to a DB specifically designed for the needs of an interactive shell, wouldn't avoid that problem when the DB is on a shared (i.e., network) filesystem like NFS. But a bash/fish-like history mechanism would probably make it easier to deal with the conflict without a panic.

krader1961 commented 3 years ago

See also issue #126 which asks if a "pure elvish" interactive history would be preferable to an opaque binary blob that depends on a complex dependency that mandates an Elvish daemon because it only allows one process at a time to access the database. In this context a "pure elvish" database is one that is a sequence of items representing each command or location change. Each item should be in a well defined structured format such as JSON or Google protobuf text format.

krader1961 commented 3 years ago

FWIW, the pure Go sqlite was recently updated to officially support Darwin (macOS). While it doesn't officially support other BSD flavors (e.g., OpenBSD) that is unlikely to be a problem given the support for Linux and Darwin. That leaves Windows as the only platform we care about that is not officially supported and might have problems.

krader1961 commented 3 years ago

Another reason to eliminate the daemon: its test coverage is abysmal. For example, pkg/shell/runtime.go currently has 41.7% coverage. AFAICT, the only reason it isn't 100% is the support for an Elvish daemon that exists solely to provide serialized, atomic changes to the two interactive histories: command and location.

krader1961 commented 2 years ago

The pure Go sqlite project now has support for MS Windows; albeit with one unresolved issue. From the main project page as of this date:

The windows/amd64 has currently experimental/preview status. Tcl tests report an unresolved yet memory leak, see https://gitlab.com/cznic/sqlite/-/issues/23 for more details.

It should now be possible to replace the dependency on the BoltDB project with the SQLite project, thus making it possible to eliminate the Elvish daemon. The first change should retain the daemon but have it use the SQLite DB if it exists, possibly creating it from the content of the BoltDB data store if the latter exists. A subsequent change would then move that functionality from the daemon into each Elvish process so that starting a daemon process is no longer necessary. Obviously there are many details I haven't addressed, such as what, if anything, to do about pre-SQLite shell instances that are still talking to a daemon that updates the, by then obsolete, BoltDB data store.

krader1961 commented 2 years ago

WooT! The pure Go sqlite project now officially supports every platform we care about, AFAICT. From that project's change log:

2021-11-13 v1.14.0:

Support windows/amd64. This target had previously only experimental status because of a now resolved memory leak.

It's time to start hashing out the details of transitioning from the current daemon+BoltDB architecture to a daemon-less Sqlite architecture. Assuming, of course, we don't decide to implement a solution for issue #126 similar to how other shells (both old, like bash, and modern, like fish) handle their command histories without requiring a database like BoltDB or Sqlite. Switching to a pure Go implementation of Sqlite is going to be the easiest path forward. It does, however, simply replace one large third-party dependency with a different one. Whereas a "pure Elvish" solution would eliminate the current BoltDB dependency and replace it with code we have total control of.

krader1961 commented 2 years ago

FWIW, recently announced changes at https://pkg.go.dev/modernc.org/sqlite suggest it is now a viable replacement for the BoltDB that Elvish currently uses and would make it possible to eliminate the Elvish daemon. So, per my previous comment, there probably still needs to be a discussion about using a simpler file format, such as those used by older shells (e.g., ksh, bash, fish), versus a flexible database engine.

Using a database engine like SQLite3 (or BoltDB) is obviously attractive since it outsources many problems to a separate package we don't have to maintain. However, it's also overkill with regard to the needs of a shell like Elvish. At the end of the day my primary concern is eliminating the need for a daemon to manage interactive history. I don't really care whether it is something like the "CGo-free port of the C SQLite3 library" or a file format customized to the needs of Elvish (similar to those of older shells). Using the CGo-free sqlite3 package is the easy solution but will make the elvish binary larger than writing code that manipulates a simpler file format. On the other hand, we'd be outsourcing a lot of complexity to a robust package, which may easily justify the increase in the size of the elvish binary.

krader1961 commented 2 years ago

@xiaq, I'm curious what you currently think about the trade-offs between using a third-party DB that is overkill (in the sense that much of its functionality won't be used by Elvish) versus something resembling the history file formats used by legacy shells like Ksh and Bash, and even more modern shells like Fish. We certainly don't want to use the ksh/bash style history file format. The Fish history file format is better since it is (in theory) extensible, but I'd prefer JSON or text protobufs rather than YAML. The main thing is that a Fish-style history file doesn't require a daemon, increases the size of the Elvish binary by a fraction of what BoltDB or SQLite does, and results in a history file that is relatively human friendly. Also, if we use JSON to encode entries then Elvish script can manipulate the history.

krader1961 commented 2 years ago

FYI, I just built Elvish with an import of the modernc.org/sqlite package (and keeping the BoltDB package import). On my macOS ARM64 system that increased the size of the elvish binary from 9.3MB to 15MB; a 61% increase in size. That is going to be hard to justify but may be worth it to eliminate the Elvish daemon. I am still of the opinion that we should switch to a pure text database file as discussed in issue #126. Other shells, including modern shells like Fish, manage to use simple text database files. Yes, those shells only support an interactive command database. But it isn't hard to generalize the mechanism used by Fish, Bash, etc. to also support directory history and shared vars.

Update: I built ./cmd/nodaemon/elvish, which shows that the daemon and BoltDB add ~1.2MB to the size of the binary. So switching to sqlite would actually increase the size by just 4.5MB (5.7 - 1.2). That is still almost certainly an unacceptable increase in size.

krader1961 commented 2 years ago

Also, using a simple text file database would make the u-root project happy by decreasing the size of the Elvish binary.

prologic commented 1 year ago

I would also prefer a simple text file too 👌

brunoroque06 commented 1 year ago

There is also this option as an alternative to BoltDB: nutsdb.

I think the question of DB vs. text file has a lot to do with how one writes/updates the history, i.e., the data model used. I will ignore querying, as I think it can safely be done in memory (can it?).

Example of 2 models:

krader1961 commented 1 year ago

@brunoroque06: I don't see the point in switching to nutsdb since it doesn't support multiple processes simultaneously accessing its database. Thus, we'd still need the Elvish daemon. There are well established implementations of simple linear history files that can be updated by multiple shell instances. The one that is probably easiest to adapt to Elvish is the one used by the Fish shell.

krader1961 commented 1 year ago

There is one notable difference between the Elvish and Fish shell command history model. The Fish shell history does not assign a unique sequence number to each command. The Fish history delete command takes one or more arguments that match the text in the command history. If the --exact option is used it simply deletes matching commands. Otherwise it displays the matching commands and requires the user to select which instances they want deleted. Which behaves as if the user used history delete --exact with the full text of the selected command. Whereas the Elvish store:del-cmd command takes a unique sequence number. I think the Fish behavior is preferable because it is more user friendly.

Also, the Fish command history always elides duplicate entries whereas Elvish retains duplicates. This is another area where I think the Fish model is preferable. See issue #1051 where I had to explicitly add an edit:command-history &dedup option to get the behavior most users expect and want. If the Elvish command history implicitly deduplicated the results that option could be removed.

See https://fishshell.com/docs/current/cmds/history.html for more information about the Fish approach to command history.

prologic commented 1 year ago

FWIW shell history for me is just a file in $HOME/.history. I use it when I hit ^R and type some starting prefixes. That's it. I almost never touch the file, I almost never prune it and I never query it beyond what ^R does in most shells.

brunoroque06 commented 1 year ago

@krader1961 I think nutsdb supports multiple processes simultaneously: https://github.com/nutsdb/nutsdb/issues/100 . Yesterday I only saw that it has transactions, so I assumed it supports concurrent access. I have not tested it yet.

I agree that it is better to have an id for each history entry. But then, what about equal commands? Should they be duplicated, and have 2 distinct ids?

Just checked both fish's and nushell's history files. fish's model has cmd, when, paths. The fish history might not display duplicates, as you described, but the file has duplicates. nushell has cmd only. So both approaches only append entries to the file; only deletes modify existing entries.
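For reference, entries in fish's history file (~/.local/share/fish/fish_history) use a YAML-like layout with cmd, when, and an optional paths key; the commands and timestamps below are illustrative:

```
- cmd: git status
  when: 1614209610
- cmd: vim Makefile
  when: 1614209615
  paths:
    - Makefile
```

The paths key records arguments that were valid filesystem paths when the command ran, which fish uses to down-rank suggestions whose paths no longer exist.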

krader1961 commented 1 year ago

I think nutsdb supports multiple processes simultaneously: https://github.com/nutsdb/nutsdb/issues/100 . Yesterday I only saw that it has transactions, so I assumed it supports concurrent access. I have not tested it yet.

Every other package like nutsdb I've looked at does not support concurrent access by multiple processes. I downloaded the source and the word "concurrent" doesn't appear anywhere. Usually the reason these types of key/value store packages support transactions is so that a single process can support concurrently executed requests from clients via RPC. So, yes, it undoubtedly does support concurrent access in that sense. I would be surprised if it supports multiple processes operating on the same DB file simultaneously. In any case, I don't think it's worth switching to another key/value store with a complicated API just to eliminate the daemon.

The fish history might not display duplicates like you described, but the file has duplicates.

The duplicates are eliminated when the shell "compactifies" the stored history (which it does periodically and when the shell exits). Its internal API also elides duplicates when scanning the history to account for the fact that multiple fish processes can append identical commands to the history between compactions.

prologic commented 1 year ago

Why can't we just have a simple file? 🤔

krader1961 commented 1 year ago

Why can't we just have a simple file?

We can. It's just that no one has invested the time to design an implementation and write the code. Not to mention defining a migration plan from the current BoltDB format to a flat file. As a starting point I propose converting the relevant Fish shell history code (written in C++) to Go. The Fish shell history code is robust and has been in use for a very long time. So using its implementation as a template means we don't have to design a solution and write code from scratch. And rather than YAML (used by Fish) we would, obviously, encode the entries using JSON. Also, I don't see any reason two history files would be needed (one for commands and one for location data). It should be possible to store both in a single file.

prologic commented 1 year ago

I might fork Elvish then and try to do this. I have enough vested interest in doing so as I'm trying to use a half-decent shell written in Go for my GoNix project, but I really don't want or need extra complexity like a database or a daemon just for history. Make sense? 🙏

krader1961 commented 1 year ago

@prologic: Note that shared variables depend on the current database mechanism. Even if you ignore that feature you still need to deal with location history, not just command history. The change you propose to implement is extremely complex if you want to have your change(s) merged upstream. If you don't care about merging your changes upstream you will be locked into an increasingly out-of-date Elvish binary, since the Elvish source is undergoing rapid changes that will make it increasingly difficult for you to perform a git rebase master that retains your local changes.

The Elvish community needs to decide whether the "shared var" mechanism should be retained. Eliminating the shared var mechanism will make it orders of magnitude easier to replace the current BoltDB based data store with a simple linear text file.

krader1961 commented 1 year ago

The Elvish community needs to decide whether the "shared var" mechanism should be retained. Eliminating the shared var mechanism will make it orders of magnitude easier to replace the current BoltDB based data store with a simple linear text file.

It occurred to me that shared vars can be implemented using the filesystem namespace: create one file per var under ~/.local/state/elvish/shared-vars, with the var name URL (i.e., percent) encoded to form the file name and the associated value as the file content. Note that this approach doesn't even require the store module to work. It only requires a couple of str module functions to encode/decode percent-encoded strings, and the existing store module functions can be trivially implemented in terms of that mechanism.

If we do this it might be worthwhile to augment the mechanism to allow non-string Elvish types to be stored using either JSON or Google protobuf encoding of the value (the current shared var implementation only allows strings as values). On the other hand, one could argue that encoding/decoding non-string values is the responsibility of the user, via functions like repr, and that non-string values have even less utility than the shared var mechanism itself (which seems to be seldom, if ever, used).

krader1961 commented 1 year ago

Per discussion on the IM channels the project owner agrees that support for shared vars can be dropped. So I'll create a pull-request to do so. That simplifies the problem of replacing BoltDB with a flat-file solution; such as used by Bash and Fish. I still recommend using the Fish implementation (written in C++) as a template for a Go implementation suitable for Elvish. This will necessitate one significant change: Eliminating persistent command sequence numbers. This means eliminating the store:next-cmd-seq function (which is unused AFAICT) and changing how store:del-cmd works (also unused AFAICT) to take a command string rather than a sequence number.

saolof commented 8 months ago

Independently of using sqlite as a history file: if the pure-go sqlite is included as a dependency then it can do double duty as a backend for a builtin sqlite module.

To see how other alternative shells do this, nushell has a really nice sqlite integration that effectively makes sqlite databases human readable, though largely because nu is table oriented to begin with: https://www.nushell.sh/commands/docs/query_db.html https://www.nushell.sh/commands/docs/into_sqlite.html

krader1961 commented 5 months ago

Independently of using sqlite as a history file: if the pure-go sqlite is included as a dependency then it can do double duty as a backend for a builtin sqlite module.

That would be interesting but would greatly increase the size of the Elvish binary. I don't think the size increase can be justified and such functionality is arguably far outside the scope of a shell like Elvish. I'm still firmly of the opinion that replacing the current BoltDB storage with a more traditional flat-file format for shell history is the way forward.

krader1961 commented 4 months ago

A few days ago I started working on replacing the Elvish daemon, and dependency on BoltDB, with two flat files using JSON encoding of the command and directory history. This is similar to how most popular shells handle interactive history. While shells like Ksh, Bash, Zsh and Fish don't use JSON to encode the history, conceptually it's the same idea. They use a flat file (i.e., not something like BoltDB or MySQL) that is appended to whenever a command is executed. Or, in the case of Elvish, a directory change occurs. I've got the basic plumbing working.

The flat files, even using JSON maps (which adds a lot of overhead compared to simple JSON arrays), are ~3x smaller than the BoltDB file. Also, the new history file format makes it trivial to include additional information in a backward compatible manner. Such as the time the command was run, when the directory change occurred, or the CWD when the command was run.

I still have a lot of work to do. For example, correctly handling symlinks to the files and writing unit tests. I hope to have a usable implementation (albeit without unit tests) for people to test drive in a few weeks.

The main problem I haven't resolved to my satisfaction is how to transition from BoltDB to flat files. I converted my interactive history using a trivial Elvish script that leverages the store module. My current thinking is to provide a program independent from the elvish program to handle the conversion. That allows elvish to be built without support for the daemon and BoltDB code immediately. It also means users who do not upgrade their Elvish version within whatever deprecation window we provide for the current BoltDB based history can still convert their interactive history for the foreseeable future.

We could also have the elvish program automatically convert the history. But at some point, such as six months after changing the history storage mechanism when the legacy support is removed, that would no longer be an option. Meaning users who didn't upgrade within that transition window would lose their interactive history. Which is why I favor a clean break utilizing a separate tool to migrate the history. Using that tool requires a bit more effort by an Elvish user but also provides more benefit and a longer transition window. The elvish program itself can tell users what to do to migrate their interactive history if it finds a legacy BoltDB history file even though it can no longer read/write those files.

prologic commented 4 months ago

Nice! Looking forward to this landing 🛬

xiaq commented 4 months ago

@krader1961 Re how to transition, I'd say ideally both: the first release after the transition should ship with the upgrade functionality built-in (which can get removed in the next release), and a standalone program for users who skipped that version.

I'd be interested in seeing your implementation. One big advantage of using a production-grade database (whether SQLite or BoltDB) was that I never had to worry about data integrity (in case of concurrent writes and loss of power) and performance; getting these aspects right can impact the implementation quite a lot.

If/when you are interested in submitting a pull request, I'd suggest keeping the scope of your change as small as possible. This is a pretty significant change, and it's easier for me to merge a small (even if incomplete) implementation and work from there than it is to merge a big change and have to revisit a lot of the implementation decisions.

krader1961 commented 4 months ago

I'd suggest keeping the scope of your change as small as possible.

Understood, @xiaq. As a professional SWE I've done thousands of code reviews and there is little I dislike more than a huge change that could be easily broken up into a sequence of smaller changes. For example, I'd like to change the public store API and the internal API since sequence numbers aren't necessary and add complexity to the flat-file store. But my first change will stick to the existing external/internal store API.

krader1961 commented 4 months ago

I'd be interested in seeing your implementation. One big advantage of using a production-grade database (whether SQLite or BoltDB) was that I never had to worry about data integrity (in case of concurrent writes and loss of power) and performance; getting these aspects right can impact the implementation quite a lot.

In my experience few databases correctly handle corner cases such as loss of power, and those that do always pay for it with decreased performance and greatly increased API complexity (often due to mechanisms such as two-phase commit). Also, a shell like Elvish doesn't need the guarantees provided by databases that correctly handle those corner cases. A flat-file database works well enough for shells like Bash, Ksh, Zsh, and Fish by relying on the OS to correctly handle writes from multiple processes that have opened a file in O_APPEND mode, and by forcing each change to be synced to persistent storage.

In over four decades of using shells on Unix systems I have personally never seen any of the legacy shells that use a flat-file database for their interactive history end up with a corrupted history file. Neither can I recall reading about someone dealing with a corrupted interactive command history file from those shells. I have no doubt there have been instances where a command was executed but not recorded to the flat-file database because an event like a loss of power occurred. But that doesn't justify the overhead of a database such as DB2 or Oracle to record shell history.

krader1961 commented 4 months ago

Re how to transition, I'd say ideally both: the first release after the transition should ship with the upgrade functionality built-in (which can get removed in the next release), and a standalone program for users who skipped that version.

After sleeping on it I've concluded a hybrid approach is optimal. Since a standalone program will eventually be needed it should be used by the first official release of this change. That is, the first official release of this change in history format should include, and implicitly run, a program that converts interactive history to the new format. There shouldn't be any need for a user to explicitly run that program now or in the future. The elvish binary will automatically run it if a flat-file database doesn't exist but a legacy BoltDB database does exist. This means we can eliminate the overhead of the daemon and BoltDB code in the first release that incorporates this enhancement. It also means we can retain the transition code for a long time (e.g., over a year) since the cost (size and execution overhead) of that transition code in the elvish binary is negligible. The only cost is the history migration program and its dependency on the BoltDB project. Which is something we can accept for several years.

krader1961 commented 3 months ago

I have two commits that implement the core change: automatically converting the legacy BoltDB store to a JSON flat-file store, and performing a reasonable set of tests on the code manipulating the new history store format. The first change introduces the new JSON flat-file store, suitable for someone using Elvish for the first time. The second change adds an automatic migration from BoltDB to the new flat-file format. I still need to deal with details such as integrating the build and packaging of the new elvish-history-migration program into how the Elvish packages are constructed. But my local build and testing suggest the transition from BoltDB to a flat file is within reach.

krader1961 commented 3 months ago

I have modified the Elvish script I use to build and test Elvish on all of my VMs to build and install the new history migration program in addition to the Elvish binary. It has successfully migrated the BoltDB managed history to the new flat file format on all of the VMs and my non-VM systems; which includes several flavors of Linux, macOS, FreeBSD, and Windows.

I now have three commits:

1) Implement the new flat-file persistent store.
2) Implement a program, and automatically run it, to migrate the BoltDB persistent store to the new format.
3) Remove the code that implements the daemon.

The resulting elvish binary, on my primary macOS system (Apple Silicon aka ARM64), is 16% smaller (it went from 11183938 to 9415026 bytes). That decrease will shrink slightly when I implement the final change, which compacts the persistent store, but I expect that to add only on the order of 1%. Overall test coverage is unchanged at 94.9% for both builds.

P.S., I'm not entirely happy with how the history score for directories is handled. In both the legacy BoltDB format and the new flat-file format the raw float64 is stored. Since every platform Elvish currently supports uses IEEE 754 floats that shouldn't be a problem. Still, it results in values like 28.897503184159998 being captured in the persistent history and hardcoded in unit tests. I'm worried that quirks in a platform's floating-point implementation (especially on platforms that may not implement floating-point ops in hardware) may result in slightly different values that could break the unit tests. It also includes a lot of precision that is probably unwarranted. TBD is whether a different representation of the "score" is preferable.

krader1961 commented 2 months ago

I've implemented a fourth commit that "compacts" the history flat-file under certain conditions; e.g., when the number of deleted commands or directories exceeds some threshold. This requires deleting the current history flat-file and creating a new one. I've been using it for a few days and have been trying to break it. So far it's working fine. Which is to say, when one Elvish instance compacts the history the other instances correctly switch to the new file when recording their history and will correctly reload the new history store when edit:history:fast-forward is executed. I can even manually replace the history file (e.g., rm ~/.local/state/elvish/elv.hist; touch ~/.local/state/elvish/elv.hist) and the existing Elvish instances correctly switch to the new (empty) history file. I need to write some unit tests for the new code.

I haven't tested the compacting behavior on Windows. I expect it will fail since it requires renaming/deleting the history file which might be in use by other Elvish processes (the non-compacting unit tests all pass); the Windows file sharing model does not allow renaming/deleting files opened for modification by another process. TBD is what to do on Windows in that case. The simplest "solution" at this juncture is to disable compacting the history store on Windows, but I hope to find a better one.

krader1961 commented 2 months ago

I haven't tested the compacting behavior on Windows. I expect it will fail since it requires renaming/deleting the history file which might be in use by other Elvish processes (the non-compacting unit tests all pass).

Sigh, yes, rewriting the flat-file store fails on Windows because the current history file is in use by another Elvish process. This is from a debug print I added to my change:

WTF rewriteHist() rename "elv.hist.tmp" => "elv.hist": rename elv.hist.tmp elv.hist: The process cannot access the file because it is being used by another process.

That rename succeeds on all of my Unix systems but fails on Windows due to its file access model, which prohibits some operations if a file is in use by other processes. It looks like there will have to be some mechanism for communicating to all Elvish processes that they should close their handle on the current history file and reopen its replacement. But that raises the question of what to do if the Elvish history file is opened by a non-Elvish process that doesn't implement the Elvish history file protocol. Since that is only an issue on non-Unix platforms (e.g., Windows) I am currently inclined to just ignore the problem other than ensuring Elvish processes behave reasonably.