SUPERCILEX / clipboard-history

Ringboard—the clipboard manager for Linux
Apache License 2.0
143 stars 5 forks source link

Development plan and design decisions #3

Closed SUPERCILEX closed 1 month ago

SUPERCILEX commented 8 months ago

Development plan

Phase 1

Put everything in one bigass crate and commit shitty code. The goal is to get the core functionality 100% working. Develop with a normal cosmic iced app (or just a truly normal iced app if it turns out cosmic apps require the DE to be running).

Phase 2

De-shittify the code, implement the fastest algorithms, cleanup, etc.

Phase 3

Split the core algorithms and data structures into their own crate. There will probably be three crates: core, applet, and app. We can probably call them Gerono because the clipboard is like an hourglass in the sense that all apps will (implicitly) write to it and then any app can get data out. The clipboard is a pinch point.

Phase 4

Final polish, figure out distribution, how to ship, etc.

Design decisions


Stats from my clipboard:

Stats: Stats {
    raw: RawStats {
        num_entries: 5001,
        total_str_bytes: 3701123,
        ops: OpCountStats {
            num_save_texts: 5130,
            num_deletes: 129,
            num_favorites: 1,
            num_unfavorites: 0,
            num_moves: 25,
        },
    },
    computed: ComputedStats {
        mean_entry_length: 721.4664717348928,
        median_entry_length: 18,
        min_entry_length: 1,
        max_entry_length: 585304,
    },
}

stats

import matplotlib.pyplot as plt
import numpy as np

def generate_histogram(file_path):
    # Read data from the file
    with open(file_path, 'r') as file:
        numbers = [float(line.strip()) for line in file]

    # Calculate the minimum and maximum powers of 2
    min_power = int(np.floor(np.log2(min(numbers))))
    max_power = int(np.ceil(np.log2(max(numbers))))

    # Create a histogram with bins as powers of 2
    bins = 2 ** np.arange(min_power, max_power + 1)
    hist = plt.hist(numbers, bins=bins, color='blue', edgecolor='black')

    # Add labels and title
    plt.xscale('log', base=2)
    plt.xlabel('Value (log base 2)')
    plt.ylabel('Frequency')
    plt.title('Histogram of entry lengths')

    # Modify x-axis labels to show bucket ranges
    bin_ranges = [f'2^{int(np.log2(bins[i]))}' for i in range(len(bins))]
    plt.xticks(bins, bin_ranges, rotation='vertical')

    # Add space at the bottom of the plot
    plt.subplots_adjust(bottom=0.15)

    # Display the histogram
    plt.savefig("stats.png", dpi=300)

file_path = 'tmp.txt'
generate_histogram(file_path)
SUPERCILEX commented 8 months ago

Prior art:


This is kind of exciting, I think there's an opportunity to provide a world-class clipboard manager that beats everything else out there in terms of performance and core functionality (as long as we support arbitrary mime types).

Long term, we could support running arbitrary commands on copy, multiple front-ends, and other fancy features.

SUPERCILEX commented 7 months ago

Oooooh, I just had a very interesting idea: instead of combining metadata and data in the same file (which means mmap doesn't help because you still have to stream through the entire file), what if you split data and metadata into their own files? Then metadata contains fat pointers to the data file which can be mmaped. Similarly if we make metadata enums take precisely the same amount of space, then we can also mmap metadata and do simple offset calculations to transmute some bytes into an enum. This makes compaction much more efficient as we don't necessarily have to move any data around.

In essences, this means we're writing a memory allocator because the standard optimizations apply. Instead of compaction, you could keep a free list. Instead of everything in one file, you could keep fixed block sizes around. I think I like this because it should make sharding the log files much faster and basically unnecessary for at least the first few years of people using the clipboard as an 8MiB file can hold ~250K 32 byte entries which would take ~7 years to fill up assuming you copied 100 things a day.

We have two constraints:

This leads to the following solution based on my clipboard data:

SUPERCILEX commented 7 months ago

mmap + append sounds like a big PITA sadly. Might be doable with mmaping more than is in the file or using ftruncate/fallocate. Though maybe we could just avoid ever remapping?

Also probs would need a separate metadata file for favorites if we're not going to read the entire log (which should probably be the goal as this gives us O(1) startup). Though maybe that's not worth it and we should just allocate a huge vec on startup? I don't like the fact that it forces us to hold that stuff in memory rather than letting the OS decide when to load disk pages in.

The deal with deletions is that they're implicitly continuous. If you've reached the maximum number of items/clipboard size, then every entry addition deletes an entry. Since it'll delete the oldest one, that's kind of our worst case scenario for the log because we have to shift everything back. So maybe we should actually bounce back and forth between two logs? Yeah actually I think this is the right solution. If a user says "I want my maxnum items to be 1M", then `lim(t -> infinity) num_items = 1_000_000`. Wait a sec, why not a ring buffer? The reason I couldn't do that before is because data was stored with metadata, but if every entry is the same size, then I can make the log a ring buffer!!! Fuckin genius.

Ok, so the idea would be that you write to a log file until you hit the max num elems and then just wrap and smash whatever's in the back. We'll need a header which notes the write head? I don't think we ever actually advance the back pointer explicitly?

Ops:


Random thoughts on the client/server stuff: it would be really cool to have this be the defacto clipboard manager for linux as this design is shaping up to be world class. I think we can have the server create a delete-on-close named pipe for reading commands. Clients open their own named pipes (as it looks like there's no good way to send file descriptors around, see https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html) and send the server a message with the location of the file descriptor. It's then the server's job to broadcast messages like "new clipboard item" and reply to specific pipes for individual requests like "give me this item". The problem of course with client-server architectures is that it's super inefficient for clients to read data. Here's a stupid idea: what if I just run the client inside the server process? So maybe the server is a main loop with a switch between different GUIs and when you shut down a GUI it just waits to be told to open the next one? Doesn't really work for terminals where you need the server to run outside the terminal process. So maybe you keep the named pipes idea and just have clients do an initial read-only load of the data from the log file and then everything else it gets is replies from the server that say the data is located here, go look it up yourself which would mean mmapping all the data files. And I guess the server needs to notify clients when it should remap files because they grew. Could work and be reasonable efficient. Also for the applet is might actually make sense to not bother receiving data from the server and just do a load from the log each time it is opened. The theory there being that you can't copy stuff while the applet is open (so we don't need to subscribe to changes) and while the applet is closed we don't want it wasting cycles doing GUI stuff anyway. Any user action could be reflected in the UI as a lie that's immediately processed but asynchronously handled by the server. Honestly this seems like a reasonable strategy for all the GUIs? Like do we really care if you're watching the GUI as you copy stuff? Actually yeah maybe we do. So it could be a hybrid thing where you close your pipe when the user isn't looking and then open it to subscribe to events while the UI is open.

SUPERCILEX commented 7 months ago

Actually unix domain sockets are precisely designed for the client/server stuff I want to do and have the same performance according to some random benchmark, so just use that instead of my half-assed multiple pipe nonsense: https://man7.org/linux/man-pages/man7/unix.7.html. Still doesn't solve the problem of not wanting to send data over the channel.

SUPERCILEX commented 7 months ago

Ok so there's also sysv: https://man7.org/linux/man-pages/man7/sysvipc.7.html. Basically it sounds like I can do anything I want: move FDs around processes, shared memory between them, and synchronize their actions. Should free up the design space considerably.

SUPERCILEX commented 7 months ago

Title for the announcement:

Key claim: memory usage is O(1) for normal usage, so you're only limited by disk space (well that and the ID limits, but that's easily increased if it was something we cared about).

Server

The server is a (mostly) single-threaded application as everything we are doing is designed to minimize latency and maximize efficiency rather than worry about throughput. The server is implemented as a connection oriented (SOCK_SEQPACKET) unix domain socket with a core loop that simply processes requests from clients and issues syscalls using io_uring. We will tell people that old kernels and non-linux kernels are not officially supported, though PRs to do blocking I/O may be considered for acceptance.

On a new connection the server tells the client its version. The client checks for an exact match and shuts down if its version doesn't match the server's

The server's goal is to be the one and only place where data is written. On startup, the first thing it will do is open the ring file and acquire an exclusive lock (https://man7.org/linux/man-pages/man2/flock.2.html). Clients only ever open files in read-only mode, so they don't care about the lock. It's only used to prevent concurrent server instances.

The server will be backend agnostic (so wayland and x11) gated by feature flags. When a feature isn't enabled, the implementation returns an error upon creation which a unifying impl uses for broadcast/merging purposes. Looks like this:

struct AllBackends {
  a: Option<BackendA>,
  b: Option<BackendB>,
}

impl Backend for AllBackends {
  // Impl with Option::map and error tuples? Or maybe handle errors internally? Maybe closures?
}

// Maybe use https://docs.rs/arrayvec/latest/arrayvec/struct.ArrayVec.html as message buffer (actually nevermind because clients are backends now)
// New idea: make every client a backend since it can also send add events and request content

Nope, the backends should be in a list as clients and wayland are treated the same.

Each backend will need to return a file descriptor for io_uring to wait on. For wayland, this should be pretty easy. Haven't looked into X11 but worst case scenario a backend can start a thread and do a blocking loop which writes to an in-memory pipe that is read by io_uring. Not very efficient, but it'll cleanly handle back pressure and message passing.

All files that need to be read will be mmaped into the client and server as read-only. Writes will happen through normal write syscalls which is supposed to be coherent (and note that multiple processes mmapping the same file is also coherent as the pages are simply shared across processes). We cannot write directly to mmapped regions because there is no guarantee on when the data will be written back to disk which means we would need our own https://man7.org/linux/man-pages/man2/msync.2.html flushing policy and that sounds unpleasant. Also I doubt performance will be good assuming writes trigger page faults. With normal write syscalls, all you should get is cache flushes for the affected addresses which in practice means no overhead at all since adding a clipboard entry shouldn't involve reading and most of the time clients won't be active so no data will be in the cpu caches. Or maybe that assumption is wrong and I'd need to run perf tracking tlb flushes to check, but the other issue with not using mmap for reading is that then everything needs to be loaded into individually malloced buffers (String) and that's definitely terrible for performance.

Entry encoding:

struct Entry {
  bucket_index: u20,
  size: u12,
}

Ring buffer header:

struct Header {
  version: u8,
  write_head: u32,
}

Process for adding a new clipboard entry:

Delete an item:

Move item:

Swap items:

(Un)Favorite item:

Changing the max number of items:

Nuke (aka delete everything):

Search:

Extra commands for the client:

Extra commands for the server:

How data allocation works:

Directory structure:

Random notes:

Resources for integrating with the clipboard

GUI/TUI

Two lists side-by-side, one for favorites and one for normal items. This ensures you can have a lot of favorites without it making it annoying to access normal items. Only 100 items are loaded. If someone asks for it, maybe I could implement pagination, but in GCH I personally never use pagination and only search.

CLI

Commands:

SUPERCILEX commented 7 months ago

Call the project Ringboard.

SUPERCILEX commented 7 months ago

An entry ID is the concatenation of a tag ID and a ring ID, each 32 bits. So the full entry ID is 64 bits. This enables the architecture to seamlessly transition to tagged entries if I ever decides there's any point to doing that.

SUPERCILEX commented 7 months ago

Next steps:

SUPERCILEX commented 6 months ago

Next steps:

SUPERCILEX commented 6 months ago

Dumb idea, but what if the X11/Wayland watchers are just clients? So they run in their own binaries which means I don't have to care about bloat, feature flags, performance stuff etc.

SUPERCILEX commented 6 months ago

Note in blog post that metadata files are fine because they're so tiny: https://www.kernel.org/doc/html/latest/filesystems/ext4/overview.html#inline-data

SUPERCILEX commented 6 months ago

Thoughts on error handling: to keep things simple, we assume that requests are only made by trusted code, so we don't bother validating the contents of requests. This also means we don't have to make the server resilient to bad clients and it can just shut down when something goes wrong. Basically we're cool with clients performing denial of server. However, assuming the code is trusted, we do perform error handling for dynamic values (for example to check that the ID a user asks us to move is valid).

SUPERCILEX commented 6 months ago

Thoughts on reading data and search:

SUPERCILEX commented 4 months ago

Oooooh, I'm a dumdum—instead of having to create a metadata file, I can just use extended file attributes! There's even user.mime_type already defined for this purpose: https://wiki.archlinux.org/title/Extended_attributes. Those metadata files were really getting under my skin, so this is sweet. Extended attributes are even stored inline if they're small: https://www.kernel.org/doc/html/latest/filesystems/ext4/dynamic.html#extended-attributes

SUPERCILEX commented 4 months ago

Next steps:


Solution to the duplicates problem:

SUPERCILEX commented 4 months ago

https://www.uninformativ.de/blog/postings/2017-04-02/0/POSTING-en.html and https://github.com/linebender/druid/blob/master/druid-shell/src/backend/x11/clipboard.rs as resources for the X11 clipboard. I think we can basically use the blog post a good reference. For waiting: https://github.com/cdown/clipnotify/blob/master/clipnotify.c.

SUPERCILEX commented 4 months ago

For non-plain-text entries, we should add some notion of context that can be stored in extended file attributes for the UI to show. For example, if you copy a google slide, there's nothing we can show (without trying to parse Chrome's custom data format, no thanks) that will be useful. But we could ask for the X11 window title at the time of copy and save that to show in the GUI. Better than nothing.

SUPERCILEX commented 4 months ago

Add mut methods for EntryReader that remap as necessary. After deduplication, rewrite stats to use new technique. After X11, fuzz and then publish.

SUPERCILEX commented 4 months ago

Still not sure what to do about paste. There are three options:

SUPERCILEX commented 4 months ago

For deduplication it turns out you there's no point in trying to be fancy because you'd need to store the contents to avoid hash collisions, so I gave and just implemented a crappy static hash table.

SUPERCILEX commented 4 months ago

For change broadcasting, it occurred to me that inotify/fanotify might be useful but I can't seem to find a way to subscribe to a file range. Otherwise we could have actually used that to subscribe to entry changes plus the write head.

SUPERCILEX commented 4 months ago

UI ideas:

SUPERCILEX commented 2 months ago

For search, it might be worth supporting incremental plain text search. That is, for every query the user types that is a superset of the previous query, we can search through only the existing results rather than having to perform a full database search again. Furthermore, we can start searching each of these results at the saved start location of the query result (minus however many prefix characters were added in the superset query). This would basically make subsequent searches instant as it's just confirming whether or not the newly expanded query still matches rather than finding the query location all over again.

SUPERCILEX commented 1 month ago

Closing in favor of new issues.

SUPERCILEX commented 1 month ago

For pasting I decided the library is the right approach because it allows building a daemon (that uses the library) if it turns out that's what we want.

SUPERCILEX commented 1 month ago

I looked into adding migrations from https://github.com/hluk/CopyQ and https://github.com/Slackadays/Clipboard. CopyQ uses a binary format and I don't want to have to compile C++ to be able to build our CLI so that's not doable. Slackadays's clipboard defaults to being transients and only supports files or single entries from the command line, so I'm guessing most people don't actually store long-term history in there.

SUPERCILEX commented 1 month ago

Ok I changed my mind about pasting. The problem with in-process pasting is that if the UI wants to exit immediately after pasting you're screwed because you need to wait for someone to bite (the compositor should immediately, but who knows) and then you can't offer good pasting services like target selection. One option is to fork on paste but if you're spawning a process on paste anyway, might was well run a paste server. Then it occurred to me that the watcher servers already have connections to x11/wayland so why not re-use those? They'll just spin up a socket that accepts files to offer as paste.