cessen / ropey

A utf8 text rope for manipulating and editing large texts.
MIT License

Add feature to customize segmentation #7

Closed cessen closed 6 years ago

cessen commented 6 years ago

Right now Ropey is hard-coded to segment text based on extended grapheme clusters as defined in Unicode Standard Annex 29. However, grapheme clusters are supposed to be customizable to some extent, depending on the application, and Annex 29 itself says so. Moreover, some applications may want to segment on something other than graphemes, or skip segmentation entirely for better performance.

Ropey should have a way to customize this behavior, while still having sane defaults for the large majority of applications.

Design

The tentative design for this feature is to have a trait Segmenter that has two methods, is_break() and seam_is_break():

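trait Segmenter {
    /// Returns true if `byte_idx` is a valid segment break within `text`.
    fn is_break(byte_idx: usize, text: &str) -> bool;
    /// Returns true if the seam between `left` and `right` is a valid segment break.
    fn seam_is_break(left: &str, right: &str) -> bool;
}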

These methods return whether or not a given point in a text is a valid break between segments. Ropey will then use them to determine where it can and cannot split leaf nodes. is_break() will have a default (but slightly inefficient) implementation that calls seam_is_break().
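As a rough sketch of how that default could look (not the actual Ropey code), the version below has is_break() slice the text at byte_idx and hand both halves to seam_is_break(). The two segmenter types, CrLfSegmenter and NoSegmenter, are hypothetical examples invented for illustration, and the default assumes byte_idx falls on a char boundary.

```rust
/// Sketch of the proposed trait, with `is_break()` defaulting to `seam_is_break()`.
trait Segmenter {
    /// Whether `byte_idx` is a valid segment break within `text`.
    /// Default: split the text at `byte_idx` and ask about the resulting seam.
    /// Assumes `byte_idx` is on a char boundary (slicing panics otherwise).
    fn is_break(byte_idx: usize, text: &str) -> bool {
        Self::seam_is_break(&text[..byte_idx], &text[byte_idx..])
    }

    /// Whether the seam between two adjacent pieces of text is a valid break.
    fn seam_is_break(left: &str, right: &str) -> bool;
}

/// Hypothetical segmenter: any seam is a break except the middle of "\r\n".
struct CrLfSegmenter;

impl Segmenter for CrLfSegmenter {
    fn seam_is_break(left: &str, right: &str) -> bool {
        !(left.ends_with('\r') && right.starts_with('\n'))
    }
}

/// Hypothetical segmenter that does not restrict splitting at all.
/// It overrides `is_break()` so the text is never sliced.
struct NoSegmenter;

impl Segmenter for NoSegmenter {
    fn is_break(_byte_idx: usize, _text: &str) -> bool {
        true
    }

    fn seam_is_break(_left: &str, _right: &str) -> bool {
        true
    }
}
```

With these definitions, CrLfSegmenter::is_break(2, "a\r\nb") reports that the text cannot be split between the carriage return and the line feed, while NoSegmenter accepts every position.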

The Rope type will then take as a type parameter any type that implements Segmenter, like so:

use std::marker::PhantomData;

struct Rope<T: Segmenter> {
    // ...
    _seg: PhantomData<T>, // zero-sized marker; no T value is ever stored
}

The Nodes of a Rope will also take that same type parameter. Then T::is_break() and T::seam_is_break() will be used throughout the code to determine valid breaks in the text.

Note that neither method of Segmenter takes a self parameter. The trait is just a way to inject the functions into the Rope type without run-time overhead. The Segmenter types themselves will never be instantiated.
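To make the "never instantiated" point concrete, here is a minimal sketch under a few assumptions: ToyRope is a stand-in that holds one String rather than a tree of nodes, and CrLfSegmenter, can_split_at() and segmenter_overhead() are hypothetical names. The segmenter is declared as an empty enum, so no value of it can exist, yet its functions are still reachable through the type parameter.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

trait Segmenter {
    fn seam_is_break(left: &str, right: &str) -> bool;
}

/// An empty enum has no values, so this type can never be instantiated.
enum CrLfSegmenter {}

impl Segmenter for CrLfSegmenter {
    fn seam_is_break(left: &str, right: &str) -> bool {
        !(left.ends_with('\r') && right.starts_with('\n'))
    }
}

/// Stand-in for the real Rope: a single String instead of a tree of nodes.
struct ToyRope<S: Segmenter> {
    text: String,
    _seg: PhantomData<S>,
}

impl<S: Segmenter> ToyRope<S> {
    fn new(text: &str) -> Self {
        ToyRope {
            text: text.to_string(),
            _seg: PhantomData,
        }
    }

    /// Calls the segmenter through the type parameter (static dispatch).
    /// Assumes `byte_idx` is on a char boundary.
    fn can_split_at(&self, byte_idx: usize) -> bool {
        S::seam_is_break(&self.text[..byte_idx], &self.text[byte_idx..])
    }
}

/// Extra bytes the segmenter parameter adds to the struct.
fn segmenter_overhead<S: Segmenter>() -> usize {
    size_of::<ToyRope<S>>() - size_of::<String>()
}
```

Since PhantomData is zero-sized, segmenter_overhead::<CrLfSegmenter>() comes out as 0, and the S::seam_is_break() call is resolved at compile time.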

Finally, Rope will have a default Segmenter that segments based on extended graphemes, so most users of the library won't need to concern themselves with any of this. Ideally, this whole thing should be invisible to users that don't care about it.
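One way this could be done is with a default type parameter, sketched below. This is an assumption about the mechanism, not a description of the actual implementation. DefaultSegmenter here is only a placeholder that keeps CRLF pairs together (the real default would apply the extended grapheme rules), and ToyRope, NullSegmenter, from_text() and can_split_at() are hypothetical names.

```rust
use std::marker::PhantomData;

trait Segmenter {
    fn seam_is_break(left: &str, right: &str) -> bool;
}

/// Placeholder for the real default segmenter. To stay dependency-free it
/// only keeps "\r\n" together instead of applying full grapheme rules.
enum DefaultSegmenter {}

impl Segmenter for DefaultSegmenter {
    fn seam_is_break(left: &str, right: &str) -> bool {
        !(left.ends_with('\r') && right.starts_with('\n'))
    }
}

/// Hypothetical segmenter for callers that want no segmentation at all.
enum NullSegmenter {}

impl Segmenter for NullSegmenter {
    fn seam_is_break(_left: &str, _right: &str) -> bool {
        true
    }
}

/// The default type parameter lets users write plain `ToyRope`.
struct ToyRope<S: Segmenter = DefaultSegmenter> {
    text: String,
    _seg: PhantomData<S>,
}

impl<S: Segmenter> ToyRope<S> {
    fn from_text(text: &str) -> Self {
        ToyRope {
            text: text.to_string(),
            _seg: PhantomData,
        }
    }

    /// Assumes `byte_idx` is on a char boundary.
    fn can_split_at(&self, byte_idx: usize) -> bool {
        S::seam_is_break(&self.text[..byte_idx], &self.text[byte_idx..])
    }
}
```

A user who doesn't care writes `let rope: ToyRope = ToyRope::from_text("...");` and gets the default, while `ToyRope<NullSegmenter>` opts into a different strategy. One caveat of this approach is that the default only applies where the type is named, so a bare `let rope = ToyRope::from_text("...");` with no annotation would leave S uninferred.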

Compatibility Between Ropes

All operations except append() should be compatible between Ropes with different Segmenters. Append can't work (at least not efficiently) because all nodes of the Rope being appended would have to be re-evaluated to conform to the segmentation strategy of the Rope being appended to. But things like PartialEq impls, etc. should work fine across different segmentation strategies.
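The comparison case can be sketched as follows, assuming a toy rope that stores a flat list of chunks instead of a tree. ToyRope, from_text(), bytes() and both segmenters are hypothetical. Two ropes built from the same text end up chunked differently under different segmenters, but equality only looks at the byte content, so it works across them.

```rust
use std::marker::PhantomData;

trait Segmenter {
    fn seam_is_break(left: &str, right: &str) -> bool;
}

/// Hypothetical segmenter that keeps "\r\n" together.
enum CrLfSegmenter {}

impl Segmenter for CrLfSegmenter {
    fn seam_is_break(left: &str, right: &str) -> bool {
        !(left.ends_with('\r') && right.starts_with('\n'))
    }
}

/// Hypothetical segmenter that allows a split anywhere.
enum NullSegmenter {}

impl Segmenter for NullSegmenter {
    fn seam_is_break(_left: &str, _right: &str) -> bool {
        true
    }
}

/// Stand-in rope: a flat list of chunks instead of a tree of nodes.
struct ToyRope<S: Segmenter> {
    chunks: Vec<String>,
    _seg: PhantomData<S>,
}

impl<S: Segmenter> ToyRope<S> {
    /// Cuts `text` into chunks of at most `max_len` bytes where possible,
    /// moving each cut back to the nearest seam the segmenter allows.
    fn from_text(text: &str, max_len: usize) -> Self {
        let mut chunks = Vec::new();
        let mut rest = text;
        while rest.len() > max_len {
            let mut idx = max_len;
            while idx > 0
                && !(rest.is_char_boundary(idx)
                    && S::seam_is_break(&rest[..idx], &rest[idx..]))
            {
                idx -= 1;
            }
            if idx == 0 {
                break; // no valid cut found; keep the remainder as one chunk
            }
            chunks.push(rest[..idx].to_string());
            rest = &rest[idx..];
        }
        chunks.push(rest.to_string());
        ToyRope { chunks, _seg: PhantomData }
    }

    /// Iterates over the content, ignoring chunk boundaries.
    fn bytes(&self) -> impl Iterator<Item = u8> + '_ {
        self.chunks.iter().flat_map(|c| c.bytes())
    }
}

/// Equality between ropes with different segmenters compares content only.
impl<A: Segmenter, B: Segmenter> PartialEq<ToyRope<B>> for ToyRope<A> {
    fn eq(&self, other: &ToyRope<B>) -> bool {
        self.bytes().eq(other.bytes())
    }
}
```

In this toy model, appending a ToyRope<NullSegmenter> to a ToyRope<CrLfSegmenter> would mean re-running from_text()-style chunking over every chunk of the appended rope, which mirrors why append() is the one operation singled out above.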

Open Questions

Right now Ropes have methods for checking and working with grapheme boundaries, and an iterator for iterating over grapheme clusters. These will be easy to port to work via whatever Segmenter is given. However, their names (e.g. is_grapheme_boundary()) only make sense for graphemes.

So... this is a bit bike-sheddy, but what should they be called? My main concern with renaming them is that I don't want them to feel weird or too abstract for the 90+% use-case of just segmenting on graphemes. On the other hand, they shouldn't be outright wrong for other cases either.

At the moment, just substituting "grapheme" with "segment" everywhere seems like the least bad idea. But that may also get a little confusing compared to e.g. the various "chunk" functionality.

Drawbacks

I'm increasingly convinced that abstraction has a cognitive cost for users of a library. Making something simple, straightforward, and purpose-built is usually a good default, and the choice to abstract things should undergo a cost-benefit analysis.

Making this change will make Ropey a little more abstract, and a little less concrete. The naming question above is a hint at that. People browsing the API docs will likely wonder "What is this Segmenter thing, and what is this weird type parameter on Rope and RopeSlice?", and the various methods that depend on Segmenter will be less clearly named for the common case. All of this imposes cognitive load on users of the library.

I still think this is the right choice, because how text is segmented is purpose-specific, and I would prefer for Ropey to not be unusable for anyone just because of its segmentation strategy. Nevertheless, I want to acknowledge that this comes at a cost.

cessen commented 6 years ago

This is now implemented in the custom_segmentation branch. Still needs to be documented.

This does indeed present a more complex-looking API, even though its typical usage is identical to before. Alas...

One change I made in implementation is calling the trait GraphemeSegmenter instead of just Segmenter. The idea is to present this API as a means to customize grapheme segmentation, even though it could be used for other things. Its intent is graphemes. This also nicely dodges the naming issue!

cessen commented 6 years ago

Done!