Closed cessen closed 6 years ago
This is now implemented in the custom_segmentation branch. Still needs to be documented.
This does indeed present a more complex-looking API, even though its typical usage is identical to before. Alas...
One change I made in implementation is calling the trait GraphemeSegmenter instead of just Segmenter. The idea is to present this API as a means to customize grapheme segmentation, even though it could be used for other things. Its intent is graphemes. This also nicely dodges the naming issue!
Done!
Right now Ropey is hard-coded to segment text based on extended grapheme clusters as defined in Unicode Standard Annex 29. However, grapheme clusters are supposed to be customizable to some extent, based on application. Annex 29 itself even mentions this. Moreover, some applications may want to segment based on something other than graphemes, or even not segment at all for even better performance.
Ropey should have a way to customize this behavior, while still having sane defaults for the large majority of applications.
Design
The tentative design for this feature is to have a trait Segmenter that has two methods, is_break() and seam_is_break().
These methods return whether or not a point in a text is a valid break between segments. Ropey will then use them to determine where it can and cannot split leaf nodes.
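To make the shape of the trait concrete, here is a minimal sketch. The method signatures are assumptions (the text above only names the two methods), and CrlfSegmenter and NullSegmenter are hypothetical example types, not part of Ropey. The default is_break() simply splits the text at the index and defers to seam_is_break().

```rust
/// Hypothetical sketch of the trait. The parameter lists are assumptions;
/// only the method names come from the design description.
pub trait Segmenter {
    /// Is `byte_idx` a valid break inside `text`?
    /// Default: split the text at the index and ask `seam_is_break()`.
    fn is_break(byte_idx: usize, text: &str) -> bool {
        // An index that is not on a char boundary can never be a break.
        if !text.is_char_boundary(byte_idx) {
            return false;
        }
        let (left, right) = text.split_at(byte_idx);
        Self::seam_is_break(left, right)
    }

    /// Is the seam between `left` and `right` a valid break?
    fn seam_is_break(left: &str, right: &str) -> bool;
}

/// Example segmenter (hypothetical): keeps "\r\n" together and allows a
/// break at every other char boundary.
pub struct CrlfSegmenter;

impl Segmenter for CrlfSegmenter {
    fn seam_is_break(left: &str, right: &str) -> bool {
        !(left.ends_with('\r') && right.starts_with('\n'))
    }
}

/// Example segmenter (hypothetical): no segmentation at all, so every
/// char boundary is a valid break.
pub struct NullSegmenter;

impl Segmenter for NullSegmenter {
    fn seam_is_break(_left: &str, _right: &str) -> bool {
        true
    }
}
```

Neither example type holds data, and the methods are called as CrlfSegmenter::is_break(idx, text), so no value of the type is ever needed.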
is_break() will have a default (but slightly inefficient) implementation that calls seam_is_break().
The Rope type will then take as a type parameter any type that implements Segmenter.
The Nodes of a Rope will also take that same type parameter. Then T::is_break() and T::seam_is_break() will be used throughout the code to determine valid breaks in the text.
Note that neither method of Segmenter takes a self parameter. The trait is just a way to inject the functions into the Rope type without run-time overhead. The Segmenter types themselves will never be instantiated.
Finally, Rope will have a default Segmenter that segments based on extended graphemes, so most users of the library won't need to concern themselves with any of this. Ideally, this whole thing should be invisible to users that don't care about it.
Compatibility Between Ropes
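As a rough sketch of the type-parameter design, and of why comparisons across segmenters are cheap to support, the toy Rope below stores a flat list of chunks instead of a tree. Everything in it is made up for illustration and is not Ropey's actual code: DefaultSegmenter (a CRLF-only stand-in for real extended-grapheme rules), NullSegmenter, from_str_chunked(), chunks(), and the trait signature are all assumptions.

```rust
use std::marker::PhantomData;

pub trait Segmenter {
    fn seam_is_break(left: &str, right: &str) -> bool;
}

/// Stand-in for the default segmenter. A real default would follow the
/// extended grapheme cluster rules; this one only keeps "\r\n" together.
pub struct DefaultSegmenter;
impl Segmenter for DefaultSegmenter {
    fn seam_is_break(left: &str, right: &str) -> bool {
        !(left.ends_with('\r') && right.starts_with('\n'))
    }
}

/// Breaks anywhere (no segmentation).
pub struct NullSegmenter;
impl Segmenter for NullSegmenter {
    fn seam_is_break(_: &str, _: &str) -> bool {
        true
    }
}

/// Toy rope. The segmenter is a type parameter with a default, so plain
/// `Rope` means `Rope<DefaultSegmenter>`. `PhantomData` records the type
/// without ever creating a value of it.
pub struct Rope<T: Segmenter = DefaultSegmenter> {
    chunks: Vec<String>,
    _segmenter: PhantomData<T>,
}

fn is_break<T: Segmenter>(text: &str, i: usize) -> bool {
    text.is_char_boundary(i) && {
        let (left, right) = text.split_at(i);
        T::seam_is_break(left, right)
    }
}

impl<T: Segmenter> Rope<T> {
    /// Splits `text` into chunks of at most `max_chunk` bytes where the
    /// segmenter allows it, and into larger chunks where it does not.
    pub fn from_str_chunked(text: &str, max_chunk: usize) -> Self {
        let mut chunks = Vec::new();
        let mut rest = text;
        while !rest.is_empty() {
            let split = Self::split_point(rest, max_chunk);
            let (head, tail) = rest.split_at(split);
            chunks.push(head.to_string());
            rest = tail;
        }
        Rope { chunks, _segmenter: PhantomData }
    }

    fn split_point(text: &str, max_chunk: usize) -> usize {
        if text.len() <= max_chunk {
            return text.len();
        }
        // Prefer the last valid break at or before `max_chunk`; otherwise
        // take the first valid break after it, or the whole text.
        (1..=max_chunk)
            .rev()
            .find(|&i| is_break::<T>(text, i))
            .or_else(|| (max_chunk + 1..text.len()).find(|&i| is_break::<T>(text, i)))
            .unwrap_or(text.len())
    }

    pub fn chunks(&self) -> Vec<&str> {
        self.chunks.iter().map(|s| s.as_str()).collect()
    }

    pub fn chars(&self) -> impl Iterator<Item = char> + '_ {
        self.chunks.iter().flat_map(|c| c.chars())
    }
}

/// Equality only looks at the text, so it works across segmenters.
impl<T: Segmenter, U: Segmenter> PartialEq<Rope<U>> for Rope<T> {
    fn eq(&self, other: &Rope<U>) -> bool {
        self.chars().eq(other.chars())
    }
}
```

With the text "ab\r\ncd" and a 3-byte chunk limit, the default-segmented rope avoids splitting the CRLF pair while the NullSegmenter rope splits straight through it, yet the two still compare equal because the comparison never consults chunk boundaries.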
All operations except append() should be compatible between Ropes with different Segmenters. Append can't work (at least not efficiently) because all nodes of the Rope being appended would have to be re-evaluated to conform to the segmentation strategy of the Rope being appended to. But things like PartialEq impls, etc. should work fine across different segmentation strategies.
Open Questions
Right now Ropes have methods for checking and working with grapheme boundaries, and an iterator for iterating over grapheme clusters. These will be easy to port to work via whatever Segmenter is given. However, their names (e.g. is_grapheme_boundary()) only make sense for graphemes.
So... this is a bit bike-sheddy, but what should they be called? My main concern with renaming them is that I don't want them to feel weird or too abstract for the 90+% use-case of just segmenting on graphemes. On the other hand, they shouldn't be outright wrong for other cases either.
At the moment, just substituting "grapheme" with "segment" everywhere seems like the least bad idea. But that may also get a little confusing compared to e.g. the various "chunk" functionality.
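For a feel of how the "segment" naming might read, here is a small sketch of boundary helpers driven by a Segmenter, written over a plain &str rather than a rope. The names is_segment_boundary(), next_segment_boundary(), and segments() are hypothetical, as are the trait signature and the CrlfSegmenter example type.

```rust
pub trait Segmenter {
    fn seam_is_break(left: &str, right: &str) -> bool;
}

/// Example segmenter (hypothetical): keeps "\r\n" together.
pub struct CrlfSegmenter;
impl Segmenter for CrlfSegmenter {
    fn seam_is_break(left: &str, right: &str) -> bool {
        !(left.ends_with('\r') && right.starts_with('\n'))
    }
}

/// Segment-flavored counterpart of `is_grapheme_boundary()`.
pub fn is_segment_boundary<T: Segmenter>(text: &str, byte_idx: usize) -> bool {
    // The start and end of the text are always boundaries.
    if byte_idx == 0 || byte_idx == text.len() {
        return true;
    }
    text.is_char_boundary(byte_idx) && {
        let (left, right) = text.split_at(byte_idx);
        T::seam_is_break(left, right)
    }
}

/// First boundary strictly after `byte_idx`, or the end of the text.
pub fn next_segment_boundary<T: Segmenter>(text: &str, byte_idx: usize) -> usize {
    (byte_idx + 1..=text.len())
        .find(|&i| is_segment_boundary::<T>(text, i))
        .unwrap_or(text.len())
}

/// Collects the segments of `text` in order.
pub fn segments<T: Segmenter>(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let end = next_segment_boundary::<T>(text, start);
        out.push(&text[start..end]);
        start = end;
    }
    out
}
```

The helpers read naturally for a non-grapheme segmenter like this one, which is the case the current grapheme-specific names would get wrong.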
Drawbacks
I'm increasingly convinced that abstraction has a cognitive cost for users of a library. Making something simple, straight-forward, and purpose-built is usually a good default, and the choice to abstract things should undergo a cost-benefit analysis.
Making this change will make Ropey a little more abstract, and a little less concrete. The naming question above is a hint at that. People browsing the API docs will likely wonder "What is this Segmenter thing, and what is this weird type parameter on Rope and RopeSlice?", and the various methods that depend on Segmenter will be less clearly named for the common case. All of this imposes cognitive load on users of the library.
I still think this is the right choice, because how text is segmented is purpose-specific, and I would prefer for Ropey to not be unusable for anyone just because of its segmentation strategy. Nevertheless, I want to acknowledge that this comes at a cost.