Open fogzot opened 1 year ago
I assume we can build the tokens from a str and let char_indices compute the offset here.
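let source = "AB🙃";
// char_indices yields (byte offset, char) pairs for the UTF-8 source.
let tokens: Vec<Token> = source
    .char_indices()
    .map(|(i, ch)| Token {
        ch,
        offset: i as u32,
        len: ch.len_utf8() as u8,
        info: ch.properties().into(),
        data: 0,
    })
    .collect();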
I use SourceRange like this. The start and end are defined in code units. You should get the idea.
source[source_range.to_range().start..source_range.to_range().end]
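To make the byte-offset reading concrete, here is a minimal std-only sketch. `Span`, `byte_spans`, and `slice_of` are hypothetical names used for illustration and are not part of the library; the sketch only assumes that offset and len are meant as UTF-8 byte positions, as the snippets above suggest.

```rust
/// Hypothetical stand-in for the position fields of a token (not the library type).
#[derive(Debug, PartialEq)]
struct Span {
    ch: char,
    /// Byte offset of `ch` in the UTF-8 source.
    offset: u32,
    /// Number of UTF-8 bytes that `ch` occupies.
    len: u8,
}

/// Builds one span per char, taking the offsets straight from `char_indices`.
fn byte_spans(source: &str) -> Vec<Span> {
    source
        .char_indices()
        .map(|(i, ch)| Span {
            ch,
            offset: i as u32,
            len: ch.len_utf8() as u8,
        })
        .collect()
}

/// Slices the source with a span's byte range, recovering the original char.
fn slice_of<'a>(source: &'a str, span: &Span) -> &'a str {
    let start = span.offset as usize;
    &source[start..start + span.len as usize]
}
```

For "AB🙃" this gives offsets 0, 1, 2 and lengths 1, 1, 4, and slicing the source with the third span returns "🙃".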
The documentation of the text::cluster::Token module does not explain what a code unit is. From the example code in the shape module it seems that the offset property is the index of the character in the text and len its length when represented as UTF-8, but is it?

In my code I don't use UTF-8 strings because I have extra information and I keep an array of "chars" like this:
I suppose this is three tokens, but what values for offset and len should one use? Should the offset of the third token be 2 (logical index into the characters) or 3 (index into my array)?
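The ambiguity in the question can be illustrated with a small std-only sketch that computes three common ways of indexing the same text. The function name `index_schemes` and the sample string "A🙃B" are assumptions made for illustration. The sketch does not settle which scheme the library expects for a non-UTF-8 source; it only shows how the candidates differ.

```rust
/// For each char, returns (char index, UTF-8 byte offset, UTF-16 code unit offset).
fn index_schemes(text: &str) -> Vec<(usize, usize, usize)> {
    // Running offset in UTF-16 code units, advanced after each char.
    let mut utf16_offset = 0;
    text.char_indices()
        .enumerate()
        .map(|(char_index, (byte_offset, ch))| {
            let entry = (char_index, byte_offset, utf16_offset);
            utf16_offset += ch.len_utf16();
            entry
        })
        .collect()
}
```

For "A🙃B" the third character sits at char index 2, UTF-8 byte offset 5, and UTF-16 code unit offset 3, so the answer depends entirely on which unit the offset is counted in.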