Improve the space-efficiency of complex selectors

proycon commented 1 year ago

I'd like to improve the space-efficiency of the complex selectors (MultiSelector/CompositeSelector/DirectionalSelector). In earlier discussions, we already established that the MultiSelector is a valid tool to annotate multiple targets using only a single annotation. However, in the current implementation, the selector is still implemented in a way that explicitly enumerates all the offsets in the text. So if you annotate 100,000 targets with a single annotation via a MultiSelector (saving yourself 99,999 annotations in the process), you still have 100,000 subselectors in memory.

This can be done more space-efficient. In Text Fabric @dirkroorda efficiently maps entire ranges of nodes to annotation content (features):

1-426590 word 426591-426629 book 426630-427558 chapter 427559-515689 clause 515690-606393 clause_atom 606394-651572 half_verse 651573-904775 phrase 904776-1172307 phrase_atom 1172308-1236024 sentence 1236025-1300538 sentence_atom 1300539-1414388 subphrase 1414389-1437601 verse 1437602-1446831 lex

I think we need a similar way to express large ranges in STAM. We too have 'nodes' that are expressed by an internal integer ID (TextSelections, Annotations, TextResources, AnnotationDataSets), and if there's a large contigent range of them we can refer to them by a simple begin intID and end intID (or multiple if there are non-contingent parts).

In ideal circumstances, we can then express complex selector with 100,000 subselectors using just one (new) ranged subselector instead.

Such a ranged subselector may be best kept as a part of STAM's 'extended model', i.e. parts of its internals and not expressed in canonical serialisation. This keeps the model simple and easier to interpret for the outside world, but uses the necessary optimisations internally.

There's one limitation in this approach: When targetting text, using such a ranged subselector would only work for 'simple' offsets, that is, offsets that refer directly to the resource using begin-aligned cursors. If the offset is relative (goes through another annotation) or uses end-aligned cursors, then we need to store a copy of that offset.

proycon commented 1 year ago

This is now largely implemented in stam-rust, except for the actual parsing that automatically creates ranged selectors.

dirkroorda commented 1 year ago

Yes, this is a handy device.

In Text-Fabric my general format for a target/selector is a comma separated list of ranges, and a range is either a single number or a string b-e where b and e are numbers.

I have not been pythonic, so 1-2 refers to items 1 and 2.

I may regret my choice in this.

I may also regret that I address character positions and not the boundaries between character positions.

So if I address slot n, I address the unit at position n. But a system where you can address zero-length points in the string is probably better.

If I had to do it again, I would number positions from 0 onward, 0 being the point before the first character.

(0, 1) would address the first character, (n, n+1) the n+1th character, and (1, 1) or 1 would address the point after the first character, so (n, n) or n is the point after the n-th character. (0, 0) is the point before the first character.

dirkroorda commented 1 year ago

By the way: I use these ranges when serializing features to disk. When I read them back into RAM, I let them explode again.

I'm still looking for data structures that can efficiently map numbers to strings, without storing all the numbers and yet fast in addressing those strings by number. A good candidate is the bisect operator (https://docs.python.org/3/library/bisect.html#module-bisect)

proycon commented 1 year ago

I'm reopening this because it may need further revision

proycon commented 6 months ago

Already implemented earlier, forgot to close.

annotation / stam

Improve the space-efficiency of complex selectors #11