Open xiaodaigh opened 6 years ago
You don't perform operations on the bool. It's not user-facing, and it is not safe for users to touch.
It needs exactly one operation: intern!
, which isn't really related to sorting or grouping.
I guess that operation could be called getkey!
.
It needs to check for the set containing an element equal the input, and if so return that element, if not add the input, and return it.
The pros and cons of alternative data structures are not around making any other operations efficient. So here they are:
Could be faster than current implementation
Work to implement basically from ground up
Comparisons could be faster than current hash function
Gets slower the more elements it contains. Technically so does Hashmap, but this is worse. Slowness gets worse if strings share common prefixes.
All pros and cons of TreeSet/dict, but additionally
Uses less memory in storage
Big Con: I can't workout away to actually pull this off that actually results in being able to return strong references to its contents, since it gains those memory savings by discarding the original strings and only storing parts. I guess maybe if we had a RopeString class still, but then the strings themselves would be slow ...
There are a few other finite-automata-ish string storing structures with the same kinda problems as tries but perhaps more so. Since a set of strings can be reduced down into definite a finite-automata that only accepts strings from that set. But I think that line of reasoning is a deadend of slowness
Yeah, but if the pool is stored in a data structure such that it's easy to obtain each element's rank within the pool then one can sort an InternedStrings vector quite fast using counting sorting. Suppose there are only 3 possible elements "a", "b", "c" and so their rank is 1,2,3. We can scan through the vector and keep a count of the rank, and then use the counts to sort the vector using the same algorithm as counting sort. I don't think this is currently possible.
Nice properties for the pool data structure are easy to look up, easy to insert, easy to delete, and easy to find rank
Rather than talking about properties of a pool do you want to take a step outwards and talk about properties of the InternedString
Right now, the property of the interned string is:
You are suggesting adding:
<
/ >
checkingRight?
Yeah
Due to the removal of the InternedString type this is not directly possible anymore,
While there is still some value in the consideration,
and perhaps we could definate some kinda of order(a,b)= getind(intern(a)) < getind(intern(b))
type deal.
It is looking less likely.
I'm so dubious about the using of Tries for performance or for memory decreasing, for most use cases. But like maybe it could be a thing. It might actually be faster than hashing I guess.
I used to use B+ trees for interning (this was for huge tables, millions of entries, the B+ trees were buffered in shared memory, with TB of disk). It worked very well for us (and this was for NLP work), a nice thing was being able to get ordered output fast (at least, by Unicode codepoints), made finding prefixes trivial.
Currently pool is a WeakRefDict. Wondering if it can be made into something else to allow for more operations to be more efficient e.g. sorting/grouping. A B-tree perhaps?