JuliaString / InternedStrings.jl

Fully transparent string interning functionality, with proper garbage collection
Other
27 stars 6 forks source link

Consider pros and cons of alternative data structure for pool #6

Open xiaodaigh opened 6 years ago

xiaodaigh commented 6 years ago

Currently pool is a WeakRefDict. Wondering if it can be made into something else to allow for more operations to be more efficient e.g. sorting/grouping. A B-tree perhaps?

oxinabox commented 6 years ago

You don't perform operations on the bool. It's not user-facing, and it is not safe for users to touch.

It needs exactly one operation: intern!, which isn't really related to sorting or grouping. I guess that operation could be called getkey!. It needs to check for the set containing an element equal the input, and if so return that element, if not add the input, and return it.

The pros and cons of alternative data structures are not around making any other operations efficient. So here they are:

Custom Hashfunction hashset/dict

Pro:

Could be faster than current implementation

Con:

Work to implement basically from ground up

TreeSet/dict

Pro

Comparisons could be faster than current hash function

Con

Gets slower the more elements it contains. Technically so does Hashmap, but this is worse. Slowness gets worse if strings share common prefixes.

Trie

All pros and cons of TreeSet/dict, but additionally

Pro

Uses less memory in storage

Con

Big Con: I can't workout away to actually pull this off that actually results in being able to return strong references to its contents, since it gains those memory savings by discarding the original strings and only storing parts. I guess maybe if we had a RopeString class still, but then the strings themselves would be slow ...

There are a few other finite-automata-ish string storing structures with the same kinda problems as tries but perhaps more so. Since a set of strings can be reduced down into definite a finite-automata that only accepts strings from that set. But I think that line of reasoning is a deadend of slowness

xiaodaigh commented 6 years ago

Yeah, but if the pool is stored in a data structure such that it's easy to obtain each element's rank within the pool then one can sort an InternedStrings vector quite fast using counting sorting. Suppose there are only 3 possible elements "a", "b", "c" and so their rank is 1,2,3. We can scan through the vector and keep a count of the rank, and then use the counts to sort the vector using the same algorithm as counting sort. I don't think this is currently possible.

Nice properties for the pool data structure are easy to look up, easy to insert, easy to delete, and easy to find rank

oxinabox commented 6 years ago

Rather than talking about properties of a pool do you want to take a step outwards and talk about properties of the InternedString

Right now, the property of the interned string is:

You are suggesting adding:

Right?

xiaodaigh commented 6 years ago

Yeah

oxinabox commented 6 years ago

Due to the removal of the InternedString type this is not directly possible anymore, While there is still some value in the consideration, and perhaps we could definate some kinda of order(a,b)= getind(intern(a)) < getind(intern(b)) type deal. It is looking less likely.

I'm so dubious about the using of Tries for performance or for memory decreasing, for most use cases. But like maybe it could be a thing. It might actually be faster than hashing I guess.

ScottPJones commented 6 years ago

I used to use B+ trees for interning (this was for huge tables, millions of entries, the B+ trees were buffered in shared memory, with TB of disk). It worked very well for us (and this was for NLP work), a nice thing was being able to get ordered output fast (at least, by Unicode codepoints), made finding prefixes trivial.