Open Kevin-Jin opened 7 years ago
It would probably be best to keep the string pool and the main table in separate files so that we don't have to worry about shifting a bunch of bytes when we want to add new strings to the pool or add more rows to the table.
In case the user uses the same string repeatedly and decides to not use factors instead, we should be able to reuse strings. Implement an associative map inside the string pool file that associates a 64-bit pointer to the string with a tuple key consisting of the string's hash and length. At the location pointed to, the actual string value should be preceded by a header consisting of the length of the string and a reference counter.
In order to reduce the number of bytes that need to be shifted, implement a primitive malloc
scheme inside the string pool file. Every time a string is no longer referenced, place it on a doubly linked list of free blocks (sorted by address order) consisting of nodes that encode the length of the freed string and a pointer to the next and previous free blocks. When adjacent strings are freed, coalesce the free blocks by summing the lengths of the freed strings.
The challenge is in the algorithm that selects the free block for a new string of a given length that minimizes future heap fragmentation. Perhaps consider Doug Lea's malloc
or the binary-buddy allocator as a starting point.
We can also try being clever and postpone a table rebuild when assigning a too-long string to a fixed width column by writing a little endian integer after the null character that serves as a pointer to a string pool. If the fixed width column is at least 3 bytes wide, we can support "string continuations" that begin less than 0xFFFF bytes into the string pool file, etc. We have to be careful and add padding to the string pool to avoid pointers where the the least significant 8 bits match with with our representation for NA strings.
The fixed-width string solution is great for cache performance when strings are short but leads to a lot of wasted space and cache thrashing when the longest string is much longer than the shortest string. I propose to add a new
clob
(character large object)Ctype
as an alternative to thepadded.character
Ctype
. Like Stata 13+'sstrL
type, they will be implemented using an additional level of indirection where 64-bit pointers stored in the table's rows will refer to some character offset in a separate blob of text that does not need to be aligned and padded.The user can either explicitly request this method of storing strings or
as.mmap.character()
can automatically decide on this method based on the variability in the lengths of the strings.