ITensor / ITensors.jl

A Julia library for efficient tensor computations and tensor network calculations
https://itensor.org
Apache License 2.0
532 stars 123 forks source link

Symbol vs SmallString #608

Open saolof opened 3 years ago

saolof commented 3 years ago

Just one point of input: SmallString is twice as large as Symbol and more expensive to both hash and compare, in addition to being library-specific, and occasionally having some limitations (which led to #550 and #551 for example).

As a result, it looks a bit like an artefact from porting the C++ version to Julia, so it looks like there may be both a performance and convenience benefit to using Symbol instead of SmallString where possible? Or generally using an interned string type instead of a small bytestring.

mtfishman commented 3 years ago

It's a good question about which one to use, and each has tradeoffs. The main tradeoff of using SmallString is the limited tag length. I don't see why it would be more expensive to hash and compare since they are directly convertible to integers, so that should be very fast. In fact, within TagSet we actually directly store the tags as integers:

julia> ts = TagSet("tag1,tag2,xyzw")
"tag1,tag2,xyzw"

julia> ts.data
4-element StaticArrays.SVector{4, UInt128} with indices SOneTo(4):
 0x74006100670031000000000000000000
 0x74006100670032000000000000000000
 0x780079007a0077000000000000000000
 0x00000000000000000000000000000000

so hashing and comparison are integer operations. Currently, the SmallString type itself is not a single UInt128 but instead a SVector{8, UInt16}, but probably we could just make the SmallString type directly a UInt128 and construct it using bit manipulations.

For what it's worth, the maximum size of a tag is no larger than a String or Symbol:

julia> sizeof(ts"αβγϵωρδν"[1])
16

julia> sizeof("αβγϵωρδν")
16

julia> sizeof(Symbol("αβγϵωρδν"))
16

If you limit to ASCII characters, because the storage is a fixed size it does indeed have larger storage than is needed by String or Symbol:

julia> sizeof(ts"abcdefgh"[1])
16

julia> sizeof("abcdefgh")
8

julia> sizeof(Symbol("abcdefgh"))
8

One thing that is highly optimized (building on the SmallString design) is constructing TagSets from Julia Strings. At some point I did a test of trying to make a small set of Symbols from a string like "tag1,tag2,xyzw" (to make a TagSet with 3 tags) and I couldn't make it as fast as the current design. This conversion is pretty important since it happens all over the place. Of course we could use a different interface, but we find that syntax to be quite compact (compared to say something like [:tag1, :tag2, :xyzw] or ["tag1", "tag2", "xyzw"]). For very fast construction you can use the macro syntax ts"tag1,tag2,xyzw", but that is not always possible if there are runtime values in a tag, which is quite common.

Over all, my conclusion is that it would be nice to be able to use Symbols as the tag type for the sake of allowing larger tags (and maybe some average-case storage savings), but I wasn't able to make the conversion from string as fast as the current design. Over all, I haven't felt that the tag length limitation is too annoying, but if you or someone else comes across a compelling case of needing larger tags or more tags than we could think about how to increase the storage.

saolof commented 3 years ago

Right, upon closer examination of my quick microbenchmark comparison between SmallStrings, it turned out that SmallString's comparison inherits from AbstractVector, which compiles to a nested loop. It may be worth adding an overload to avoid this.

IntSmallString comparisons of course compile to a few ALU instructions (a few xors, then checking that everything is zero) to handle comparing the long int types, i.e. only a few more instructions than the symbol comparison that compiles to a single intq operation on x86 processors.

As far as convenience tradeoffs between the two for anything type-system related, I guess symbol has the advantage of playing nicely with metaprogramming and named tuples, while SmallString has the advantage of being an isbits type.

Edit: The sizeof function is misleading for symbols because it just returns the length of its string. However, Symbols are interned (they point to a unique string in a hash table, and creating a new symbol checks if the string already exists in the hash table before allocating a new one). So you only pass around the pointer to the interned string, and you can compare or hash symbols by just comparing/hashing the pointer. In terms of the size of what you pass around, symbols are pointer-sized integers

mtfishman commented 3 years ago

SmallString actually uses whatever comparison is used for SVectors (https://github.com/ITensor/ITensors.jl/blob/master/src/smallstring.jl#L97) which may just be a generic AbstractArray comparison.

The TagSet/SmallString design is a bit of a mess right now, and it is kind of inconsistent how SmallString is used (as I said, TagSets don't even store SmallStrings internally, instead they literally use UInt128 to store the tags). The main operations that go on throughout the library involve construction and comparisons of TagSets, which really use (or should be using) integer comparisons.

Agreed that SmallString comparison itself could be improved, though I'm not sure we need that type anymore since we don't even use it in TagSet. Based on the current design of TagSet, it would probably be best if SmallString was changed to literally have a UInt128 storage (instead of an SVector{8, UInt16}), and then we could directly use SmallString in TagSet. The main downside of that is the tag length is pretty inflexible, but we could allow larger SmallStrings by storing larger integer types from BitIntegers.jl.

In fact, it looks like that is exactly the design of ShortStrings.jl so we could consider just using that package.

Thanks for the information about sizeof for Symbol, I didn't know that is how Symbols are stored (I tried looking up how they are stored but couldn't find any information).