Improve Density of GC heap by String Interning (de-duping) on Gen2 GC

vancem commented 7 years ago

It has been noted that for some applications (notably applications like MSBuild or Visual Studio) that manipulate many file paths, or deserialize large text-based payloads (e.g. JSON or XML), tend to have many duplicated strings. We could reduce the size of the GC heap by interning them (that is using the same string for all instances of a particular string value). The runtime already does this for string literals, but strings that are constructed at runtime do not benefit.

To implement this you need a table that remembers all the existing strings that is indexed by string value. While you COULD do this table lookup check when strings are first created, this is not likely to be a good approach because MANY strings have very short lifetimes, and it would slow down ALL strings creation.

Instead the idea is to do the interning check when as part of promoting the object from GC generation 1 to GC generation 2. This is a nice place to do it because

Most strings die before reaching Gen 2
However if they do make it, they are 'expensive' strings in that they are likely to live a long time.
Thus it makes sense to de-dup at that point

Another nice aspect of this feature is that it does not need to be perfect (you dont' HAVE to dedup everything). Thus the hash table you keep can be of fixed size with 'replace on collision' semantics, which is simple and bounded, and tends to favor younger strings (all good things).

The expectation is that typical apps have 20% of their GC heap be strings. Some measurements we have seen is that for at least some applications, 10-30% of strings all may be duplicated, so this might save 2-3% of the GC heap. Not huge, but the feature is not that difficult either.

The first step to build enough of a prototype, so that we can run it on a number of interesting apps and get a feel for how much GC space would save us and how much overhead this would add to (gen 1 and gen 2) GCs.

sharwell commented 7 years ago

⚠️ This would make the lock (someString) syntax subject to arbitrary and uncontrollable collisions.

swgillespie commented 7 years ago

I feel like we already have this problem, since you can intern arbitrary strings already with string.Intern. Though you can query whether or not a string is interned with string.IsInterned, so it's perhaps a bit less arbitrary than this, locking on interned strings can cause behavior just as surprising as this.

sharwell commented 7 years ago

@swgillespie That is a very different problem, which can be described as "If you are going to lock on a string, don't intern it." This is easily understandable and avoidable if locking on a string is something you intend to do. The problem this proposal would introduce is "Don't lock on a string because someone else may randomly intern it even if you don't."

⚠️ This proposal also breaks sentinel strings which serve some intentional purpose through reference equality, such as the implementation discussed in comments in antlr/stringtemplate4#132.

danmoseley commented 7 years ago

@sharwell presumably we could add an app-level opt out for rare cases where a library depends on that.

I note that Java interns the backing char array, not the object, and note that avoids the problem of locking on the string (http://openjdk.java.net/jeps/192). @vancem is that possible? It could reduce the locality of the string object and the char array.

jkotas commented 7 years ago

We do not use separate arrays for strings like Java does. The string payload is inline (it is smaller that way).

ufcpp commented 7 years ago

For the purpose of deserializing text-based payloads, we can parse and compare numbers and property names directly from UTF8 binary to reduce heap allocations. For instance this Utf8Json library can deserialize JSON texts with very small allocations.

sharwell commented 7 years ago

presumably we could add an app-level opt out for rare cases where a library depends on that.

How would a user know? The example is part of a publicly distributed library, and is an implementation detail which is never intended to be part of the public documentation/usage requirements.

vancem commented 7 years ago

The possibility of someone using lock on a string object that @sharwell is something we would have to at least worry about. It was already a bad idea to use strings as lock object (because of literal interning), and we would want to make that very clear by adding a compiler error (it can't be perfect, but it is likely to catch most/all real world usage of strings as lock objects). Part of the investigation would be to see how common it is (and in particular does it really ever happen that the SAME string value is used for lock objects). There are mitigations (don't intern when we notice the string has been used for a lock), but hopefully the simpler solution of enforcing 'don't do that' will be enough...

I don't think this issue should block the prototype, but if someone wantsto run an analyzer over all nuget packages looking for locks on strings, that would be useful information to have.

benaadams commented 7 years ago

This would make the lock (someString) syntax subject to arbitrary and uncontrollable collisions.

Locking on a string can be problematic; if you use that string in as a dictionary key, or do anything else that calls GetHashcode on it will mutate its syncblock and change from being able to have fast locks to needing slow locks (though string's syncblock is weirder than the others anyway as it stores isAscii and things, so may already be a slow lock object?)

A good candidate for a lock is a vanilla object e.g. object _lock = new object(); that's not used for anything else - unless you are string to save space; when an object that GetHashcode() definitely isn't called on works too.

This proposal also breaks sentinel strings which serve some intentional purpose through reference equality, such as the implementation discussed in comments in antlr/stringtemplate4#132.

There is no String(string s) .ctor in .NET and at compile time any string in source is interned; you need to construct it from a char[] to avoid it. e.g. new String("key".ToCharArray())

Even have some tests confirming this behaviour - to ensure object.ReferenceEquals(s0, s1) can be called on 2 different strings specified in different places in source using quotes e.g. "a string" and elsewhere "a string"

e.g.

ReferenceEquals("a string", "a string")

Will return true

vancem commented 7 years ago

@benaadams - just FYI, literal strings are OFTEN interned, but are not guaranteed to be interned. In particular literal strings that are precompiled but live in different modules will not be interned (since at compile time we don't know about the other and we don't want to do the extra work at runtime to make interning perfect).

For quite a while we have wanted to get off this 'locks are on object' convention. The idea is that we make up a real 'Lock' object and you use that. This avoids all the weird sync-block stuff. We should probably open an issue on that as well.

But for now, the issue @sharwell brings up is valid, but there are mitigations (although not perfect).

FSou1 commented 7 years ago

@vancem Should the GC to remove strings from this table when they become into the 2nd generation and won't be used later?

benaadams commented 7 years ago

@FSou1 would do that as is normal now?

I assume the idea is to catch a string before it moves to Gen2; and either replace its reference with one already in Gen2 or one about to move to Gen2 (so only one string is promoted - rest are collected).

vancem commented 7 years ago

@vancem Should the GC to remove strings from this table when they become into the 2nd generation and won't be used later?

The table would hold weak handles to the strings. Thus when the strings die they naturally also die in the table (but there is logic necessary to reuse the weak handles).

Maoni0 commented 7 years ago

The topic of string de-dup had come up before. The reasons why we didn't implement it included the following:

1) the benefit is not so obvious as folks already pointed out so far on this thread.

2) java's strings' payload is separated from the header object which allows most of the de-dup work to happen on a background thread; we can't do that because ours is inline and increasing the stop-the-world pause is something we are reluctant to do unless there is a strong justification.

sharwell commented 7 years ago

⚠️ This could also break code where users intentionally or unintentionally modify string contents in unsafe code. It's been a long time now, but I have seen an application before that used unsafe to turn strings mutable as part of some allocation-reducing operation that for some reason wasn't using StringBuilder.

vancem commented 7 years ago

This could also break code where users intentionally or unintentionally modify string contents in unsafe code.

Modifying string contents is actually speced to be illegal (since strings are speced to be immutable, code does depend on this immutability). While I understand that it might still happen, I don't think we should be blocking improvements for the sake of illegal code.

benaadams commented 7 years ago

seen an application before that used unsafe to turn strings mutable...

Is normally done at creation, so would be a Gen0 string?

e.g.

string s = new string('\0', length);
fixed (char* pS = s)
{

As soon as the string gets used its "gone bad" as far as immutability is concerned; as flags get set on it which change code paths (like aforementioned IsAscii flag)

jkotas commented 7 years ago

Is normally done at creation, so would be a Gen0 string?

Not guaranteed. The GC can kick in right after string s = new string('\0', length); and move the string to Gen 2.

The CoreLib code that does this would need to be changed accordingly - to exclude strings being created from de-duping. That's fine because of this code is part of the runtime and fully under our control.

benaadams commented 7 years ago

Don't dedupe strings that start with a null \0 or dedupe to strings that are currently pinned?

vancem commented 7 years ago

It is true that strings need to be mutable when they are first created (via copy etc), and we want to exclude strings that are not done being mutated. We can probably cover most of the cases, but not de-duping any string that ends with a '\0'. Since I think all initializations of strings we do we do a memcpy to fill, when the last character has been initialized we can assume it is stable.

But you can see that there is some subtlety here...

benaadams commented 7 years ago

Subtleties aside; intrigued to see how a prototype plays out

Not sure how much of the SO 30,703,367 "System.String" objects (4,120.10 MB) strings are dupes https://github.com/dotnet/coreclr/issues/7083; might be some :)

redknightlois commented 7 years ago

@sharwell We are using that, but as @benaadams and @vancem said, as it is known and explicit knowledge that mutating strings is illegal, therefore you know you are "in for a bumpy ride" if you do that. Not now, because we essentially built our own String class called ByteString that works with a backing of unmanaged memory, but the overload of dupes was the reason why we are not using String anymore if we can avoid it.

IMHO this approach shouldnt be dismissed on the grounds of "what if" and a thorough investigation should be pursued with prototyping and actual field measurement. The potential is huge for most of the codebases I have worked over the years. Allocations are usually dominated by the string class, and when you look at the instances on dumps, allocations runs rampant with duplicates which in many cases are too risky to decide to intern for the general case. Something the runtime has information to deal with.

colgreen commented 7 years ago

The situation around certain file formats having lots of repeated strings can be addressed within specific libraries for handling those formats, i.e. by using String.Itern() during parsing. In fact I think System.Xml already does this to some extent, e.g. XmlNameTable(?) (not sure if it uses String.Intern or it's own dictionary of strings).

The GC approach sounds potentially problematic because although we may expect the address of an object to change (that's fundamental to using a managed environment), we don't currently expect that obj1 == obj2 will evaluate differently seemingly at random times. To me that seems like it could unleash a bunch of unintended consequences, i.e. break code where that assumption has been made.

Interesting idea though.

benaadams commented 7 years ago

If the string values are equal than obj1 == obj2 will evaluate to true whether or not they are the same object. If the two string are infact the same object then the equality will fast path through object reference equals rather than comparing every char, so it may be an extra win there also.

redknightlois commented 7 years ago

@colgreen The issue with .Intern() and it should be deal with care is that whenever you decide you are done, there is nothing else you can do about it... So, you can have spikes of similar stuff that you decide to stay and that eventually means that there is no reclamation policy. You cannot .Unintern() it later, something the GC can. Very bad for servers that must have long uptimes.

Also the documentation sais (emphasis is mine):

First, the memory allocated for interned String objects is not likely be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates. Second, to intern a string, you must first create the string. The memory used by the String object must still be allocated, even though the memory will eventually be garbage collected.

In this case the second will still happen for Gen0 but whenever you are going to move it up, you are a lookup away to do the right thing. With the added benefit that if you are running a Gen2 collection and the reference is dead, you can reclaim that spot for someone else.

There is little downside (mostly on already illegal constructs or just absurdly dangerous like not using a sync-root), and a lot of upside to let the GC deal with interning automatically.

colgreen commented 7 years ago

@benaadams Point taken. The equality test will however have taken different execution paths, i.e. I should have said "object.ReferenceEquals(obj1, obj2) will evaluate differently seemingly at random times".

equality will fast path ... so it may be an extra win there also.

Yeh fair point.

colgreen commented 7 years ago

There is little downside (mostly on already illegal constructs or just absurdly dangerous like not using a sync-root)

I certainly agree that any issue around sync locks on string objects are moot, i.e. it's a horrible pattern that I can't see being a problem if it's made into a compile time warning or error, or perhaps even a runtime check.

mikedn commented 7 years ago

The situation around certain file formats having lots of repeated strings can be addressed within specific libraries for handling those formats, i.e. by using String.Itern() during parsing. In fact I think System.Xml already does this to some extent, e.g. XmlNameTable(?) (not sure if it uses String.Intern or it's own dictionary of strings).

Yes, SystemXml does avoid creating duplicates via "name tables". No, it doesn't use String.Intern. Overall it does a pretty good job IMO, a job that can be replicated by any other library that works with lots of strings that have a high chance of being duplicated:

String.Intern is problematic due to lifetime issue. SystemXml's manual interning doesn't have such issues.
To use String.Intern you have to have a string in the first place. And if you don't? Well, you allocate a string and then intern it. SystemXml's manual interning also works with char[].
SystemXml's manual interning allows it to use reference equality on strings. The automated solution discussed here cannot be relied upon because interning is randomly delayed.
SystemXml's manual interning chooses what to intern (e.g. XML element names, attribute names), strings that are likely to be duplicated and are relatively small in size. Automated string interning has no clue what it is doing, it may very well end up doing busy work.

redknightlois commented 7 years ago

@mikedn True, it may be busy work. However copy stuff that is already there to Gen2 is busy work too. So that begs the question, which one is the most efficient busy work in the general case. That is what should be studied and act upon.

danmoseley commented 7 years ago

Just wanted to note this is marked as up-for-grabs as we believe this is something that need not require someone on the .NET team. Anyone interested? Could be rewarding/interesting.

Rattenkrieg commented 7 years ago

@danmosemsft actually I have been working on it already about a month on evenings :p Designed an algorithm polite to STW and did about 1k lines.

pentp commented 5 years ago

The locking issue can be somewhat mitigated by not de-duping strings that have locking and/or hashcode info set in the sync-block (BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX is set). This avoids any observable changes for RuntimeHelpers.GetHashCode (used in object identity based dictionaries for example) and locking issues are mitigated, but not entirely avoided by this (thin locks could have been used on the object already).

bill-godfrey-celaton commented 5 years ago

Could this be extended to other classes in addition to System.String?

If an object's class includes a function with an attribute, have the GC call that function when it moves that object into a different generation.
Allow code to request that the framework find all references to object A and replace them with references to duplicate object B.

jhudsoncedaron commented 5 years ago

In reading all of this, I feel the best way to maintain compatibility is to institute the rule "don't deduplicate a string create from String.Copy()"

Once long ago I used to construct strings with Spc(number) [hint: Microsoft.VisualBasic.dll] than proceed to manipulate the memory arena before passing the strings to other modules. That code doesn't exist anymore, but it kinda serves a point now. There aren't any really good heuristics for when a string is going to be stomped on, except for String.Copy() which pretty much says I'm going to. And even then there'll be some mistakes.

GSPP commented 5 years ago

This feature would introduce non-determinism in the sense that any reference of type string or object can now change between any two instructions. !ReferenceEquals(a, b) && ReferenceEquals(a, b) can now be true. References inside of hash tables can change. The resulting bugs are extremely hard to find and non-intuitive.

This feels very much against the spirit of .NET. We want deterministic and platform agnostic code.

Maybe this is valuable as an opt-in for power users who know what they are doing and require the performance gain.

Maybe we can instead expose a managed API that can be used to enumerate the heap and replace object references. That way this can be implemented as a library. But then again I wonder... do we even want to expose such an API? We don't want this to be abused. We don't want to encourage brittle programming paradigms such as enumerating the heap and mutating it. Who knows what atrocities will be committed using such an API.

benaadams commented 5 years ago

References inside of hash tables can change.

They do anyway unless they are pinned with a GCHandle, the GC moves them around. .GetHashCode() and .Equals however will return the same since its based on the contents of the string.

GSPP commented 5 years ago

@benaadams this is invisible at the managed code level. GetHashCode and Equals always behave deterministically even for new object(). Anything else would be a disaster.

mikedn commented 5 years ago

.GetHashCode() and .Equals however will return the same since its based on the contents of the string.

Assuming you're using String's own GetHashCode and Equals. One can build a custom equality comparer that relies on object identity instead (RuntimeHelpers.GetHashCode) assuming that the keys are strings that have been interned using a custom scheme, other than String.Intern. De-duping strings will probably break this and do so non-deterministically (e.g. if you're lucky your interned string will survive and other will be replaced by it).

benaadams commented 5 years ago

One can build a custom equality comparer that relies on object identity instead

RuntimeHelpers.GetHashCode changes the Object Header and I'd assume object headers state would be considered when deduping

weltkante commented 5 years ago

The scenario @GSPP was concerned about is probably not the GC moving things around, but the GC changing object identity. If the value of ReferenceEquals changes thats observable and may break code, for example hashtable using ReferenceEquals to resolve collisions.

So you have to assume strings having a hashcode in the Object Header means that RuntimeHelpers.GetHashCode has been called and the strings can't be merged. Otherwise, even if the hash code happened to be the same (for whatever reason), ReferenceEquals would have returned false before the merge, meaning hashtables would have considered it a collision. After the merge ReferenceEquals would return true and the hashtable would be broken even if the hashcode didn't change, because it now has two slots with the same object in it when it shouldn't have.

For what it's worth this pattern (identity based hashtables) is common in caching and the framework even has a builtin class ObjectIDGenerator for it.

pentp commented 5 years ago

Identify based dictionaries are not a problem at all because this can be detected from the object header and not de-duped: https://github.com/dotnet/coreclr/issues/14208#issuecomment-440714402

GSPP commented 5 years ago

Hash tables are one thing. But I find it very troubling that ReferenceEquals(a, b) can flip from false to true at any time. Breaking object identity is a fundamental change to the runtime.

If we determine that we want string deduping we can dedupe other things as well:

Boxed primitives and custom structs (could dedupe them at creation time as well)
Empty arrays
Custom application objects
Immutable collections

Just throwing some ideas into the ring... This would be possible with a heap enumeration API as a library. Also, heap enumeration could be used for custom cache eviction schemes as an alternative to GC handles. You can recycle an object if user code no longer references it. But you can retain your own reference normally.

benaadams commented 5 years ago

Boxed primitives

Yes; hence the concerns around Unsafe.Unbox returning a writable reference https://github.com/dotnet/corefx/issues/34258

custom structs (could dedupe them at creation time as well)

No; because they are the data not a pointer to the data. If you create on in a method it takes up stack space directly it doesn't point to somewhere which can be shared. Likewise if you have on in a class it takes up space directly in the object, it doesn't point somewhere which can be shared. Except boxed structs which are pointers as you outlined above.

GSPP commented 5 years ago

@benaadams yes, I did not phrase that well. I meant boxed custom structs. Also, any immutable DTO style class would be a candidate.

I have created a tickets for this: Add a managed API to enumerate and mutate the heap

jkotas commented 4 years ago

Design doc: #31971

jhudsoncedaron commented 4 years ago

Please don't auto-dedupe boxed custom structs. I use writable structs all the time.

weltkante commented 4 years ago

I don't think that was part of the discussion/proposal, so I don't think you have to worry about that.

rampaa commented 1 year ago

I've an app that deals with large quantity of long-living duplicate strings. I'm currently using the StringPool class from the Community Toolkit to mitigate the memory inefficiency partially but it's not as effective/efficient as this proposal would be. So a +1 from me.

dotnet / runtime

Improve Density of GC heap by String Interning (de-duping) on Gen2 GC #9022