Optimize `TypeEngine` for garbage collection

ironcev commented 2 months ago

NOTE: This is the draft version of the proposal. It lists the three major points that a "garbage-collection-friendly" TypeEngine should support, but still not in enough detail. The concrete architectural proposal is also missing, as well as the results of measurements from a real-world project that support the reasoning behind the proposal.

The numbers given in the below description are coming from the compilation of the Spark Orderbook workspace, a realistic real-world project.

Shareable `TypeInfo`s and their lifetime

We use the term shareable type as defined in #6613.

Examples of shareable types are all built-in types like, e.g., u64, but also types like, e.g., Option<u64> or MyStruct or MyGenericStruct<bool>.

The number of those types is orders of magnitude smaller then the overall number of types inside of the TypeEngine, and they can be heavily reused across all the projects and modules, which is currently not the case. (TODO: Add data from measurements.)

Currently, the TypeEngine optimized in #6613 reuses the TypeSourceInfos of shareable TypeInfos and not the TypeInfos themselves. This is sub-optimal because shareable types that cannot be changed in between garbage collections are still assigned a source_id and are being garbage collected. This table shows the content of the shareable_types hash map. It is evident that types like !, (), bool, etc are stored unnecessarily many times.

Nevers              :       35
String slices       :        8
String arrays       :       11
Unsigned integers   :      182
Enums               :      287
Structs             :      276
Booleans            :       26
Units               :       79
Tuples              :       12
B256s               :       47
Contracts           :        1
Error recoveries    :        2
Arrays              :       25
Raw pointers        :       31
Raw slices          :       21
Trait types         :        4

Those types should "live forever" and never be garbage collected.

Also, the shareable types like Option<u64> should be shareable across modules, means for different source_ids we should still point to a single shared Option<u64> instance. Note that this instance should be GCed if the definition of Option gets changed.

Distribution of `TypeInfo`s across `source_id`s

TypeInfos that should get garbage collected should be distributed towards the leafs of project and module dependency tree. This means, the types should have assigned, if possible of course, source_ids of the files that are likely to be changed, and those are always the files in the projects that developers actually work on (the leafs), and not the files in the dependencies, especially the standard library dependencies.

E.g., currently, if we have fn my_function<A>() -> Option<A> in the code we are editing (a leaf!), the non-shareable Option<A> type will get assigned the source_id of the original Option declaration, the one from the standard library. Which means it will never be GCed, although it should be GCed whenever we GC the current module the programmer is changing.

E.g., currently, all of the TypeInfo::Unknown types do not have source_id assigned, and not all of them get replace()ed in the engine. Out of ~41.000 inserted Unknowns, some ~4.500 remain unreplaced, and, not having source_id assigned, can never be GCed. This produces a constant small "memory leak" within the TypeEngine.

Time-performant garbage collection

TypeEngine should, ideally, offer an O(1) removal of files during the GC.

E.g., before the optimization done in #6613, the ConcurrentSlab use to hold ~510.000 elements. All those elements needed to be traversed during GC, to remove just a handful of types. After the optimization, the slab contains ~230.000 elements which speeds up the traversal, but sill has an O(n) complexity.

Architectural proposal

TODO: Write proposal.

tritao commented 2 months ago

A GC-friendly TypeEngine should never garbage-collect shareable types. The number of those types is orders of magnitude smaller then the overall number of types inside of the TypeEngine, and they can be heavily reused across all the projects and modules. Shareable types should "live forever" and never be GCed, and thus, never be assigned to any particular source_id. Examples of shareable types are all built-in types like, e.g., u64, but also types like, e.g., Option<u64> or MyStruct or MyGenericStruct<bool>.

The idea of shareable types is interesting and not something I really considered before outside the builtin core types. Could you maybe expand on this concept of a shareable type? Would they be std or just core types maybe? And would the main difference be they don't contain a source id? I am wondering if that could impact some diagnostic messages which might be pointing to some core types.

Also I get a bit unsure when you say that they should never be GC'd. First lets say we consider all std types to be shareable and thus not considered for GC. But what would happen if lets say, we change something in the std, and recompile a project that has been cached (and thus referencing the old std cached type ids). In that case, we would need to GC the old std types out of the type engine right?

3. A GC-friendly TypeEngine should, ideally, offer an O(1) removal of files during the GC.

A potential idea me and @JoshuaBatty previously discussed before would be the possibility of having separate per-module engines, an id would have a reserved space to contain the engine id. This would allow a O(1) removal of the engine data when the module is collected.

ironcev commented 2 months ago

Thanks for the early feedback and excellent questions @tritao! This issue description indeed requires more elaboration including the proper definitions and more concrete examples of measurements that justify proposed design. This first issue description is primarily created so that I can link to it in several TODOs left in the code that call for future changes in the TypeEngine. I'll elaborate more on all points including examples as well as explaining concepts in detail. After your question, I see that the description in the first point is pretty misleading. Also, it is important to add how it all works in sync with the DeclEngine during GC. And for the third point, how a solution could look like without having separate engines. Thanks a lot for the feedback and questions and stay tuned :smile:

FuelLabs / sway