Clarify what data should be static, on pipeline or on subsystem

KathleenDollard commented 1 month ago

Background

We are leaving the door open for multiple pipelines in simultaneous use (multiple declared and possibly running on different threads). This allows for a different set of subsystems for each scenario/pipeline.

We are using a single data storage location - a static strongly typed weak table - to support multiple threads.

The design demands that any particular CLI be created on a single thread and not modified once it is in use by multiple threads.

It does not currently block the same symbol from being used in multiple pipelines/threads.

Problem

Our discussion yesterday was that providers should be pipeline specific. This is problematic for two reasons:

In a multi-pipeline scenario, the CLI author would have things declared on a symbol available from two pipelines if it was defined via normal means, and not available on the second pipeline when a provider is used. Arguably a niche case, but super weird and a flag that we should look at this design
I see no way to associate a symbol to a pipeline. By intuition this seems necessary, and is necessary if anything outside the pipeline, or where the containing pipeline is not easily accessible needs to access the data.

Possible solutions

While we probably could work through these scenarios, there are at least two ways to avoid the problem:

Allow providers to be global and not attached to a pipeline. This provides the same effective scope for normal assignment and assignment via a provider. It also has the downside that if one pipeline runs a provider it will be run regardless of whether any of its symbols relate to that pipeline.
- I'm OK with this because these are currently theoretical scenarios and if they become real we can look for optimizations.
Change the API so that symbols are always associated with a pipeline. For example, we could require that all symbols be created from their parent symbol and the root of a subsystem (non-core) app be created from a pipeline.
- Hard not to cringe at such a massive API change to the core action of a CLI author. If we seriously consider this route, we should check behavior when symbols are created via called methods, with conditionals, from external sources, etc. Overall, its probably doable but seems an unnecessary pain.

Note on a later optimization for multiple pipelines:

If this scenario materializes, not supplying an optimization would result in all the descriptions for one pipeline being loaded because another pipeline needed its descriptions. This could be problematic.

A simple optimization would be to create a provider that knew the root of its associated CLI tree. This assumes that each pipeline is specific to a CLI tree, but I am OK with putting restrictions on such a niche scenario. Since each symbol can find its root during parsing but not necessarily before, this would also require handling of an unattached symbol, probably by just not running the provider.

My current opinion is that providers should be managed the same way we manage the underlying data, which is globally. If anything else at all needs to be moved globally, we should take a close look at design. But the relationship between providers and data storage is so intimate, I think we should have them similarly placed and static.

Which leads to the weird idea that providers themselves should be in the strongly typed weak table. That might require a common class for symbols and providers, which seems to be too weird an idea unless managing multiple strongly typed weak tables are difficult and it's not OK to just store providers, which are rare, in normal storage.

Looking forward to comments.

KathleenDollard commented 1 month ago

Storing providers with the annotations would probably allow us to block access to annotations, except through a pathway that always respected providers. I think this would be highly desirable.

mhutch commented 1 month ago

I don't hate the idea of providers being global but I do have a few thoughts/concerns.

It would certainly simplify the API so that subsystems don't have to use a different getter than authors. However, I think there may be value in having a distinction between "get the directly assigned annotation" and "get the resolved (direct or provider) annotation.

I don't think reuse of symbols between pipelines is a big concern... Firstly, having multiple pipelines and using providers are both somewhat advanced scenarios, and secondly, I don't think it's actually an issue if the same symbol is used in multiple pipelines and resolves different values for some annotation, as long as providers are clearly associated withpipelines. Devs can use the same provider in multiple pipelines if they want anyways.

Indeed, we may even want to consider altering the provider model to allow contextual resolution as a mechanism for customizing annotations (e g. Description) when a symbol appears multiple times in the same grammar - resolve the value of X annotation for Y symbol in the context of Z parent symbol and W pipeline.

My biggest concern, however, is that making providers global makes it difficult/impossible to scope them for GC/disposal.

KathleenDollard commented 1 month ago

I think that the ability to GC what providers create is essential, as that is how we expect people to handle large strings. I can see some providers that would have a small footprint, but some may not. And just in principle, we should strive to allow CLI authors to clean up prior to running high performance apps.

So, I agree, let's keep providers scoped to the pipeline. I remain somewhat concerned that if people get the value of a provider created pseudo property on a symbol from the symbol they will sometimes get a different value than they would get if they asked the pipeline. But I think your intuition that people will understand this if we make the API clear is correct. For example, if direct access on the symbol sometimes returns the value without the provider, it should never return the provider value. Of course, we can also add a pipeline feature that returns the value as used, particularly for debugging.

dotnet / command-line-api