NAOT publish build has a 1MB size delta between reflection free and not (+50% total size)

Sergio0694 commented 1 year ago

Description

I've tried out .NET 8 Preview 1 for my NAOT sample for ComputeSharp (https://github.com/Sergio0694/ComputeSharp), and noticed that the size delta between reflection free and now is particularly large, more than expected (talking with @MichalStrehovsky he mentioned he would've expected the different to be 3-400 KB).

Note: I am already enabling all size saving options on top of reflection free, to make the comparison fair (source):

<PropertyGroup Condition="'$(COMPUTESHARP_SWAPCHAIN_CLI_PUBLISH_AOT)' == 'true'">
  <PublishAot>true</PublishAot>
  <UseSystemResourceKeys>true</UseSystemResourceKeys>
  <InvariantGlobalization>true</InvariantGlobalization>
  <IlcGenerateStackTraceData>false</IlcGenerateStackTraceData>
  <IlcDisableReflection>true</IlcDisableReflection>
  <IlcOptimizationPreference>Speed</IlcOptimizationPreference>
</PropertyGroup>

IlcOptimizationPreference is set to Speed since the delta for that one is just ~50 KB, so not worth it in this scenario.

FYI @MichalStrehovsky @agocke

Reproduction Steps

Clone ComputeSharp (https://github.com/Sergio0694/ComputeSharp)
- For everyone to use the same reference point, let's say we checkout at 940f90df9b683b10a599300c498b38f1a72362a0)
Open /samples/ComputeSharp.Sample.Cli/ComputeSharp.Sample.Cli.csproj, change TFM to net8.0

Build the CLI sample from powershell:

$env:COMPUTESHARP_SWAPCHAIN_CLI_PUBLISH_AOT='true';
dotnet publish samples\ComputeSharp.SwapChain.Cli\ComputeSharp.SwapChain.Cli.csproj -r win-x64

Expected behavior

The delta between reflection enabled and reflection free should be around smaller.

Actual behavior

I get these results:

Reflection free: 1.892 KB
Reflection enabled: 2.821 KB

That's a 929 KB difference (+49.1%).

Other information

Here's the two .map.xml files for both configurations: computesharp.cli.reflection.map.zip.

ghost commented 1 year ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

Issue Details

### Description I've tried out .NET 8 Preview 1 for my NAOT sample for ComputeSharp (https://github.com/Sergio0694/ComputeSharp), and noticed that the size delta between reflection free and now is particularly large, more than expected (talking with @MichalStrehovsky he mentioned he would've expected the different to be 3-400 KB). Note: I am already enabling all size saving options on top of reflection free, to make the comparison fair ([source](https://github.com/Sergio0694/ComputeSharp/blob/940f90df9b683b10a599300c498b38f1a72362a0/samples/ComputeSharp.SwapChain.Cli/ComputeSharp.SwapChain.Cli.csproj#L27)): ```xml true true true false true Speed ``` `IlcOptimizationPreference` is set to `Speed` since the delta for that one is just ~50 KB, so not worth it in this scenario. FYI @MichalStrehovsky @agocke ### Reproduction Steps - Clone ComputeSharp (https://github.com/Sergio0694/ComputeSharp) - For everyone to use the same reference point, let's say we checkout at 940f90df9b683b10a599300c498b38f1a72362a0) - Open `/samples/ComputeSharp.Sample.Cli/ComputeSharp.Sample.Cli.csproj`, change TFM to `net8.0` - Build the CLI sample from powershell: ```powershell $env:COMPUTESHARP_SWAPCHAIN_CLI_PUBLISH_AOT='true'; dotnet publish samples\ComputeSharp.SwapChain.Cli\ComputeSharp.SwapChain.Cli.csproj -r win-x64 ``` ### Expected behavior The delta between reflection enabled and reflection free should be around smaller. ### Actual behavior I get these results: - Reflection free: 1.892 KB - Reflection enabled: **2.821 KB** That's a 929 KB difference (**+49.1%**). ### Other information Here's the two .map.xml files for both configurations: [computesharp.cli.reflection.map.zip](https://github.com/dotnet/runtime/files/10825449/computesharp.cli.reflection.map.zip).

Author:	Sergio0694
Assignees:	-
Labels:	`untriaged`, `area-NativeAOT-coreclr`
Milestone:	-

jkotas commented 1 year ago

DebuggerSupport is enabled by default that in turn brings in a lot of reflectable cruft, for example: https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.Linq/src/ILLink/ILLink.Descriptors.xml#L2-L7 You can shave 350kB by adding <DebuggerSupport>false</DebuggerSupport> to your sample. Should we set DebuggerSupport to false by default for NativeAOT publish? Blazor has prior art https://github.com/dotnet/sdk/blob/86117ad328025825f25f88389077d9eee2dc9031/src/BlazorWasmSdk/Targets/Microsoft.NET.Sdk.BlazorWebAssembly.6_0.targets#L53.
Delegate targets are reflectable by default. It means that delegates are more expensive than in reflection-free mode. Do we want to have a way to disable delegate reflectability?

The sample uses Linq that amplifies impact of the previous two points. You may want to get rid of Linq from the parts of Community Toolking that are meant to be lean. Here is example of the dependency chain from whydgml tool:

() Reflectable method: [CommunityToolkit.Diagnostics]CommunityToolkit.Diagnostics.TypeExtensions+<>c.<ToTypeString>b__2_2(Type)
(Target of a delegate) __DelegateCtor_S_P_CoreLib_System_Delegate__InitializeClosedInstance__CommunityToolkit_Diagnostics_CommunityToolkit_Diagnostics_TypeExtensions___c___ToTypeString_b__2_2
(ldftn) CommunityToolkit_Diagnostics_CommunityToolkit_Diagnostics_TypeExtensions___ToTypeString_g__FormatDisplayString_2_0
  (call) CommunityToolkit_Diagnostics_CommunityToolkit_Diagnostics_TypeExtensions___c___ToTypeString_b__2_1
    (reloc) __DelegateCtor_S_P_CoreLib_System_Delegate__InitializeClosedInstance__CommunityToolkit_Diagnostics_CommunityToolkit_Diagnostics_TypeExtensions___c___ToTypeString_b__2_1
      (ldftn) CommunityToolkit_Diagnostics_CommunityToolkit_Diagnostics_TypeExtensions__ToTypeString
        (call) CommunityToolkit_Diagnostics_CommunityToolkit_Diagnostics_Guard_ThrowHelper__ThrowArgumentNullExceptionForIsNotNull<System___Canon>
          (call) CommunityToolkit_Diagnostics_CommunityToolkit_Diagnostics_Guard__IsNotNull<System___Canon>
            (call) ComputeSharp_ComputeSharp_GraphicsDeviceExtensions__ForEach_0<computesharp_cli_ComputeSharp_SwapChain_Shaders_Octagrams__ComputeSharp_Core_ComputeSharp_Float4>
              (call) computesharp_cli_ComputeSharp_SwapChain_Backend_SwapChainApplication_1<computesharp_cli_ComputeSharp_SwapChain_Shaders_Octagrams>__OnUpdate
                (Virtual method) (??_7computesharp_cli_ComputeSharp_SwapChain_Backend_SwapChainApplication_1<computesharp_cli_ComputeSharp_SwapChain_Shaders_Octagrams>@@6B@ constructed, VirtualMethodUse [computesharp.cli]ComputeSharp.SwapChain.Backend.Win32Application.OnUpdate(TimeSpan))
                  (Primary) ??_7computesharp_cli_ComputeSharp_SwapChain_Backend_SwapChainApplication_1<computesharp_cli_ComputeSharp_SwapChain_Shaders_Octagrams>@@6B@ constructed
                    (newobj) computesharp_cli_Program___Main__

Sergio0694 commented 1 year ago

Ooh that's very interesting, thank you! 😄

"You can shave 350kB by adding <DebuggerSupport>false</DebuggerSupport> to your sample."

TIL, I can certainly do that! FWIW it does make sense to me for it to be on by default in published builds.

"Delegate targets are reflectable by default. It means that delegates are more expensive than in reflection-free mode. Do we want to have a way to disable delegate reflectability?"

Of course I can't comment on how complex this would be to enable, but having an option for that (assuming the linker can't just automatically strip that support if it seens that you're never reflecting on delegates) certainly sounds nice. Do you also have an idea of the amount of size (like, ballpark) such an option could save in an example like this? 🤔

"The sample uses Linq that amplifies impact of the previous two points. You may want to get rid of Linq from the parts of Community Toolking that are meant to be lean. Here is example of the dependency chain from whydgml tool:"

Oh that completely slipped my mind - those are in a slow path (that code is only used to format a nice exception message if a throw helper fails), but of course that'd still cause more stuff to be included in all cases, right. I can definitely remove that and see how much it helps, and then I guess I could also just drop that package reference entirely here and just use some internal throw helpers - since this is a library and not an app I guess having one less dependency isn't a bad idea anyway.

Will share updated numbers here when I have time to make the changes and re-run the tests 🙌

jkotas commented 1 year ago

Do you also have an idea of the amount of size (like, ballpark) such an option could save in an example like this?

I have seen it on number of paths. It is hard to guess the absolute number by looking at the dependency logs. The easiest way to find out the absolute number would be to build the compiler with the delegate reflectability commented out and check the difference.

KeterSCP commented 1 year ago

I also noticed some size regression on simple Hello World app when using following settings:

<PropertyGroup>
    <PublishAot>true</PublishAot>
    <DebuggerSupport>false</DebuggerSupport>
    <InvariantGlobalization>true</InvariantGlobalization>
    <UseSystemResourceKeys>true</UseSystemResourceKeys>

    <IlcOptimizationPreference>Size</IlcOptimizationPreference>
    <IlcFoldIdenticalMethodBodies>true</IlcFoldIdenticalMethodBodies>
    <IlcDisableReflection>true</IlcDisableReflection>
    <IlcGenerateCompleteTypeMetadata>false</IlcGenerateCompleteTypeMetadata>
    <IlcGenerateStackTraceData>false</IlcGenerateStackTraceData>
</PropertyGroup>

.NET 7 size: 968 KB .NET 8 size: 1095 KB

Sergio0694 commented 1 year ago

On the "disable delegate reflectability" topic - would such an option also require adding some annotations to make it safe? Eg. I assume Delegate.Method would have to be marked as nullable, as that metadata info would could then be removed? 🤔

jkotas commented 1 year ago

I assume Delegate.Method would have to be marked as nullable

That would be unacceptable breaking change. Delegate.Method would throw exception with this option. To make it "safe", the compiler can produce warnings/errors for places that call it.

jkotas commented 1 year ago

Or we can build it as an optimization that kicks in automatically: disable the delegate target reflectability when the program does not contain any Delegate.Method calls.

Sergio0694 commented 1 year ago

Alright I made a PR to drop the CommunityToolkit.Diagnostics dependency entirely and re-run the NAOT publishing 🙂

I got the following (with reflection enabled):

2.493 KB (so -328 KB)
computesharp.cli.map.xml

And if I also set <DebuggerSupport>false</DebuggerSupport>:

2.441 KB (so uh... -52 KB?)
computesharp.cli.map.xml

And for reference, this is the result if I leave reflection free mode set in that branch:

1.850 KB (so another 42 KB down from main)
computesharp.cli.map.xml

I'm very confused as to why disabling debugger support only saves another 52 KB on this branch 🤔

Anyway with the changes from that branch it seems the delta between reflection free is down to 591 KB. That's still seems considerable given in this case it's about a 32% size increase.

@jkotas let me know if you find anything else that you find interesting from the new map file from that branch, curious if that might show some other offenders that might've been less noticeable before. Also more than happy to help test some new compiler flag for that delegate reflection mode you mentioned, in case anyone gets to implement that 🙂

And thank you for spotting that issue with those Toolkit APIs, I had no idea they were having a noticeable impact here.

jkotas commented 1 year ago

I'm very confused as to why disabling debugger support only saves another 52 KB on this branch

DebuggerSupport impact was amplified by CommunityToolkit.Diagnostics dependency and Linq. Now that these are gone, the impact is less.

Sergio0694 commented 1 year ago

Right, that makes sense. I guess my surprise was due to how that one use case and dependency was contributing so much compared to all the rest of the code involved, which is still a very considerable amount (both in ComputeSharp and the TerraFX.Interop.* dependencies in particular). I mean I found it very interesting give that was literally just a single method ending up rooting all that extra fluff, is all 😄

Sergio0694 commented 1 year ago

"Delegate.Method would throw exception with this option."

We could throw a MissingMetadataException for old times' sake 😆

/s

Sergio0694 commented 1 year ago

Run some more tests with the options mentioned in https://github.com/dotnet/runtime/issues/82607#issuecomment-1444625047 as well (thanks @KeterSCP!):

<IlcFoldIdenticalMethodBodies> set to false: 1.780 KB (saved 70 KB)
- Just curious, why is this not enabled by default, does it have any downsides?
<IlcGenerateCompleteTypeMetadata> set to false: same, I think it's included by reflection free already?
- Also is this one documented and/or supported?
Both of them, but with reflection enabled: 2.327 KB (so -114 KB from the previous best with reflection enabled)
Just <IlcGenerateCompleteTypeMetadata>, with reflection enabled: 2.441 KB, so same as previous test
- It would seem this setting just never does anything both in reflection enabled and reflection free mode

I'll take the 70 KB of extra improvement with <IlcFoldIdenticalMethodBodies> 😄

Current deltas with updated settings for both scenarios:

Reflection free: 1.780 KB
Reflection enabled: 2.327 KB (so 547 KB of delta, ie. +30.7%)

And looking forwards to seeing that decrease even more in the future 🙌

jkotas commented 1 year ago

<IlcFoldIdenticalMethodBodies> Just curious, why is this not enabled by default, does it have any downsides?

It breaks Delegate.Method and it makes stacktraces - both under debugger and at runtime (e.g. in Exception logs) - look non-sensical.

jkotas commented 1 year ago

<IlcGenerateCompleteTypeMetadata> set to false: same,

IlcGenerateCompleteTypeMetadata is false by default everywhere. It is a troubleshooting aid documented at: https://github.com/dotnet/runtime/blob/main/src/coreclr/nativeaot/docs/optimizing.md#options-related-to-trimming

Sergio0694 commented 1 year ago

I see, thank you! 🙂

Some more additional info: I double checked with .NET 7 as well following up on what @KeterSCP said, and noticed I also had a slight size regression in .NET 8, though only in reflection free mode. Here's a full table of results in case it's useful:

TFM/SDK	Reflection	Size
.NET 7	enabled	2.775 KB
.NET 8	enabled	2.327 KB
.NET 7	disabled	1.714 KB
.NET 8	disabled	1.780 KB

So in each category (reflection enabled/disabled):

.NET 8 reduced size by 448 KB with reflection enabled
.NET 8 regressed size by 66 KB with reflection disabled

Surprising that .NET 8 both improved (quite a bit, that's nice!) with reflection enabled, yet it does worse with it disabled.

For context, these are the settings I've used to produce all numbers in this table:

<DebuggerSupport>false</DebuggerSupport>
<UseSystemResourceKeys>true</UseSystemResourceKeys>
<InvariantGlobalization>true</InvariantGlobalization>
<IlcFoldIdenticalMethodBodies>true</IlcFoldIdenticalMethodBodies>
<IlcGenerateStackTraceData>false</IlcGenerateStackTraceData>
<IlcOptimizationPreference>Speed</IlcOptimizationPreference>

Of course, with <IlcDisableReflection> being added on top of these in some cases.

jkotas commented 1 year ago

size regression in .NET 8 only in reflection free mode

We do not pay a lot of attention to reflection free mode. I remember a few changes that regressed the reflection free mode. It is actually becoming a pain to keep the reflection free mode working as we work on improvement of the default mode.

jkotas commented 1 year ago

.NET 8 enabled 2.327 KB .NET 8 disabled 1.780 KB

Breakdown of this delta:

420 KB of this difference is reflection engine code (System.Private.Reflection.*)
127 KB of this difference is reflection data (reflection data for delegate targets, custom attributes, etc.)

A lot of the reflection engine code is brought in by initialization of the reflection engine during startup - we need to refactor the reflection engine to make this mandatory piece much smaller. Additional parts of the reflection engine code are brought in through various places in the libraries. For example, TerraFX depends on reflection through this:

(Secondary) VirtualMethodUse [S.P.CoreLib]System.Reflection.CustomAttributeData.get_AttributeType()
 (callvirt) S_P_CoreLib_System_Runtime_InteropServices_NativeLibrary__GetDllImportSearchPath
   (call) S_P_CoreLib_System_Runtime_InteropServices_NativeLibrary__LoadLibraryByName
     (call) S_P_CoreLib_System_Runtime_InteropServices_NativeLibrary__Load_0
       (call) TerraFX_Interop_Windows_TerraFX_Interop_Windows_Windows__OnDllImport
         (reloc) __DelegateCtor_S_P_CoreLib_System_Delegate__InitializeOpenStaticThunk__TerraFX_Interop_Windows_TerraFX_Interop_Windows_Windows__OnDllImport__S_P_CoreLib_System_Runtime_InteropServices_DllImportResolver__InvokeOpenStaticThunk
           (ldftn) TerraFX_Interop_Windows_TerraFX_Interop_Windows_Windows___cctor

(This is output of whydgml tool.)

Sergio0694 commented 1 year ago

Ooh very interesting, thank you for taking another look! 👀

Just so I understand then, potential areas for improvements are:

(NAOT) General work to shrink the size of System.Private.Reflection.* on its own (for those 420 KB)
(NAOT) Potentially improving the linker and/or having more options to trim more aggressively (eg. delegate metadata)
(Application) Investigating other bits that might be accidentally rooting reflection stuff

cc. @tannergooding about that NativeLibrary support rooting reflection stuff. It seems unfortunate that that's always brought in (I assume it's because that hoot is set in the static constructor?), even if the consuming application isn't using that feature at all (eg. my CLI sample isn't using that). Is there a way to make that hook be pay for play in TerraFX? 🤔

Or alternatively, is this something that ILLink could potentially improve and be able to trim on its own?

"(This is output of whydgml tool.)"

I'm not familiar with that tool, but @SingleAccretion kindly pointed me towards some docs for it (he mentioned debugging-aot-compilers.md). How should that be used to identify callsites that are rooting reflection stuff in an application, do I just run that and look for any [S.P.CoreLib]System.Reflection.* bits in any of those call chains?

tannergooding commented 1 year ago

There's a few different options.

The tooling should likely understand the hook and treat it as unused if its statically resolved all native imports or if they want to indicate they won't provide a hook/etc.

But, TerraFX can itself expose some static readonly bool switch and ComputeSharp can tell NAOT how to treat it in the interim (this is how things like InvariantGlobalization work behind the scenes: https://github.com/dotnet/runtime/blob/main/docs/workflow/trimming/feature-switches.md and https://github.com/dotnet/runtime/blob/main/src/coreclr/nativeaot/docs/optimizing.md).

jkotas commented 1 year ago

There's a few different options.

Can you move the registration of the hook out of the static constructor and do it lazily only once somebody subscribes to the ResolveLibrary event handler?

What are the scenarios that the ResolveLibrary event handler is for?

The tooling should likely understand the hook

There are many rarely used APIs that the tooling can understand to trim some extra code. I do not think we will ever want to complicate the tooling for that. Libraries should avoid calling these APIs when do not need to call them in the first place.

Sergio0694 commented 1 year ago

"Can you move the registration of the hook out of the static constructor and do it lazily only once somebody subscribes to the ResolveLibrary event handler?"

That seems like a very good idea definitely worth trying out 👀

I assume you mean something like this?

private static event DllImportResolver? _resolveLibrary;

public static event DllImportResolver? ResolveLibrary
{
    add
    {
        SetupFirstTimeHookIfNeededWithALock();

        _resolveLibrary += value;
    }
    remove => _resolveLibrary -= value;
}

"What are the scenarios that the ResolveLibrary event handler is for?"

It can be used by consumerss to properly resolve dependencies they're bundling with their applications (eg. for redistributable libraries such as dxcompiler.dll and dxil.dll) which might not always be right next to the executable, and would otherwise crash. Eg. I'm using this in ComputeSharp.Dynamic (see here) (which supports runtime shader compilation and as such needs those two DLL-s I just mentioned) to ensure they're properly resolved in all possible scenarios (self-contained, local build, Debug/Release, through NuGet, etc.).

tannergooding commented 1 year ago

What are the scenarios that the ResolveLibrary event handler is for?

Native library resolution, particularly for cross platform code or Linux in general, can be particularly tricky.

Consider vulkan where on Windows it is vulkan.dll or vulkan-1.dll, depending. Or on Linux where it can be libvulkan.so.1 vs libvulkan.so vs libvulkan.so.1.3.239, vs ...

This allows app authors to customize the resolution logic to ensure that they can succeed even when running on a setup where the default resolution logic might fail. -- It's also required so that we can have one set of bindings and still resolve on a platform where the default search names don't always work.

Can you move the registration of the hook out of the static constructor and do it lazily only once somebody subscribes to the ResolveLibrary event handler?

Yes, but that's also slightly problematic. The way SetDllImportResolver works is that you get exactly one hook per Assembly. Once it's been set, you cannot remove the hook and other people cannot provide their own hooks/fallbacks for that assembly.

TerraFX effectively "has" to set it up front because of this, as if it didn't there wouldn't be a single location for everyone to provide extended resolution logic through and its own per platform resolution logic would also not light up (this is less important for TerraFX.Interop.Windows, but more important for other interop bindings provided).

jkotas commented 1 year ago

SetDllImportResolver works is that you get exactly one hook per Assembly

Yes, it was meant for scenarios where the library itself wants to control where its dependencies are resolved from.

AssemblyLoadContext.ResolvingUnmanagedDll is more appropriate for situations where the default resolution logic does not work for the app and the apps wants to augment the logic.

jkotas commented 1 year ago

Yes, but that's also slightly problematic. The way SetDllImportResolver works is that you get exactly one hook per Assembly. Once it's been set, you cannot remove the hook and other people cannot provide their own hooks/fallbacks for that assembly.

If somebody wants to break this, they can still get their hook registered before the TerraFx one. I do not think you need to be protecting against people registering the hook for your library when they are not supposed to.

tannergooding commented 1 year ago

AssemblyLoadContext.ResolvingUnmanagedDll is more appropriate for situations where the default resolution logic does not work for the app and the apps wants to augment the logic.

Right. But the general issue is that the library is most often not the determiner here. It's often a factor of the consuming application and environment instead and there are many scenarios where an application may itself need to tweak or customize resolution logic, particularly when considering plugins or other scenarios.

Even for Windows, you can scenarios like d3dcompiler_*.dll where * is the number of the "latest" release (and with Windows shipping _33 through _43 and _47 today).

There is a general purpose alternative via AppContext switches. This is how all of the S.P.Corelib and other trimming options end up working. Such a setup then works for "any" of the interop binding libraries and is opt-in by the application, not just the ones that by default don't have resolution customization.

I'd personally prefer that to be used as the approach so that my own libraries can be self-consistent with eachother.

jkotas commented 1 year ago

There is a general purpose alternative via AppContext switches. This is how all of the S.P.Corelib and other trimming options end up working.

In general, these AppContext switches are workarounds for legacy designs - to avoid breaking 20 years old APIs. If we were to do this again, we would design the APIs such that they are pay-for-play by design and these AppContext switches are not necessary for pay-for-play.

tannergooding commented 1 year ago

Everyone (eventually) has "legacy design" and everyone (eventually) has functionality that they would like to make opt-in or pay for play so they are being used for pay for play features regardless.

The JIT handling static readonly bool as constant via rejit and AppContext switches being trivially settable via runtimeconfig makes this a very simple and fairly natural pattern.

If not AppContext switches then there needs to be an alternative for developers to achieve the same general effect that way certain code can be trimmed via opt-in. This ranges from loggic, to additional validation features, to features that may be less frequently used but which exist to handle known edge cases that consumers do hit, but which may be unused in the scenario a particular consumer is in.

jkotas commented 1 year ago

People can certainly use the appcontext switches just like CoreLib. I am just saying that it is the less desirable option with worse experience, and that the designs that do not require these should be preferred.

tannergooding commented 1 year ago

For me, the currently suggested alternative is not an available design because it makes the general experience of TerraFX.Interop.Windows inconsistent with the other TerraFX.Interop.* libraries I provide.

Anything done for TerraFX.Interop.Windows needs to also be a valid option to achieve the same general desired outcome with TerraFX.Interop.Vulkan or TerraFX.Interop.Xlib or TerraFX.Interop.*. -- In this case, the desired outcome is the trimming of the SetDllImportResolver call.

If there is another alternative that would work equally for all the required scenarios, so the general end-user experience can be consistent, then I am happy to hear it.

jkotas commented 1 year ago

I do not understand. Where is the inconsistency with registering SetDllImportResolver late only once the app wants to customize the .dll loading? Note that the customization of .dll loading always brings in some reflection - there is no way around it.

tannergooding commented 1 year ago

The root issue here is that a consumer of TerraFX.Interop.* is in a scenario where they do not need to customize the DllImport resolution logic and therefore would like to see it trimmed.

Switching it to be "lazy" only when a user first does ResolveLibrary += Handler works for setups like TerraFX.Interop.Windows where there is not currently any per platform configuration required. It does not then also work for a setup like TerraFX.Interop.Vulkan where there exists a default handler to catch the various names which "might" exist. It also then "breaks" if Windows ends up changing to needing some default handler.

Thus a consumer of TerraFX.Interop.Vulkan is effectively stuck with the inability to trim said support, even if they were shipping vulkan SxS with the app and ensuring that it matched the default lookup name.

This provides an inconsistent experience between the various TerraFX.Interop.* libraries and if such support were added to Vulkan in the future, there would now be two ways to get the functionality for TerraFX.Interop.Windows.

jkotas commented 1 year ago

I would not describe this as an inconsistent experience.

TerraFX.Interop.Windows does less by default, so it can bring in less by default (it does not today). No AppContext switches should be needed for that.

TerraFX.Interop.Vulkan does more by default, so it has to bring in more by default. If it is important for the users of your library to avoid this cruft where possible while maintaining the current behavior, you need to have opt-in mechanisms (AppContext switch).

tannergooding commented 1 year ago

I would not describe this as an inconsistent experience.

I think we'll just have to disagree on this point.

To me, the general desired experience exists for both and Interop.Windows needing to do more today is an implementation detail that most users will not care about. They will simply see that consuming the two libraries has subtle differences in how things work and what they must do to achieve the same general desired goal.

If an AppContext switch is the only solution for Interop.Vulkan and it is an equally viable solution for other scenarios, then using that provides an overall consistent experience across both. It also ensures that the Interop.Windows doesn't have to worry about breaking that "different" support via future changes (e.g. if it needs to start providing "default" resolution logic for one of the libraries it supports).

This could be made a lot simpler on the end user as well by simplifying the experience around setting and trimming AppContext switches.

That is, the linker/trimmer/aot could instead make this as simple as just <RuntimeHostConfigurationOption Include="TerraFX.Interop.Windows.DisableResolveLibraryHook" Value="true" Trim="true" /> and then recognizing the basic pattern around AppContext.TryGetSwitch("constantstring", out var result). There shouldn't be an additional requirement to also define an explicit ILLink.Substitutions.xml file be provided and it duplicate much of the relevant logic.

Provided that TryGetSwitch handles constant strings and provided the version bubble is correct, it provides a nice, simple, and general purpose way for anyone to get this functionality.

jkotas commented 1 year ago

If an AppContext switch is the only solution for Interop.Vulkan and it is an equally viable solution for other scenarios, then using that provides an overall consistent experience across both

That's not how we approach this sort of problems in dotnet/runtime and dotnet/aspnetcore libraries. We always prefer a solution without appcontext switch that just works out of the box in the common cases that we care about. We use appcontext switches only when the solution without appcontext switch is not viable. The solutions differ on library-by-library and case-by-case basis. We can certainly build more appcontext switches. We are intentionally not doing that, as appcontext switches provide worse experience. It also means that we are not very motivated to make implementation of appcontext switches simple.

You are certainly free to use different approach for your libraries.

tannergooding commented 1 year ago

I think the difference is that the BCL is often providing wildly different scenarios where there is not a close correlation to how something might be supported or desired in one vs another.

In my case, I'm providing interop binding libraries and while what they bind against might differ, the general look and feel as well as end user experience and layout of them is very similar/consistent.

So while I understand that something like Globalization vs Json in the BCL might have wildly different exposure mechanisms, the same doesn't necessarily hold true for two the interop libraries or two similar config switches within Json itself or in Json vs another serialization library.

tannergooding commented 1 year ago

I've put up https://github.com/terrafx/terrafx.interop.windows/pull/341 which enables the resolver to be trimmed via opt-in.

I also noted a few different edge cases in the trimmer:

The static constructor is left with an if (true) ; block, so it isn't fully removed (this represents additional jit work and also prevents trimming things that are now fully dead)
Explicit removal of the static constructor doesn't work, it results in incorrect IL metadata
While you can set a static readonly T (for primitive T) to be a constant value, you cannot then access the field directly. Doing so leaves the initialization logic and other "dead code" still in the assembly. Accessing the field indirectly via a property does work as expected.

jkotas commented 1 year ago

I do not see the difference in what the BCL is providing vs. what your libraries are providing. The differences that I see is in principle approach:

BCL: Best possible trim-friendly behavior out of the box for common case. Use AppContext switches only when absolutely necessary and when the benefit is large.
TerraFX: Suboptimimal trim-friendly behavior out of the box. Prefer use of AppContext switches to get best possible trim-friendly behavior, even for corner cases or when the benefit is small.

tannergooding commented 1 year ago

I wouldn't say its sub-optimal.

The savings here for the TerraFX.Samples.DirectX sample is about 1kb trimmed. It's about 4kb for the default PublishAot experience (without disabling any runtime knobs). Neither of those are large and so its something that's being provided due to a "customer ask" since its an overall simple and reasonable ask.

It's about providing a consistent experience across the "terrafx ecosystem" (that is all libraries provided by https://github.com/terrafx/*) and consistency is overall a more important feature in my eyes. Providing inconsistency across the ecosystem of libraries I'm exposing just leads to user pain, even for things like this that may be viewed as edge cases. It leads to needing to say "do this, except for when you're in this scenario, then you have to do this other thing instead".

With AppContext switches, it becomes a very simple/trivial of "add this one line to your csproj or dir.build.props": <RuntimeHostConfigurationOption Include="TerraFX.Interop.Windows.DisableResolveLibraryHook" Value="true" Trim="true" />

This is simple for users, its easy to understand, and it works one way everywhere. There are no special annotations or considerations. Simply a "if you want this, do this" and leaves it as unimportant for the general case/scenario.

It also extends to other optional features such zero-cost + trimmable runtime validation, opt-in padding, or stat tracking in TerraFX.Interop.Mimalloc. Which are much more broadly used/applicable

-- As an aside it also looks like initializing static readonly field via the linker substitutions is currently trim incompatible. Doing <field name="s_disableResolveLibraryHook" value="true" /> works, but doing <field name="s_disableResolveLibraryHook" value="true" initialize="true" /> results in:

System.NotSupportedException: Specified method is not supported.
     at ILCompiler.ProcessLinkerXmlBase.ProcessXml(Boolean) in /_/src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/ProcessLinkerXmlBase
  .cs:line 104
     at ILCompiler.BodySubstitutionsParser.GetSubstitutions(TypeSystemContext, UnmanagedMemoryStream, ManifestResource, ModuleDesc, Strin
  g, IReadOnlyDictionary`2) in /_/src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/BodySubstitutionParser.cs:line 168
     at ILCompiler.FeatureSwitchManager.AssemblyFeatureInfo..ctor(EcmaModule, IReadOnlyDictionary`2) in /_/src/coreclr/tools/aot/ILCompil
  er.Compiler/Compiler/FeatureSwitchManager.cs:line 743

jkotas commented 1 year ago

I's about 4kb for the default PublishAot experience (without disabling any runtime knobs)

Yes, the benefit is not large enough to talk about this to library user, ever. The library implementation should do it transparently by default, when possible, without requiring any user intervention. And when it is not possible, just accept the extra overhead.

It leads to needing to say "do this, except for when you're in this scenario, then you have to do this other thing instead"

Yes, you do not want to be explaining things like that.

This is simple for users, its easy to understand, and it works one way everywhere

Is it really easy to understand? People have to understand when it is safe to do this. It is safe to do in some cases, but it can going to break your app or one of the libraries used by your app in other cases. You are back to explaining complicated corner cases.

For reference, this is what the fix would look like when following the BCL principles: https://github.com/terrafx/terrafx.interop.windows/pull/342 . As added benefit, it also removes static constructor triggers.

tannergooding commented 1 year ago

Yes, the benefit is not large enough to talk about this to library user, ever. The library implementation should do it transparently by default, when possible, without requiring any user intervention. And when it is not possible, just accept the extra overhead.

To me this just ends up being "just accept the overhead, always" given how small it is. Because otherwise it will regress the scenario for users if I ever end up pulling in bindings for something that does need dynamic lookup. -- There are notably things that fall into this category today even for Windows, they just haven't been exposed "yet".

Is it really easy to understand? People have to understand when it is safe to do this. It is safe to do in some cases, but it can going to break your app or one of the libraries used by your app in other cases. You are back to explaining complicated corner cases.

I believe so, yes. It ends up not-trimmed by default, so users will not experience any break or extra considerations.

If the app needs to do some extra trimming and does not need the customized logic, then they opt-in via the one line. There is then a singular docs page that describes the available config switches. In the case of DisableResolveLibraryHook it becomes a very simple explanation on "if you do not need customized library resolution logic, you can set this". The explanation of what "customized library resolution logic" does and the scenarios it may impact is simple and then consistent across all TerraFX.Interop.* libraries. -- It's just a small explanation/example of handling cases like vulkan-1.dll vs libvulkan.so.1 by default and that it also enables end users to provide their own loading rules on top.

As added benefit, it also removes static constructor triggers.

The static constructor not being removed with the other approach is a general trimmer issue that needs to be resolved. We have the same general issue for our own substitutions in corelib, including for simple things like if (IntPtr.Size == 8) leaving behind an if (true) ; or if (false) ; block which is functionally "dead code".

The trimmer "needs" to work with code patterns and practices that average developers and real world apps write. It should not require them to contort their code to make the trimmer happy (at least not for relatively simple cases like these).

jkotas commented 1 year ago

Because otherwise it will regress the scenario for users if I ever end up pulling in bindings for something that does need dynamic lookup.

Again, this is different approach for BCL vs. TerraFX. We regress size of existing scenarios all the time by adding new BCL features, it does not really bother us (unless it is very large). It does not stop us from optimizing the runtime and BCL for where things are today, even when some of these optimizations may be invalidated in future.

tannergooding commented 1 year ago

Right. We'll just have to leave it as difference of opinion.

I've at least provided a switch that works for power users, like Sergio, and which will equally work for the other TerraFX.Interop.* libraries once I get corresponding PRs up. The default user experience will be that this isn't trimmed and where the 1-4kb will likely not matter to the typical user of these libraries.

I've correspondingly logged some bugs on dotnet/linker for the things which didn't work as expected and where various dead blocks were left in the trimmed output.

MichalStrehovsky commented 1 year ago

For example, TerraFX depends on reflection through this:

I think this one is self-inflicted. There's no point in calling NativeLibrary.Load from here:

https://github.com/terrafx/terrafx.interop.windows/blob/27747db0c0fc4bcf653e3f7e07feb500ad5e6e14/sources/Interop/Windows/Windows.cs#L28

it can just return IntPtr.Zero. There's a 200 kB difference in a hello world app that calls Load vs returns IntPtr.Zero.

Returning Zero will go into a lean code path because we know what assembly we're operating on when we call the callback. If someone goes and manually calls NativeLibrary.Load, unfortunately we don't have a way to get rid of the custom attribute reading since the API design for that one fell into the "optional parameters that are expensive to calculate" trap. Such APIs should be separate overloads. Optional parameters are not trimmable.

tannergooding commented 1 year ago

There's a 200 kB difference in a hello world app that calls Load vs returns IntPtr.Zero.

Hmmm. I'm only seeing a 1kb difference for the default trimming of the TerraFX samples app (4kb for AOT).

I had also thought that returning non-zero was required, but that doesn't actually look to be the case, so maybe I'm misremembering. Will adjust since it looks to be unnecessary now anyways: https://github.com/terrafx/terrafx.interop.windows/pull/343

MichalStrehovsky commented 1 year ago

Hmmm. I'm only seeing a 1kb difference for the default trimming of the TerraFX samples app (4kb for AOT).

Is this for default settings or reflection disabled? I'm seeing the big difference for all left at defaults. Reflection disabled wouldn't make a big difference because custom attribute reading won't be present anyway (and if the nullable parameter was passed as null, it would likely throw).

tannergooding commented 1 year ago

This is just the defaults, not setting any special runtime feature switches or AOT knobs.

MichalStrehovsky commented 1 year ago

This is just the defaults, not setting any special runtime feature switches or AOT knobs.

On .NET 8?

I'm seeing:

$ dotnet new console -o foo
$ cd foo

Paste this into Program.cs:

using System;
using System.Reflection;
using System.Runtime.InteropServices;

class Program
{
    static void Main()
    {
        NativeLibrary.SetDllImportResolver(Assembly.GetExecutingAssembly(), (s, a, p) =>
        {
            return NativeLibrary.Load(s, a, p);
            //return IntPtr.Zero;
        });

        MessageBoxW(default, default, default, default);

        [DllImport("user32")]
        static extern int MessageBoxW(IntPtr a, IntPtr b, IntPtr c, int d);
    }
}

$ dotnet publish -p:PublishAot=true
$ dir C:\foo\bin\Release\net8.0\win-x64\publish\foo.exe
02/27/2023  12:52         2,002,944 foo.exe

Then uncomment/comment the Load/Zero lines:

$ dotnet publish -p:PublishAot=true
$ dir C:\foo\bin\Release\net8.0\win-x64\publish\foo.exe
02/27/2023  12:52         1,797,120 foo.exe

You probably won't need the ILLink substitution stuff anymore.

MichalStrehovsky commented 1 year ago

I double checked with .NET 7 as well following up on what @KeterSCP said, and noticed I also had a slight size regression in .NET 8, though only in reflection free mode

Pretty sure it's because we now have a generic virtual method in the closure and that brings the type loader. Previously, reflection free mode didn't include the type loader - there's an extra "optimization" related to that here:

https://github.com/dotnet/runtime/blob/2b74314b3c049f5112e526a27a8cac74f2a62d43/src/coreclr/nativeaot/BuildIntegration/Microsoft.NETCore.Native.targets#L227

without this optimization the type loader would be present no matter what. We might want to delete the optimization - it's just an artifact of the past, when reflection disabled mode also broke the type loader (https://github.com/dotnet/corert/pull/7208#issuecomment-475986552).

What this optimization does is that it breaks Comparer.Default and EqualityComparer.Default. Nobody noticed that with reflection disabled because breaking code is par for the course when reflection is disabled.

tannergooding commented 1 year ago

On .NET 8?

No, for me its on .NET 7. I've not made the switch to start supporting .NET 8 yet. I typically do that around Preview 3.

dotnet / runtime