Open kg opened 2 years ago
| Field | Value |
|---|---|
| Author | kg |
| Assignees | - |
| Labels | `arch-wasm`, `untriaged`, `tracking` |
| Milestone | - |
Could we store the IR form (with a hash) in a browser cache for the next startup?
cc @fanyang-mono for the first item.
Startup I/O is covered by the memory snapshot, I believe? IR caching could be too if we move the snapshot later in startup.
This issue tracks various parts of WebAssembly startup performance that need investigation and describes some potential solutions.
Current list of items from examining a small application (raytracer) as of 2024-03:
- `response.arraybuffer()` and `fetch()` both take a long time during startup. Can we make them faster? (~989ms + ~598ms)
- `response.blob().stream().getReader().read()` could be used to read responses directly into the wasm heap for slightly faster startup. It appears this only works in Firefox and Chrome, but it's worth testing.
- `strcmp` for metadata lookups, etc. Vectorizing it could help (~28ms)
- `memset` and `memcpy`; most of it is from emscripten's implementation of `mmap`, which is used by sgen to allocate heap. (~130ms)
- `-mbulk-memory` to enable a faster/smaller version of libc memory operations, which will improve on this
- `_emscripten_get_now` is very expensive, which makes `mono_time_track_xxx` very expensive. (~250ms)
- `monoeg` hash table lookups and insert operations (> 80ms)
- `mono_class_implement_interface_slow` is hot, most of this is during vtable setup (~97ms)
- `mono_class_has_variant_generic_params`, which seems cacheable
- `interp_transform_method` is approximately 2/3 of the total time
- `generate_code` is approximately 1/3
- `interp_optimize_code` is around 175ms, a rounding error in comparison
- `mono_metadata_decode_row_col_raw` ~24ms
- `mono_metadata_parse_type_internal` ~57ms
- `decode_value` and `table_locator` noise issues (no reliable number) but frequently present around ~1-2% combined
- `table_locator` does a lot of redundant work that could be hoisted out of its `mono_binary_search` caller into outer call sites https://github.com/dotnet/runtime/pull/100157
- `free` during startup; many of these could be optimized out via smart use of arenas/mempools
- `text()` (~93ms) and is overall very expensive (~371ms)
- `interp_create_method_pointer_llvmonly` (~14ms of execution time, ~13ms of which is just `utf8ToString`)
- `mono_wasm_get_assembly_exports` is heavy, at least in AOT (~180ms)
- `mono_wasm_bind_assembly_exports`
- `JSMarshalerArgument.AssertCurrentThreadContext`
- `JSProxyContext.cctor` and aot initialization
- `mono_runtime_class_init_full` for the module's generated interop initializer (#99924)
- `System.Version.TryParse`
- `GetCustomAttribute`
- (`JSExportGenerator.cs`) needs to check whether the current runtime is NET7 and do specific behavior if so. We have to do this so that nugets will work on older runtimes
- `JSFunctionBinding.BindManagedFunction`
- `string.LastIndexOf` which flows through to `CompareInfo.ctor` and hits ICU stuff (#99924)
- `SharedArrayPool`
- `lookupPath`
- `mono_runtime_install_appctx_properties`
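
For the `free` item above, here is a minimal sketch of what "arenas/mempools" could mean in practice: many short-lived startup allocations are bump-allocated out of one block and released with a single `free`. The `startup_arena` type and the `arena_*` functions are hypothetical names made up for illustration. This is not mono's actual mempool implementation, and it says nothing about which call sites would be safe to convert.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump arena, for illustration only (not mono's mempool). */
typedef struct {
    uint8_t *base;  /* start of the single backing block */
    size_t   size;  /* capacity of the block in bytes */
    size_t   used;  /* bytes handed out so far */
} startup_arena;

/* Reserve the backing block once; returns 1 on success, 0 on failure. */
static int arena_init(startup_arena *a, size_t size) {
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Bump-allocate n bytes at an 8-byte aligned offset.
 * Returns NULL when the block cannot fit the request. */
static void *arena_alloc(startup_arena *a, size_t n) {
    size_t start = (a->used + 7u) & ~(size_t)7u;
    if (start > a->size || n > a->size - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* A single free() releases every allocation made from the arena,
 * replacing what would otherwise be one free() per object. */
static void arena_destroy(startup_arena *a) {
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

The trade-off is that individual objects cannot be released early, so this only fits allocations whose lifetimes all end at the same point, such as temporaries that die once a startup phase completes.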
Profiles of startup for the blazor 8.0 samples:
- interpreted, high precision, firefox: https://profiler.firefox.com/public/4tgsv0kh4xvgrckp4w9dcqvyen8x1ftnm8df5d8/calltree/?globalTrackOrder=0&invertCallstack&thread=0&transforms=cr-combined-16-43063~cr-combined-13-43060~cr-combined-24-43071~cr-combined-15-43062~f-combined-0cjnxyb&v=10
- aot, low precision, chrome: https://profiler.firefox.com/public/kbd0e1vks074a5af67g2ntzwvwbx20mhgag5rvr/calltree/?globalTrackOrder=0w3&hiddenGlobalTracks=023&hiddenLocalTracksByPid=37252-0w4~29452-0ws~1948-0w4~0-0&invertCallstack&thread=6&timelineType=category&transforms=df-31~mf-193~mf-265~mf-191~mf-190~mf-294~df-1&v=10
Profiles of startup for @maraf's Money application on 9.0 preview 2:
- interpreted, high precision, firefox: https://share.firefox.dev/3TRtKkM
- aot, low precision, chrome: https://share.firefox.dev/3IQUT0P
Archived work items from the past: