Update symbol naming scheme to avoid long and duplicate names.

brson commented 1 year ago

This fixes two problems with symbol naming.

Name collisions

Two functions with the same module and function name, but different addresses, will have the same symbol name, and generate a link error. Example

module 0x1::foo {
  public fun a(): u32 {
    2
  }
}

module 0x2::foo {
  public fun a(): u32 {
    2
  }
}

Long symbols

rbpf supports symbol names up to 64 characters (63 + a nil byte). Our current symbol naming will easily generate symbol names that are too long.

This patch uses an encoding scheme that guarantees short and unique symbols. That scheme is described fully in the comments.

Here is an example of a symbol generated by this scheme:

0000000000000010_tests_test_vec_struct_71fWuFLGmmLpqR

It is different from the one suggested in https://github.com/solana-labs/move/issues/378#issuecomment-1728397377 for a few reasons:

Type params need to be included. Here they are just part of the hash, not part of the readable name.
I included three visual separators for readability.
I stuffed every datum into a single hash instead of multiple hashes. The main downside to this is it is not possible to perfectly identify e.g. which module a symbol is from by looking at a dedicated module hash. The upside is that it allows an arbitrary amount of data to be stashed in the one hash (like the type params) without spending bytes on separate hashes.

I also allocated 15 bytes to each of the module name and the function name. I think it is arguable that the module name is less important and often short compared to the function name, and those bytes could be reduced to add bytes elsewhere. e.g. the hash here is significantly truncated, so bytes could be added to it, but I also don't feel strongly that more bytes elsewhere will meaningfully improve this scheme.

Encoding the address and hash into all symbols ensures that all symbols are fairly long, so there could be concern about binary size. The only symbols names that will appear in the final binary though are ones that need to be relocated, which currently seems to include only public functions. If the compilation model changes in the future, e.g. using LTO to combine compilation units (or just compiling all modules as one compilation unit to begin with), we could possibly avoid those relocations, but I am not sure.

All the work needed to generate symbols here is arguably inefficient and could be cached for later lookups, but with our small workloads I am not thinking it matters. In casual testing rbf-tests takes approximately the same time to execute after this patch.

This uses blake3 for the hash because it is strong and fast, and base58 to encode the hash to valid symbol names.

It leaves alone the naming of the entrypoint symbols as I don't understand that code enough to know if and how it should change.

Fixes https://github.com/solana-labs/move/issues/303 Fixes https://github.com/solana-labs/move/issues/378

dmakarov commented 1 year ago

This looks good to me. It seems that RBPF entrypoint tests need to be updated to pass correct entrypoint function name in instruction_data.

brson commented 11 months ago

This looks good to me. It seems that RBPF entrypoint tests need to be updated to pass correct entrypoint function name in instruction_data.

This patch doesn't change the names of the entry point functions since I wasn't sure whether or how I should change them, so as of now I don't think any of the tests need to change, unless I'm misunderstanding.

brson commented 11 months ago

I've addressed the reviews and fixed the expected IRs to pass CI.

anza-xyz / move

Update symbol naming scheme to avoid long and duplicate names. #382

Name collisions

Long symbols