[NEW]: The Ultimate 4-byte Function Selector Database

kuzdogan commented 2 years ago

Introduction

I am Kaan, the maintainer of Sourcify and an organizer at ETHBerlin.

I won't be hacking myself during the event but wanted to share this idea in case anyone finds it interesting and wants to build. Happy to help if you have questions.

Abstract

Create a really large 4-byte function selector database by permuting the existing function names and function arguments. This would make contract reverse-engineering easier and would help infer the likely argument names of unverified contracts.

Idea description

Application Binary Interface (ABI) is the standard way to interact with the contracts on Ethereum. A function call to a contract contains the 4-byte function selector and the arguments/parameters of the function encoded in a specific way.

The 4-byte function selector is the first 4 bytes of the Keccak-256 hash of the function signature, which is the canonical representation of the function without the variable names and the return value. For instance, for the function

transfer(address _to, uint256 _value) returns bool success

function signature: transfer(address,uint256)
hash of the function signature: a9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b
function selector: a9059cbb
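
The derivation above can be reproduced with a short script. This is a minimal sketch, not a production hasher: Python's standard `hashlib.sha3_256` uses different padding from Ethereum's Keccak-256, so the sketch carries its own small pure-Python Keccak. The helper names (`keccak256`, `selector`) are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def _rol64(value, shift):
    """Rotate a 64-bit lane left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64

def _keccak_f1600(lanes):
    """Apply the 24-round Keccak-f[1600] permutation to a 5x5 lane array."""
    lfsr = 1
    for _ in range(24):
        # theta
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota (round constants generated by an LFSR)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes

def keccak256(data: bytes) -> bytes:
    """Keccak-256 as used by Ethereum (domain byte 0x01, not SHA-3's 0x06)."""
    rate = 136  # bytes absorbed per permutation call
    padded = bytearray(data) + bytes(rate - len(data) % rate)
    padded[len(data)] ^= 0x01
    padded[-1] ^= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), rate):
        block = padded[offset:offset + rate]
        for i in range(rate // 8):
            lanes[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        lanes = _keccak_f1600(lanes)
    return b"".join(lanes[i % 5][i // 5].to_bytes(8, "little") for i in range(4))

def selector(signature: str) -> str:
    """Return the 4-byte selector of a canonical signature as hex."""
    return keccak256(signature.encode("ascii"))[:4].hex()

print(selector("transfer(address,uint256)"))
```

The same `keccak256` output, taken in full rather than truncated to 4 bytes, is what the hash line above shows.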

Ideally, a contract's source code is public and verified e.g. on Sourcify so that a user would have all the information at hand to construct a function call. This information can also be found in the ABI JSON of a contract in a structured way (which sometimes is referred to as "the ABI"). For a contract that is not verified, users can interact with it using the information in the ABI JSON.

The fact that function selectors are hashes has two problems:

It is not possible to know the function name and variables from the selector. It's just an identifier.
There are collisions.

That's why the community came up with function signature databases such as 4byte.directory or ethereum-lists/4bytes. Currently 4byte.directory holds close to 1M function and event signatures. If you see calldata starting with 0x3416f9d4 and it is found in the database as subtractSafely(uint256,uint256), you'd know that it is very likely a subtraction function. However, for common function names such as transfer, people submit colliding function names (remember "At Inversebrah"?). The existing directories are simple hash tables and don't record how often each original function signature was observed for a given function selector.

One can split the function selector into two parts: The function name transfer and the arguments address,uint256. There is a finite number of argument types and there are only so many human-readable/meaningful function names. In theory, it should be possible to generate the majority of the function selectors, if not all of them, by "cross-producting" all known function names with all known arguments. This would give us the Ultimate 4-byte Function Selector Database. The database can also include the variable names and their occurrences to show the likely names of the arguments to the user. In case of collisions, it can show what percentage of each original function signature for this selector was observed.

For instance, if you look up the function selector `0xa9059cbb`:

Signature	Occurrence	Percentage
transfer(address,uint256)	427401	99.5%
transfer(bytes4[9],bytes5[6],int48[11])	321	0.4997%
many_msg_babbage(bytes1)	1	0.0001%
join_tg_invmru_haha_fd06787(address,bool)	1	0.0001%
func_2093253501(bytes)	1	0.0001%
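
A lookup like the one above boils down to counting how often each signature was seen for a selector and turning the counts into shares. The following is a rough sketch of that step; `signature_shares` is a hypothetical helper and the observation counts are invented for the example, not taken from any real dataset.

```python
from collections import Counter

def signature_shares(observed):
    """Return (signature, count, percent) rows, most frequent first."""
    counts = Counter(observed)
    total = sum(counts.values())
    return [(sig, n, 100.0 * n / total) for sig, n in counts.most_common()]

# Invented observations for a single selector (assumption for illustration)
observed = ["transfer(address,uint256)"] * 8 + ["many_msg_babbage(bytes1)"] * 2

rows = signature_shares(observed)
for sig, n, pct in rows:
    print(f"{sig}\t{n}\t{pct:.1f}%")
```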

If you want to see what are likely argument names:

{
  "transfer(address,uint256)": {
    "address": {
      "_to": {
        "occurrence": 875924,
        "likelihood": 0.72
      },
      "to": {
        "occurrence": 52242,
        "likelihood": 0.20
      },
      "receiver": {
        "occurrence": 542,
        "likelihood": 0.05
      },
      "xxx": {
        "occurrence": 1,
        "likelihood": 0.00001
      }
    },
    "uint256": {
      "_value": {
        "occurrence": 756954,
        "likelihood": 0.37
      },
      "value": {
        "occurrence": 716954,
        "likelihood": 0.32
      },
      "amount": {
        "occurrence": 666954,
        "likelihood": 0.29
      },
      "xxx": {
        "occurrence": 1,
        "likelihood": 0.00001
      }
    }
  }
}
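
One possible way to produce such a structure is to walk over ABI JSON entries and tally the names seen for each argument. The sketch below is an assumption about how this could be done, not an existing tool. It keys arguments by position rather than by type (so two arguments of the same type stay distinct) and leaves tuple types unexpanded; `tally_argument_names` and `likelihoods` are hypothetical names.

```python
from collections import Counter, defaultdict

def tally_argument_names(abi_entries):
    """Map signature -> argument position -> Counter of observed names."""
    tallies = defaultdict(lambda: defaultdict(Counter))
    for entry in abi_entries:
        if entry.get("type") != "function":
            continue
        types = ",".join(arg["type"] for arg in entry["inputs"])
        signature = f"{entry['name']}({types})"
        for position, arg in enumerate(entry["inputs"]):
            tallies[signature][position][arg["name"]] += 1
    return tallies

def likelihoods(counter):
    """Turn raw name counts into relative frequencies."""
    total = sum(counter.values())
    return {name: count / total for name, count in counter.items()}

def transfer_entry(first, second):
    """Build a minimal ABI entry for transfer with the given argument names."""
    return {"type": "function", "name": "transfer",
            "inputs": [{"name": first, "type": "address"},
                       {"name": second, "type": "uint256"}]}

# Four made-up contracts that all declare transfer(address,uint256)
entries = [transfer_entry("_to", "_value"), transfer_entry("_to", "amount"),
           transfer_entry("_to", "_value"), transfer_entry("to", "value")]

tallies = tally_argument_names(entries)
first_arg = likelihoods(tallies["transfer(address,uint256)"][0])
print(first_arg)
```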

The idea initially came from @tjayrush and is already being experimentally used by the TrueBlocks chifra CLI's --find flag. He can give more information and use cases if you find this interesting.

The data sources can include, but are not limited to:

Sourcify: Sourcify holds a public repository of verified contracts. As of writing, there are 308,122 verified contracts on the Ethereum networks, and more on other supported EVM networks. Contract ABI JSONs already exist in the metadata.json of each contract. (ABI JSONs, function names, function arguments)
4byte.directory (function names, function arguments)
Etherscan? I'm not sure if Etherscan has a 4byte directory or if it's feasible to get all verified contract ABI JSONs.

I think this would make a nice hackathon project for people who don't have much blockchain programming experience and want to build something that will be of real use. The tech stack is pretty flexible, but it might make sense to use PostgreSQL and a schema similar to 4byte.directory's.
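
As a starting point for the storage layer, here is a minimal sketch of a table that keeps an occurrence count per signature. It uses SQLite from the Python standard library purely as a stand-in for PostgreSQL, and the table and column names are assumptions, not the actual 4byte.directory schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE signature (
        selector       TEXT NOT NULL,         -- e.g. 'a9059cbb'
        text_signature TEXT NOT NULL UNIQUE,  -- e.g. 'transfer(address,uint256)'
        occurrences    INTEGER NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_signature_selector ON signature (selector)")

def record(conn, selector, text_signature, count=1):
    """Add `count` observations of a signature, inserting the row if needed."""
    cur = conn.execute(
        "UPDATE signature SET occurrences = occurrences + ? WHERE text_signature = ?",
        (count, text_signature),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO signature (selector, text_signature, occurrences) VALUES (?, ?, ?)",
            (selector, text_signature, count),
        )

def lookup(conn, selector):
    """Return all signatures for a selector, most frequently observed first."""
    return conn.execute(
        "SELECT text_signature, occurrences FROM signature "
        "WHERE selector = ? ORDER BY occurrences DESC",
        (selector,),
    ).fetchall()

record(conn, "a9059cbb", "transfer(address,uint256)")
record(conn, "a9059cbb", "many_msg_babbage(bytes1)")
record(conn, "a9059cbb", "transfer(address,uint256)")
print(lookup(conn, "a9059cbb"))
```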

Skillset

What skills do you need, or think you might need to implement the idea?

[ ] Core Blockchain Development (Go, C++, Rust)
[X] Smart Contracts (Solidity, Vyper): Basic Solidity
[ ] DApps, Web3.JS (JavaScript, CSS)
[ ] Blockchain Operations, Infrastructure
[ ] Game theory, Crypto-Economics
[ ] Design, User Experience
[ ] Project Management
[X] Other: General coding experience, databases

Communication

[ ] Matrix handle: @kuzdogan:matrix.org

SireMartin commented 2 years ago

@kuzdogan I'd like to be part of this team, because this suits me perfectly! I am an experienced back-end developer and know a lot about databases and data processing, but relatively new to web3.

tjayrush commented 2 years ago

Thanks for the mention and attribution.

Quick note for a possible extension:

this can just as easily be applied to event signatures and 32-byte topic0 to identify 32-byte event signatures.

Another extension idea:

One could, conceivably, separate out certain frequently occurring 'nouns' and 'verbs' in the function (and event) name dataset. For example, the function names transferTokens, transferAssets and sendTokens, sendMoney would generate four four-byte signatures, but those four functions clearly have two verbs (transfer and send) and three nouns (Tokens, Assets, and Money). If the function names were further broken into verbs and nouns, those four function names would generate 2 x 3 = 6 four-bytes, which is 50% more. There are many examples of common verbs and nouns programmers use all the time. set and get, for example, are verbs that are almost always followed by nouns, as are buy, sell, burn, transfer, send, receive, etc.
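
The verb/noun idea can be sketched in a few lines. This assumes names are simple camelCase with a lowercase verb followed by a capitalised noun, which is a simplification; `split_verb_noun` is a hypothetical helper.

```python
import re
from itertools import product

def split_verb_noun(name):
    """Split at the first capital letter: 'sendMoney' -> ('send', 'Money')."""
    match = re.fullmatch(r"([a-z]+)([A-Z]\w*)", name)
    return match.groups() if match else None

names = ["transferTokens", "transferAssets", "sendTokens", "sendMoney"]
pairs = [pair for pair in map(split_verb_noun, names) if pair]
verbs = sorted({verb for verb, _ in pairs})
nouns = sorted({noun for _, noun in pairs})

# Recombine every verb with every noun
candidates = [verb + noun for verb, noun in product(verbs, nouns)]
new_names = sorted(set(candidates) - set(names))
print(candidates)
print(new_names)
```

With the four example names, the recombination yields six candidates, two of which (sendAssets and transferMoney) were not in the input.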

Another possible extension:

A system could be built to help Solidity developers "come together in a natural Schelling point" where they submit their source code to a "linter of sorts" that searches for common names (or near-common names) and suggests names that will be more likely to be found. For example, if a user named a function sellingTokens, they could be advised that if they changed the name to sellTokens, various tools (including those that have yet to be written) would be more likely to be able to decode that function's data. This is along the lines of the "parameter names" suggestion mentioned above.
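
A very rough sketch of such a linter check, using fuzzy string matching from the Python standard library. The list of "common" names and the 0.8 similarity cutoff are assumptions chosen for the example; a real tool would draw the names from the database and tune the threshold.

```python
import difflib

def suggest_name(name, common_names, cutoff=0.8):
    """Return a similar, more common name, or None if there is nothing to suggest."""
    if name in common_names:
        return None  # already a common name
    matches = difflib.get_close_matches(name, common_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical list of frequently seen function names
common_names = ["sellTokens", "buyTokens", "transfer"]

print(suggest_name("sellingTokens", common_names))
```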

outsider-analytics commented 2 years ago

I have a project for the hackathon, but work with 4-byte all the time and would love to see some tools based around it! Possible extension: Something that would be awesome on this as well would be the ability to take a public smart contract and use code to build a table of the 4-byte function ids from that contract. I would think of using something like regex based Python code and either have them input the contract address or even just paste it from Etherscan. So often I am grabbing the functions from a long contract, taking out the extras, and putting them in a keccak converter to see if I am working with the tx or trace I think I am. The "ultimate database" should definitely take precedence btw 🙂 Good luck @kuzdogan, looking forward to seeing what you build in Berlin!

SireMartin commented 2 years ago

@tjayrush I am not aware of the 32-byte event signature logic. The only thing I could find is https://docs.soliditylang.org/en/latest/abi-spec.html?highlight=event%20signature#encoding-of-indexed-event-parameters Also, the ABI of the Sourcify examples only mentions the event type, while the above mentions the parameter values, so that is probably another topic... Could you point me to any additional documentation about it? It's not clear to me. Thanks!

kuzdogan commented 2 years ago

@outsider-analytics You can paste the ABI JSON to 4byte.directory/import-abi but it does not give the calculated hashes of the submitted ABI JSON. This should be quite straightforward to generate from the ABI JSON.

Edit: Also found this: https://github.com/cleanunicorn/abi2signature Should be easy to build an interface on top of this
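
For reference, the first half of that job (going from an ABI JSON to canonical signatures, which can then be hashed with Keccak-256) might look like the sketch below. It is independent of the tools linked above; `canonical_type` and `signatures_from_abi` are hypothetical names, and the sample ABI is invented.

```python
import json

def canonical_type(param):
    """Expand tuple types into their parenthesised component list."""
    type_name = param["type"]
    if type_name.startswith("tuple"):
        inner = ",".join(canonical_type(c) for c in param["components"])
        return f"({inner}){type_name[len('tuple'):]}"  # keep any [] suffix
    return type_name

def signatures_from_abi(abi_json):
    """Return canonical signatures for all functions and events in an ABI JSON string."""
    result = []
    for entry in json.loads(abi_json):
        if entry.get("type") in ("function", "event"):
            args = ",".join(canonical_type(p) for p in entry.get("inputs", []))
            result.append(f"{entry['name']}({args})")
    return result

# Invented sample ABI (assumption for illustration)
sample_abi = json.dumps([
    {"type": "constructor", "inputs": []},
    {"type": "function", "name": "transfer",
     "inputs": [{"name": "_to", "type": "address"}, {"name": "_value", "type": "uint256"}]},
    {"type": "event", "name": "Transfer",
     "inputs": [{"name": "from", "type": "address"}, {"name": "to", "type": "address"},
                {"name": "value", "type": "uint256"}]},
    {"type": "function", "name": "submit",
     "inputs": [{"name": "orders", "type": "tuple[]",
                 "components": [{"name": "maker", "type": "address"},
                                {"name": "amount", "type": "uint256"}]}]},
])

signatures = signatures_from_abi(sample_abi)
print(signatures)
```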

tjayrush commented 2 years ago

> I have a project for the hackathon, but work with 4-byte all the time and would love to see some tools based around it! Possible extension: Something that would be awesome on this as well would be the ability to take a public smart contract and use code to build a table of the 4-byte function ids from that contract. I would think of using something like regex based Python code and either have them input the contract address or even just paste it from Etherscan. So often I am grabbing the functions from a long contract, taking out the extras, and putting them in a keccak converter to see if I am working with the tx or trace I think I am. The "ultimate database" should definitely take precedence btw 🙂 Good luck @kuzdogan, looking forward to seeing what you build in Berlin!

See TrueBlocks -- chifra abis <address> does exactly what you're describing.

tjayrush commented 2 years ago

> @tjayrush I am not aware of 32 byte event signatures logic. The only thing i could find is https://docs.soliditylang.org/en/latest/abi-spec.html?highlight=event%20signature#encoding-of-indexed-event-parameters Also the ABI of the sourcify examples only mentions the event type and the above mentions the parameter values, so probably another topic... Could you point me to any additional document about it because it's not clear to me. Thanks!

Definitely the Ethereum Yellow paper. That's how I figured it out.

outsider-analytics commented 2 years ago

> > I have a project for the hackathon, but work with 4-byte all the time and would love to see some tools based around it! Possible extension: Something that would be awesome on this as well would be the ability to take a public smart contract and use code to build a table of the 4-byte function ids from that contract. I would think of using something like regex based Python code and either have them input the contract address or even just paste it from Etherscan. So often I am grabbing the functions from a long contract, taking out the extras, and putting them in a keccak converter to see if I am working with the tx or trace I think I am. The "ultimate database" should definitely take precedence btw 🙂 Good luck @kuzdogan, looking forward to seeing what you build in Berlin!
>
> See TrueBlocks -- chifra abis <address> does exactly what you're describing.

Awesome, thank you! Super helpful :)

SireMartin commented 2 years ago

The plan is to extract the metadata JSON files into a build container, then transform and load them there into Redis or PostgreSQL. The db data is then copied to a runtime container, which contains a db plus a back end and front end.

SireMartin commented 2 years ago

@kuzdogan @tjayrush @outsider-analytics As I ran into technical problems at the hackathon, here is a "delayed" link to the initial setup: https://github.com/SireMartin/Ultimate4ByteDb Please follow the readme to get it running, and thanks for evaluating it! Permutations are not included (yet), more on that later.

kuzdogan commented 2 years ago

Thanks @SireMartin ! Will have a look.

In the meantime leaving https://github.com/shazow/whatsabi here as it is quite related to this

kuzdogan commented 2 years ago

@SireMartin Sorry for the delay. I was able to easily run the application and it looks good! Did you run this on the complete Sourcify repository? Because I see differences with https://github.com/ethereum-lists/4bytes/tree/master/signatures , which also runs on Sourcify (which I wasn't aware of).

Also now I notice we actually need two databases:

The one that is built by @SireMartin with occurrences and likelihoods
The generated "permutation db" without the occurrences and likelihoods. Because this db will be generated, there won't be contracts to count occurrences from.

SireMartin commented 2 years ago

@kuzdogan @tjayrush I did process the complete Sourcify repo (both full_match and partial_match), and realized my example only contains a subset of https://github.com/ethereum-lists/4bytes/tree/master/signatures, which uses multiple data sources:

etherscan
sourcify
eveem
signatureninja
online 4bytedir
soliditydirectory

Not all of those data sources provide the contract ABI JSON, which provides the argument names needed to determine the likelihood. But I will contact the author to request access to the data sources which do.

The database I created is the result of an in-memory computation of all function/event signatures. This is possible because the number of keys is limited to the number of unique hashes of a finite set of human-written signatures. 6,410,388 function/event signatures were processed, which resulted in 214,925 4-byte selectors and a 39 MB Redis DB.

The second database of permutations has to be built incrementally by a near-endless operation, because time will be the issue. The Sourcify repo alone contains 55,343 unique function/event names and 354,425 argument sets, which will result in a nearly infinite number of permutations. Even the number of permutations for n = 50, r = 50 is on the order of 10^64 (ref. https://www.calculatorsoup.com/calculators/discretemathematics/permutations.php). I think this will be a hard one, and the database should be composed from a list of carefully selected combinations of function names and argument sets.

kuzdogan commented 2 years ago

@SireMartin I guess the word "permutation" has been misleading. What I meant was a "cross-product" between the function names and the arguments. So 55,343 x 354,425 = 19,614,942,775
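
To make the difference concrete, here is a small sketch of the cross product, together with the two magnitudes mentioned in the thread. The generator is lazy, so the roughly 19.6 billion candidates would never need to sit in memory at once; `candidate_signatures` and the tiny input lists are made up for illustration.

```python
import math
from itertools import product

def candidate_signatures(names, argument_sets):
    """Lazily yield every name(argument_set) pairing."""
    for name, args in product(names, argument_sets):
        yield f"{name}({args})"

names = ["transfer", "approve"]
argument_sets = ["address,uint256", "uint256", ""]
candidates = list(candidate_signatures(names, argument_sets))
print(candidates)

# Size of the full cross product for the Sourcify numbers quoted above
cross_product_size = 55_343 * 354_425
print(f"{cross_product_size:,}")

# Number of orderings of 50 items, the figure behind the 10^64 estimate
print(f"{math.factorial(50):.2e}")
```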

tjayrush commented 2 years ago

I don't think this last part is correct (about the number of type signatures).

The number of unique "argument sets" can be massively smaller if one removes the argument's names. (Assuming that's not been done.)

If that has been done, then I think the only other explanation is that you're including structure definitions in the list, which is useful but not as much as they could be. If there are structures in the signature, this would allow you to display the function call, but you wouldn't be able to decode it, as you would need the structure definitions at the time of decoding as well as the signature -- something the decoder almost certainly doesn't have.

The "argument sets" should be for the built-in Solidity types only (excluding variable names).
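
Stripping the names could look roughly like this. The sketch only handles elementary types and arrays of them, and only expands the `uint` and `int` aliases; structs, function types, and other aliases are deliberately out of scope. `canonical_argument_set` is a hypothetical helper.

```python
import re

# Aliases that Solidity expands in canonical signatures
ALIASES = {"uint": "uint256", "int": "int256"}

def canonical_argument_set(declaration):
    """'address _to, uint amount' -> 'address,uint256' (elementary types only)."""
    types = []
    for part in declaration.split(","):
        part = part.strip()
        if not part:
            continue
        type_name = part.split()[0]  # the first token is the type
        match = re.fullmatch(r"([A-Za-z0-9]+)((?:\[\d*\])*)", type_name)
        if match is None:
            raise ValueError(f"unsupported type: {type_name!r}")
        base, suffix = match.groups()
        types.append(ALIASES.get(base, base) + suffix)
    return ",".join(types)

print(canonical_argument_set("address _to, uint256 _value"))
print(canonical_argument_set("uint[] memory amounts, address payable to"))
```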

As far as "carefully selecting" this is pretty much the opposite of the original intent, although, one could, I suppose, sort the argument sets based on the number of times they appear and throw out those that only appear once or a few times.

My expectation is that there are only about 1,000 unique native Solidity-type only type signatures with more than a small number of appearances in the list.

The number of function names is, I also expect, very much larger than 55,000. Here I would think that you should sort by frequency and remove rarely occurring functions as these would tend to "pollute" the results. Who cares if a function called HappensOnlyOnce resolves to the same four-byte as a function called HappensABillionTimes. The HappensOnce four-byte is noise and should appear nowhere.
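
The frequency cut-off described here is simple to sketch. The threshold of 2 is an arbitrary assumption, and `frequent_names` is a hypothetical helper.

```python
from collections import Counter

def frequent_names(observed_names, min_count=2):
    """Keep names seen at least `min_count` times, most frequent first."""
    counts = Counter(observed_names)
    return [name for name, n in counts.most_common() if n >= min_count]

# Made-up observations
observed = ["HappensABillionTimes"] * 5 + ["HappensOnlyOnce"]
kept = frequent_names(observed)
print(kept)
```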
