Closed by kuzdogan 6 months ago
@kuzdogan I'd like to be part of this team, because this suits me perfectly! I am an experienced back-end developer and know a lot about databases and data processing, but I am relatively new to web3.
Thanks for the mention and attribution.
Quick note for a possible extension:
This can just as easily be applied to event signatures and the 32-byte topic0 to identify 32-byte event signatures.
Another extension idea:
One could, conceivably, separate out certain frequently occurring 'nouns' and 'verbs' in the function (and event) name dataset. For example, the function names transferTokens, transferAssets, sendTokens, and sendMoney would generate four four-byte signatures, but those four functions clearly have two verbs (transfer and send) and three nouns (Tokens, Assets, and Money). If the function names were further broken into verbs and nouns, those four function names would generate six four-bytes (2 verbs x 3 nouns), which is 50% more. There are many examples of common verbs and nouns programmers use all the time. For example, set and get are verbs that are almost always followed by nouns, as are buy, sell, burn, transfer, send, receive, etc.
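A minimal Python sketch of the verb/noun idea above. It assumes camelCase names whose verb is the leading lowercase run, which is only a crude heuristic; split_verb_noun and recombine are hypothetical helper names, not part of any existing tool.

```python
import re
from itertools import product

def split_verb_noun(name):
    """Split a camelCase name such as 'sendTokens' into ('send', 'Tokens')."""
    match = re.fullmatch(r"([a-z]+)([A-Z][A-Za-z0-9]*)", name)
    return (match.group(1), match.group(2)) if match else None

def recombine(names):
    """Cross every observed verb with every observed noun."""
    pairs = [p for p in map(split_verb_noun, names) if p is not None]
    verbs = sorted({verb for verb, _ in pairs})
    nouns = sorted({noun for _, noun in pairs})
    return [verb + noun for verb, noun in product(verbs, nouns)]

observed = ["transferTokens", "transferAssets", "sendTokens", "sendMoney"]
candidates = recombine(observed)
print(len(observed), "observed names ->", len(candidates), "candidate names")
```

The two names that appear only after recombination here are sendAssets and transferMoney; each candidate name would still have to be paired with argument types before a four-byte could be computed.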
Another possible extension:
A system could be built to help Solidity developers "come together in a natural Schelling point" where they submit their source code to a "linter of sorts" that searches for common names (or near-common names) and suggests names that will be more likely to be found. For example, if a user named a function sellingTokens, he/she could be advised that if he/she changed the name to sellTokens, various tools (including those that have yet to be written) would be more likely to be able to decode that function's data. This is along the lines of the "parameter names" suggestion mentioned above.
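One way such a "linter of sorts" might look, sketched in Python with the standard library's difflib. The list of common names and the 0.8 similarity cutoff are assumptions chosen for illustration, and suggest_name is a hypothetical helper.

```python
from difflib import get_close_matches

def suggest_name(name, common_names, cutoff=0.8):
    """Return a more common near-match for `name`, or None if there is none."""
    if name in common_names:
        return None  # already a common name, nothing to suggest
    matches = get_close_matches(name, common_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical list of frequently seen function names.
common = ["sellTokens", "buyTokens", "transfer"]
print(suggest_name("sellingTokens", common))
```

A real tool would presumably draw the common names from the signature database itself and weight suggestions by how often each name has been observed.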
I have a project for the hackathon, but work with 4-byte all the time and would love to see some tools based around it! Possible extension: Something that would be awesome on this as well would be the ability to take a public smart contract and use code to build a table of the 4-byte function ids from that contract. I would think of using something like regex based Python code and either have them input the contract address or even just paste it from Etherscan. So often I am grabbing the functions from a long contract, taking out the extras, and putting them in a keccak converter to see if I am working with the tx or trace I think I am. The "ultimate database" should definitely take precedence btw 🙂 Good luck @kuzdogan, looking forward to seeing what you build in Berlin!
@tjayrush I am not aware of the 32-byte event signature logic. The only thing I could find is https://docs.soliditylang.org/en/latest/abi-spec.html?highlight=event%20signature#encoding-of-indexed-event-parameters Also, the ABI of the Sourcify examples only mentions the event type, while the above mentions the parameter values, so it is probably another topic... Could you point me to any additional documentation about it? It's not clear to me. Thanks!
@outsider-analytics You can paste the ABI JSON into 4byte.directory/import-abi, but it does not give the calculated hashes of the submitted ABI JSON. These should be quite straightforward to generate from the ABI JSON.
Edit: Also found this: https://github.com/cleanunicorn/abi2signature It should be easy to build an interface on top of this.
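As a rough illustration of the first half of that step, the Python sketch below builds canonical signatures from an ABI JSON string. The sample ABI (including the submit function) is made up for the example, and canonical_type and signatures_from_abi are hypothetical helper names. Hashing each signature with Keccak-256 is a separate step that is not shown here.

```python
import json

def canonical_type(param):
    """Return the canonical ABI type, expanding tuples into (t1,t2,...)."""
    type_name = param["type"]
    if type_name.startswith("tuple"):
        inner = ",".join(canonical_type(c) for c in param["components"])
        return f"({inner}){type_name[len('tuple'):]}"  # keep any [] suffix
    return type_name

def signatures_from_abi(abi_json):
    """List (kind, signature) pairs for every function and event in the ABI."""
    result = []
    for entry in json.loads(abi_json):
        if entry.get("type") in ("function", "event"):
            args = ",".join(canonical_type(p) for p in entry.get("inputs", []))
            result.append((entry["type"], f"{entry['name']}({args})"))
    return result

# Made-up ABI fragment for illustration only.
sample_abi = json.dumps([
    {"type": "function", "name": "transfer", "inputs": [
        {"name": "to", "type": "address"},
        {"name": "amount", "type": "uint256"}]},
    {"type": "event", "name": "Transfer", "inputs": [
        {"name": "from", "type": "address", "indexed": True},
        {"name": "to", "type": "address", "indexed": True},
        {"name": "value", "type": "uint256", "indexed": False}]},
    {"type": "function", "name": "submit", "inputs": [
        {"name": "orders", "type": "tuple[]", "components": [
            {"name": "maker", "type": "address"},
            {"name": "amount", "type": "uint256"}]}]},
    {"type": "constructor", "inputs": []},
])

for kind, signature in signatures_from_abi(sample_abi):
    print(kind, signature)
```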
On building a table of 4-byte function ids from a contract: see TrueBlocks. chifra abis <address> does exactly what you're describing.
On the 32-byte event signature question: definitely the Ethereum Yellow Paper. That's how I figured it out.
Awesome, thank you! Super helpful :)
The plan is to extract the metadata JSON files to a build container, then transform and load them there into Redis or Postgres. The DB data is then copied to a runtime container, which contains the DB plus a back end and a front end.
@kuzdogan @tjayrush @outsider-analytics As I ran into technical problems at the hackathon, here is a "delayed" link to the initial setup: https://github.com/SireMartin/Ultimate4ByteDb Please follow the readme to get it running, and thanks for evaluating it! Permutations are not included (yet), more on that later.
Thanks @SireMartin ! Will have a look.
In the meantime leaving https://github.com/shazow/whatsabi here as it is quite related to this
@SireMartin Sorry for the delay. I was able to easily run the application and it looks good! Did you run this on the complete Sourcify repository? Because I see differences with https://github.com/ethereum-lists/4bytes/tree/master/signatures , which also runs on Sourcify (which I wasn't aware of).
Also now I notice we actually need two databases:
@kuzdogan @tjayrush I did process the complete sourcify repo (both full_match and partial_match), and realized my example only contains a subset of https://github.com/ethereum-lists/4bytes/tree/master/signatures, which uses multiple data sources:
The database I created is the result of an in-memory computation of all function/event signatures. This is possible because the number of keys is limited to the number of unique hashes of a finite set of human-written signatures. 6,410,388 function/event signatures were processed, which resulted in 214,925 4-byte selectors and a 39 MB Redis DB.
The second database of permutations has to be built incrementally by a near-endless operation, because time will be the issue. The Sourcify repo alone contains 55,343 unique function/event names and 354,425 argument sets, which will result in a nearly infinite number of permutations. Even the number of permutations produced by 50 x 50 is on the order of 10^64 (ref. https://www.calculatorsoup.com/calculators/discretemathematics/permutations.php). I think this will be a hard one, and the database should be composed of a list of carefully selected combinations of function names and argument sets.
@SireMartin I guess the word "permutation" has been misleading. What I meant was a "cross-product" between the function names and the arguments. So 55,343 x 354,425 = 19,614,942,775
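The difference between the two readings can be checked with a few lines of Python. This is only arithmetic on the figures quoted in this thread; reading "50 x 50" as the number of orderings of 50 items (50!) is an assumption about what the linked calculator was asked.

```python
import math

function_names = 55_343   # unique function/event names reported for Sourcify
argument_sets = 354_425   # unique argument sets reported for Sourcify

# Cross-product: every name paired once with every argument set.
cross_product = function_names * argument_sets
print(f"{cross_product:,}")  # 19,614,942,775

# Permutations of 50 items taken 50 at a time, i.e. 50 factorial.
orderings_of_50 = math.factorial(50)
print("order of magnitude: 10^%d" % (len(str(orderings_of_50)) - 1))
```

Roughly 2 x 10^10 candidate signatures is large but finite and enumerable, which is quite different from a count on the order of 10^64.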
I don't think this last part is correct (about the number of type signatures).
The number of unique "argument sets" can be massively smaller if one removes the argument's names. (Assuming that's not been done.)
If that has been done, then I think the only other explanation is that you're including structure definitions in the list, which are useful, but not as useful as they could be. If there are structures in the signature, this would allow you to display the function call, but you wouldn't be able to decode it, as you would need the structure definitions at the time of decoding as well as the signature, which is something the decoder almost certainly doesn't have.
The "argument sets" should be for the built-in Solidity types only (excluding variable names).
As far as "carefully selecting" goes, this is pretty much the opposite of the original intent, although one could, I suppose, sort the argument sets based on the number of times they appear and throw out those that only appear once or a few times.
My expectation is that there are only about 1,000 unique native Solidity-type only type signatures with more than a small number of appearances in the list.
The number of function names is, I also expect, very much larger than 55,000. Here I would think that you should sort by frequency and remove rarely occurring functions, as these would tend to "pollute" the results. Who cares if a function called HappensOnlyOnce resolves to the same four-byte as a function called HappensABillionTimes? The HappensOnlyOnce four-byte is noise and should appear nowhere.
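The frequency filter suggested above could be sketched in Python as follows. The min_count threshold and the sample names are assumptions for illustration, and frequent_names is a hypothetical helper.

```python
from collections import Counter

def frequent_names(observed_names, min_count=2):
    """Sort names by how often they occur and drop the rare ones."""
    counts = Counter(observed_names)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]

# Made-up observations: one entry per time a name was seen in source code.
observed = ["transfer"] * 3 + ["approve"] * 2 + ["HappensOnlyOnce"]
print(frequent_names(observed))
```

The same approach would apply to argument sets; only the cutoff would need tuning against the real data.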
Introduction
I am Kaan, the maintainer of Sourcify and an organizer at ETHBerlin.
I won't be hacking myself during the event but wanted to share this idea in case anyone finds it interesting and wants to build. Happy to help if you have questions.
Abstract
Create a really large 4-byte function selector database by permuting the existing function names and function arguments. This would make contract reverse-engineering easier and help infer the likely argument names of unverified contracts.
Idea description
Application Binary Interface (ABI) is the standard way to interact with the contracts on Ethereum. A function call to a contract contains the 4-byte function selector and the arguments/parameters of the function encoded in a specific way.
The 4-byte function selector is the first 4 bytes of the hash of the function signature, which is the canonical representation of the function without the variable names and the return value. For instance, for the function signature transfer(address,uint256), the Keccak-256 hash is a9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b and the 4-byte selector is its first 4 bytes, a9059cbb.
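The calculation can be reproduced with the Python sketch below. Python's hashlib.sha3_256 uses the NIST SHA-3 padding and gives a different digest from Ethereum's Keccak-256, so the sketch carries a compact pure-Python Keccak-256 modelled on the structure of the public compact reference implementations. It is meant for illustration only; a vetted library would be the sensible choice in practice. The names keccak256 and selector are chosen for this example.

```python
MASK64 = (1 << 64) - 1

def _rol64(value, shift):
    """Rotate a 64-bit lane left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64

def _keccak_f1600(lanes):
    """Apply the 24-round Keccak-f[1600] permutation to a 5x5 lane matrix."""
    lfsr = 1
    for _ in range(24):
        # theta: mix each column parity into the neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to its new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: add the round constant, generated by a small LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes

def keccak256(data: bytes) -> bytes:
    """Keccak-256 as used by Ethereum (rate 136 bytes, padding byte 0x01)."""
    rate = 136
    padded = bytearray(data) + bytes(rate - len(data) % rate)
    padded[len(data)] ^= 0x01
    padded[-1] ^= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), rate):
        block = padded[offset:offset + rate]
        for i in range(rate // 8):
            lanes[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        lanes = _keccak_f1600(lanes)
    return b"".join(lanes[x][0].to_bytes(8, "little") for x in range(4))

def selector(signature: str) -> str:
    """First 4 bytes of the signature hash, as 8 hex characters."""
    return keccak256(signature.encode("ascii")).hex()[:8]

print(keccak256(b"transfer(address,uint256)").hex())
print(selector("transfer(address,uint256)"))
```

For events, the full 32-byte hash of the event signature is kept rather than only the first 4 bytes; that is the topic0 value discussed in the comments.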
Ideally, a contract's source code is public and verified e.g. on Sourcify so that a user would have all the information at hand to construct a function call. This information can also be found in the ABI JSON of a contract in a structured way (which sometimes is referred to as "the ABI"). For a contract that is not verified, users can interact with it using the information in the ABI JSON.
The fact that function selectors are hashes has two problems:
That's why the community came up with function signature databases such as 4byte.directory or ethereum-lists/4bytes. Currently 4byte.directory holds close to 1M function and event signatures. If you see calldata starting with 0x3416f9d4 and it is found in the database as subtractSafely(uint256,uint256), you'd know that it is highly likely a subtraction function. However, for common function names such as transfer, people submit colliding function names (remember "At Inversebrah"?). The existing directories are simple hash tables and don't hold the number of observed occurrences of the original function signature for a function selector.
One can split the function signature into two parts: the function name transfer and the arguments address,uint256. There is a finite number of argument types and there are only so many human-readable/meaningful function names. In theory, it should be possible to generate the majority of the function selectors, if not all of them, by "cross-producting" all known function names with all known arguments. This would give us the Ultimate 4-byte Function Selector Database. The database can also include the variable names and their occurrences to show the likely names of the arguments to the user. In case of collisions, it can show what percentage of each original function signature for this selector was observed.
0xa9059cbb:
If you want to see what the likely argument names are:
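The original example output for this part is not available here. As a stand-in, the Python sketch below shows how the occurrence counts described above might be tallied. All observations and counts are invented for illustration, collidingName(bytes32) is a made-up placeholder rather than a real colliding signature, and tally and signature_shares are hypothetical helpers.

```python
from collections import Counter, defaultdict

def tally(observations):
    """Count signatures per selector and argument names per signature."""
    signatures = defaultdict(Counter)
    argument_names = defaultdict(Counter)
    for selector, signature, names in observations:
        signatures[selector][signature] += 1
        argument_names[(selector, signature)][names] += 1
    return signatures, argument_names

def signature_shares(signature_counts):
    """Percentage of observations behind each signature of one selector."""
    total = sum(signature_counts.values())
    return {sig: round(100 * n / total, 1) for sig, n in signature_counts.items()}

# Invented observations: (selector, signature, argument names).
observations = (
    [("0xa9059cbb", "transfer(address,uint256)", ("to", "amount"))] * 6
    + [("0xa9059cbb", "transfer(address,uint256)", ("recipient", "amount"))] * 3
    + [("0xa9059cbb", "collidingName(bytes32)", ("data",))] * 1
)

signatures, argument_names = tally(observations)
print(signature_shares(signatures["0xa9059cbb"]))
print(argument_names[("0xa9059cbb", "transfer(address,uint256)")].most_common())
```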
The idea initially came from @tjayrush and is already being experimentally used by the TrueBlocks chifra CLI's --find flag. He can give more information and use cases if you find this interesting.
The data sources can include, but are not limited to:
I think this would make a nice hackathon project for people looking for something that requires little blockchain programming experience and will be of real use. The tech stack is pretty flexible, but it might make sense to use PostgreSQL and a schema similar to that of 4byte.directory.
Skillset
What skills do you need, or think you might need to implement the idea?
Communication