ethereum / sourcify

Decentralized Solidity contract source code verification service
https://sourcify.dev
MIT License

Implement a mechanism to bypass compilation for already existing compiled contracts. #1659

Open marcocastignoli opened 3 days ago

marcocastignoli commented 3 days ago

From https://github.com/ethereum/sourcify/issues/1632 we understood that most contracts share the same code, so it is possible to optimize Sourcify by skipping compilation for already existing compiled contracts.

  1. Understand when we can skip the compilation
  2. Outline the process of skipping the compilation

kuzdogan commented 3 days ago

Related #1643 #1632

marcocastignoli commented 3 days ago

Understand when we can skip the compilation

Equal compilation_settings/sources

To understand when we can skip compilation, ideally we could select from compiled_contracts filtering by compiler_settings, sources and compiler_version. The problem with this approach is that we would have to index jsonb fields, which is not optimal for several reasons (from ChatGPT):

  1. Index Size: JSON data can be quite large, and indexing a long jsonb value can significantly increase the size of your index, which can affect disk usage.
  2. Index Maintenance: Large indexes can slow down insert, update, and delete operations because the index needs to be updated whenever the jsonb data changes.

We can optimize this by first filtering by fully_qualified_name+version (after we put an index on them) and then trying to match the compilation_settings/sources (EDIT: or metadata) only against the few results returned by that filter.
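
A minimal sketch of that two-stage lookup, done in memory rather than against the real database. The row shape, the `canonicalize` helper and `findExistingCompilation` are hypothetical names for illustration only; the values are assumed to be plain JSON.

```typescript
// Hypothetical row shape; field names loosely follow the columns mentioned above.
interface CompiledContractRow {
  id: number;
  fullyQualifiedName: string;
  version: string;
  compilerSettings: unknown;
  sources: unknown;
}

// Serialize JSON-like values with sorted object keys, so that key order
// does not affect equality.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + canonicalize(obj[key]));
    return "{" + entries.join(",") + "}";
  }
  return JSON.stringify(value);
}

function findExistingCompilation(
  rows: CompiledContractRow[],
  candidate: Omit<CompiledContractRow, "id">,
): CompiledContractRow | undefined {
  // Stage 1: cheap filter, the part an index on (fully_qualified_name, version) would serve.
  const shortlist = rows.filter(
    (row) =>
      row.fullyQualifiedName === candidate.fullyQualifiedName &&
      row.version === candidate.version,
  );
  // Stage 2: expensive deep comparison, only on the shortlist.
  const settings = canonicalize(candidate.compilerSettings);
  const sources = canonicalize(candidate.sources);
  return shortlist.find(
    (row) =>
      canonicalize(row.compilerSettings) === settings &&
      canonicalize(row.sources) === sources,
  );
}
```

In a real implementation stage 1 would be a SQL query and only stage 2 would run in application code; the sketch just shows that the jsonb-like fields are compared for a handful of rows instead of being indexed.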

Equal runtime_code_hash

Another possible reason to skip compilation is that compiled_contracts.runtime_code_hash matches the onchain bytecode of the contract that is being verified. The match must be a metadata_match.

kuzdogan commented 3 days ago

Can we maybe utilize the metadata somehow? Because the "metadata hash" is somehow the fingerprint of the compilation

> Another possible reason to skip compilation is checking if the compiled_contracts.runtime_code_hash matches with the onchain bytecode from the contracts that is being verified. The match must be a metadata_match.

I'm not sure if this is straightforward. Don't we normalize the bytecodes? If so, their hashes won't match the hashes of the onchain bytecodes.

marcocastignoli commented 3 days ago

> So their hashes wont be matching with the onchain bytecodes

What if we compare the onchain bytecode with the stored onchain bytecodes (which are not normalized), going through verified_contracts? Following #1643

In other words:

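-- 'current_verification_onchain_bytecode' is a placeholder for the hash of the
-- onchain runtime bytecode of the contract currently being verified.
-- Inner joins are enough here: the where clause already discards unmatched rows.
select cc.*
from compiled_contracts cc
join verified_contracts vc on vc.compilation_id = cc.id
join contract_deployments cd on vc.deployment_id = cd.id
join contracts c on cd.contract_id = c.id
where c.runtime_code_hash = 'current_verification_onchain_bytecode'
  and vc.runtime_metadata_match = true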
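
The lookup in the query above could be sketched in application code roughly as follows. This is only an illustration: `VerifiedRow` is a hypothetical flattened view of the joined tables, and the use of a hex sha256 over the raw runtime bytecode is an assumption, not a statement about how the stored hash is actually computed.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, flattened view of the rows produced by the join above.
interface VerifiedRow {
  compilationId: number;
  onchainRuntimeCodeHash: string; // stands in for contracts.runtime_code_hash
  runtimeMetadataMatch: boolean; // stands in for verified_contracts.runtime_metadata_match
}

// Assumption: the stored hash is a hex sha256 of the raw (non-normalized) runtime bytecode.
function hashBytecode(hexBytecode: string): string {
  const bytes = Buffer.from(hexBytecode.replace(/^0x/, ""), "hex");
  return createHash("sha256").update(bytes).digest("hex");
}

// Returns the id of a compilation that can be reused, if any row has the exact
// same onchain runtime bytecode and was verified with a runtime metadata match.
function findReusableCompilationId(
  rows: VerifiedRow[],
  onchainBytecode: string,
): number | undefined {
  const hash = hashBytecode(onchainBytecode);
  const row = rows.find(
    (r) => r.onchainRuntimeCodeHash === hash && r.runtimeMetadataMatch,
  );
  return row?.compilationId;
}
```

Because the comparison is on a hash of the whole bytecode, a single differing byte means no reuse.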

kuzdogan commented 3 days ago

Oh yes, we don't normalize the onchain bytecodes, right?

This would work for contracts that have the exact same bytecode; if the immutables, libraries, etc. change, it does not work. Still a good starting point.

How about the metadata?

As a longer-term research topic, I see us going in a "bytecode similarity" search direction. Blockscout is already doing something similar: https://docs.blockscout.com/about/features/ethereum-bytecode-database-microservice#similar-contracts-search-enhancement

marcocastignoli commented 3 days ago

Another easy solution could be to add a metadata sha256 column to the sourcify_matches table. We index it and use it to decide whether to skip compilation. That's straightforward, and the column is easy to compute retroactively for existing rows.

E.g. services/server/src/server/services/VerificationService.ts

  public async verifyDeployed(
    checkedContract: CheckedContract,
    sourcifyChain: SourcifyChain,
    address: string,
    creatorTxHash?: string,
  ): Promise<Match> {

    // ...

    // Use sha256(CheckedContract.metadataRaw) to find already existing compilation output
    // in the database that was created from the same metadata
    const compilationOutput = await findCompilationOutputFromMetadataHash(checkedContract)
    if (compilationOutput) {
      // setting compilation output on the checkedContract will make the CheckedContract.recompile() return early
      checkedContract.setCompilationOutput(compilationOutput);
    }
    /* eslint-disable no-useless-catch */
    try {
      const res = await libSourcifyVerifyDeployed(
        checkedContract,
        sourcifyChain,
        address,
        foundCreatorTxHash,
      );
      // ...
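
The lookup helper used in the snippet above is not defined in this thread. A minimal in-memory sketch of the idea, with a `Map` standing in for the indexed sha256 column, could look like this. `MetadataHashCache` and `CompilationOutput` are hypothetical names, and the real helper would presumably query the database instead.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for whatever compilation output is stored.
interface CompilationOutput {
  runtimeBytecode: string;
}

// In-memory stand-in for an indexed "metadata sha256" column.
class MetadataHashCache {
  private readonly byHash = new Map<string, CompilationOutput>();

  // Hash the raw metadata.json text exactly as uploaded.
  static hashMetadata(metadataRaw: string): string {
    return createHash("sha256").update(metadataRaw, "utf8").digest("hex");
  }

  store(metadataRaw: string, output: CompilationOutput): void {
    this.byHash.set(MetadataHashCache.hashMetadata(metadataRaw), output);
  }

  // Returns a previously stored output for byte-identical metadata, if any.
  find(metadataRaw: string): CompilationOutput | undefined {
    return this.byHash.get(MetadataHashCache.hashMetadata(metadataRaw));
  }
}
```

Since the hash is taken over the raw text, two metadata files that differ only in whitespace or key order would not hit the same entry in this sketch.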

manuelwedler commented 2 days ago

Is it somehow possible that a wrong metadata hash is appended to the onchain bytecode?

marcocastignoli commented 2 days ago

> Is it somehow possible that a wrong metadata hash is appended at the onchain bytecode?

If I'm not wrong we always save the recompiled one, not the onchain one, so this should not be a problem.

manuelwedler commented 2 days ago

> > Is it somehow possible that a wrong metadata hash is appended at the onchain bytecode?

> If I'm not wrong we always save the recompiled one, not the onchain one, so this should not be a problem

Yes, but I'm talking about the metadata hash of the contract for which you want to save the compilation.

marcocastignoli commented 1 day ago

> > > Is it somehow possible that a wrong metadata hash is appended at the onchain bytecode?

> > If I'm not wrong we always save the recompiled one, not the onchain one, so this should not be a problem

> Yes, but I'm talking about the metadata hash of the contract for which you want to save the compilation.

I would not use the onchain hash; I would use the hash of the uploaded metadata.json file.

The uploaded metadata.json contains all the information used for compilation, so if a stored metadata with the same hash already exists, we can skip compilation.

manuelwedler commented 1 day ago

Okay, that makes sense. The problem with this approach is that we are moving away from requiring a metadata.json with API v2, so we could only skip compilation when we have a metadata.json. Or do you think it could be generated from the standard JSON input?