Missing creator transaction hashes after migration

manuelwedler commented 2 months ago

We found that some contracts that had a creator-tx-hash.txt in the /files/any/ endpoint shown in our old kubernetes deployment didn't have it after the migration for our new setup. The transaction_hash column in contract_deployments is simply NULL for these cases.

I investigated two examples:

0x00000000219ab540356cBB839Cbe05303d7705Fa on chain 1

The logs show that there was a creatorTxHash found but fetching the creation bytecode failed. The latter probably happened due to an unresponsive RPC.

We should still store the creatorTxHash in these cases. This is fixed by https://github.com/ethereum/sourcify/pull/1572

0x4E7095a3519A33dF3D25774c2F9D7a89eB99745D on chain 11155111

Here the log is a bit misleading. The fetching of the creatorTxHash failed actually because no creatorTx is in the jsonPayload. Therefore, lib-sourcify also didn't match with creation tx. I fixed the logging also in the above PR.

So in this case the cause was probably the Etherscan API being unresponsive.

Database problem

What still remains is that we probably have a lot of contracts after the migration for which we are missing the creation tx hashes. I queried the DB and it shows that there are 3.7M contract_deployments without tx_hash and 1.4M with the tx_hash. This seems a lot without tx hash to me. Unfortunately, we cannot know how it was before because we have never recovered the old production DB fully.

Two options two fix this problem now:

We run the migration again and maybe add some checks for the transaction hashes to make sure they are fetched and stored.
We go through the old repo, check the creator-tx-hash.txt files and add them manually to the database if not present yet.

A side note: There are a probably more problems after the migration. We should check the database more thoroughly. An unresponsive RPC could also have caused a change in the match status after the migration.

kuzdogan commented 2 months ago

If we go with the second option we are going to have a dirty DB, that is contracts will have a creatorTxHash but no creation bytecodes. I think we should re-verify them.

We'll need a script that needs to connect to the both filesystems, the current one on GCP and the legacy one running on Argo. This script needs two lists:

The contracts that don't have a creationTxHash on the DB.
The contract that have a creator-tx-hash.txt in the legacy Argo repo

Then it should take an intersection of these two lists (address+chainId as identifiers). This would give us the full list that needs modification and also the number of contracts affected. We can write this list of contracts in a new database instance, a local one etc. to mark off the ones that are verified.

Finally we send these contracts to re-verification on the migration instance (the instance container image needs to be updated) and have them re-verified. Bear in mind we need a small change in the code to trigger updating the row of that contract is it goes from a null match to a partial.

Also before all of this review #1572

marcocastignoli commented 2 months ago

I think we can optimize this process by:

creating a dedicated sourcify_sync like table (sourcify_sync_tx_hash with additional transaction_hash field) on prod database and pushing all the contracts that have a creator-tx-hash.txt from the legacy Argo repo
Do a query to compare contract_deployments.transaction_hash with sourcify_sync_tx_hash.transaction_hash for each contract finding all the contracts that don't have transaction_hash in contract_deployments but they have it in sourcify_sync_tx_hash
For each of these contracts copy creator-tx-hash.txt from argo repo to gcp repo
Update those contracts contract_deployments.transaction_hash with an update query filtering using the query above

kuzdogan commented 2 months ago

Yes we have to do 1. and 2. regardless. Let's do that. Just not sure if using the prod database is a good idea. Maybe create a user that only has read rights to the main tables but writes just to the new table.

Regardin the rest, I'm not sure if it's a good idea to accept the creatorTxHash even if we can't fetch the Transaction from this hash. That's basically what this PR is doing.

First thing is, we are not checking if this txHash is indeed the txHash that created this contract. Yes we get this from a somewhat reliable resource (Etherscan, Blockscout etc.) but I think we should (not must) check this. How we do this now is by fetching the Transaction object and checking if the 'new contract address algorithm', that is the deployer address + nonce yields the address of the contract being verified.

Second thing is, even if we decide it's ok to write the creatorTxHash without double checking, we still need the values deployer, block_number, and tx_index of the contract. These are available only if we fetch the Transaction via this hash. I don't think it would make sense if we only write the creatorTxHash but not these values.

What we can do maybe is we can fail the whole verification if we can't fetch the Transaction from the expected API like Etherscan, instead of saving this contract as just a "runtime match". The verification will fail and it needs to be submitted again. In that case (or actually regardless) we should monitor the errors in fetching Transaction's and get notified if e.g. we are hitting Etherscan API limits.

marcocastignoli commented 2 months ago

Regardin the rest, I'm not sure if it's a good idea to accept the creatorTxHash even if we can't fetch the Transaction from this hash. That's basically what this PR is doing.

I agree we should not accept creatorTxHash regardless. From my POV, we should implement a new service that goes through the database and tries to re-verify all the contracts that don't have creation information. But I'm wondering if it makes sense to first implement a more reliable way to fetch the creation bytecode, for example, by starting to index all creation bytecode.

ethereum / sourcify