blockchain-etl / ethereum-etl

Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
https://t.me/BlockchainETL
MIT License

Error in export_geth_traces #206

Open jiangxjcn opened 4 years ago

jiangxjcn commented 4 years ago

This is my export command: python -m ethereumetl export_geth_traces --start-block 0 --end-block 200 --provider-uri http://127.0.0.1:8545 --output traces/geth_traces.json

The error is: ValueError: result is None in response {'jsonrpc': '2.0', 'id': 0, 'error': {'code': -32000, 'message': 'parent 0000000000000000000000000000000000000000000000000000000000000000 not found'}}.

I think the problem is similar to https://github.com/blockchain-etl/ethereum-etl/issues/140, but I have already set syncmode=full and gcmode=archive.

This is my geth start command: bin/geth --datadir /mnt/data/geth/ethereum --ethash.dagdir /mnt/data/geth/ethash --syncmode full --gcmode archive --rpc --rpcapi debug

jiangxjcn commented 4 years ago

And if I change the command as follows: python -m ethereumetl export_geth_traces --start-block 1 --end-block 200 --provider-uri http://127.0.0.1:8545 --output traces/geth_traces.json

The error becomes: ValueError: result is None in response {'jsonrpc': '2.0', 'id': 130, 'error': {'code': -32000, 'message': 'required historical state unavailable (reexec=128)'}}.

medvedev1088 commented 4 years ago

Maybe this https://github.com/ethereum/go-ethereum/issues/18441

jiangxjcn commented 4 years ago

It will take several days. Geth had not been upgraded before, which caused it to get stuck around block height 9.2M for a long time. The synchronization restarted on Monday after I updated the geth version.

jiangxjcn commented 4 years ago

@medvedev1088 I did not start geth with gcmode=archive before block height 9300000. Does this mean I have to synchronize from block 0 if I want the historical state needed to trace transactions? Or will geth start rebuilding the state of old blocks once it reaches the newest block height (if I restart geth with gcmode=archive at height 9300000)?

medvedev1088 commented 4 years ago

I believe you have to start it with gcmode=archive from the beginning. But it may take a couple of months and terabytes of disk for the full archival sync.

jiangxjcn commented 4 years ago

@medvedev1088 Sorry, but I want to ask why the number of records in traces is so different from the number of internal transactions on Etherscan. In Google BigQuery the traces table has around 1.7 billion rows, but Etherscan reports only around 0.12 billion internal transactions.

medvedev1088 commented 4 years ago

For some reason Etherscan doesn't include all of the traces from Parity. For example, here https://etherscan.io/vmtrace?txhash=0xcbcdc7e73d93be6a879aff31893e08b893f6c0397fb5bdb1410d430335d60f30&type=parity#raw you can see 6 calls to contract 0x58db1d7b86cbe2cd8383bf0b1cd8cc9e3c93d525, while this page https://etherscan.io/address/0x58db1d7b86cbe2cd8383bf0b1cd8cc9e3c93d525#internaltx shows only 1 internal transaction even with the Advanced toggle on.
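If you want to double-check independently of Etherscan, a query along these lines against the public dataset (a sketch, using the standard crypto_ethereum.traces columns) should list all the calls into that contract:

    -- Calls from that transaction into the contract above.
    -- Without a block_timestamp filter this scans the whole partitioned table, so it costs a full scan.
    SELECT trace_address, call_type, value
    FROM `bigquery-public-data.crypto_ethereum.traces`
    WHERE transaction_hash = '0xcbcdc7e73d93be6a879aff31893e08b893f6c0397fb5bdb1410d430335d60f30'
      AND to_address = '0x58db1d7b86cbe2cd8383bf0b1cd8cc9e3c93d525'
    ORDER BY trace_address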

jiangxjcn commented 4 years ago

Thanks for your answer, you are always so helpful. I compared these transactions and suspect that Etherscan doesn't show internal transactions with type = call and value = 0. Maybe that's still useful for some research, considering that syncing an archive node requires a big SSD and a long time.

jiangxjcn commented 4 years ago

@medvedev1088 My tutor requires me not to use data from Etherscan and to get internal transactions from a full node (without running an archive node, since we cannot get a big enough SSD in the near future). Since geth or Parity replays all the transactions when syncing in full mode, I think it should be possible to record these internal transactions and drop the historical state afterwards. I also found that Parity has a pruning_history parameter at startup; maybe I could set it a bit larger, for example 20000, so that I keep the state of the most recent 20000 blocks and can use trace_filter to get the internal transactions in those blocks. The only problem with this approach is that I would have to pause Parity's syncing whenever the old state is about to be dropped before it has been used. I prefer the second approach and would like some suggestions from you.

medvedev1088 commented 4 years ago

That's an option. You could use the pruning_history param and the ethereumetl stream command https://ethereum-etl.readthedocs.io/en/latest/commands/#stream to export the latest blocks to Pub/Sub or Postgres. But even with this approach it will be challenging: it's going to take a long time for Parity/geth to run through all the transactions (at least 1 month), and you'll have to monitor the process closely to make sure all blocks in the pruning history are exported in time.
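Roughly, the streaming setup could look like this (a sketch only; the entity types, Pub/Sub topic and provider URI are placeholders, so check the stream docs above for the exact flags and output formats):

    # Follow the chain and continuously export blocks, transactions and traces.
    # --output can point at a Google Pub/Sub topic or a Postgres connection string.
    python -m ethereumetl stream \
        --start-block 9300000 \
        -e block,transaction,trace \
        --provider-uri http://127.0.0.1:8545 \
        --output projects/<your-project>/topics/crypto_ethereum \
        --last-synced-block-file last_synced_block.txt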

jiangxjcn commented 4 years ago

@medvedev1088 It seems I have successfully exported the traces with a full Parity client. But now I have a few questions about how to store this data in Neo4j. I've seen the Ethereum storage schema you designed for Neo4j. At first I conceived a similar plan, but then I realized that relationships cannot have indexes (you have already mentioned this issue). Later I considered using nodes to represent transactions. However, on the one hand this complicates the model, and on the other hand it leads to a huge increase in the number of nodes and edges (I heard that the Neo4j Community edition supports as many as 30 billion nodes). The existing data, especially the trace data, would need around 2 billion nodes and maybe several times as many edges to represent. It looks scary. Have you considered using nodes to represent transactions? Could you give me some advice on this question?

medvedev1088 commented 4 years ago

Have you considered full-text search indexes? That's the primary workaround suggested in https://github.com/neo4j/neo4j/issues/7225. What is the use case for which you need relationship indexes?

jiangxjcn commented 4 years ago

For searching transactions within a given time range. Apart from this, internal transactions are called by external transactions (or by other internal transactions), and using nodes to represent transactions makes that relationship clear.

medvedev1088 commented 4 years ago

This makes sense. As you mentioned, the number of nodes and the required storage will go up significantly. My guess is you'd need more than 3TB and a few days for the full load. If you decide to pursue this idea, let me know how it goes.

jiangxjcn commented 4 years ago

@medvedev1088 I'll continue to compare the two storage methods and then decide; maybe I will give it a try. Another question: I am exporting traces with Parity and ethereumetl, but after comparing the data I get with ethereumetl against the data you put in Google BigQuery, I found some differences. On the one hand, I did not find trace_type = genesis or daofork in my dataset (although I know all genesis traces are in block 0 and all daofork traces are in block 1920000). On the other hand, the trace count in Google BigQuery is a bit larger than what I get (by no more than 1%). Although the gap is small, I still want to know why. After comparing the counts of the different trace types, I found the gap is in traces with trace_type = call. Could this be caused by the Parity version (mine is 2.5.13)?

medvedev1088 commented 4 years ago

On the one hand, I did not find trace_type = genesis or daofork in my dataset (although I know all genesis traces are in block 0 and all daofork traces are in block 1920000).

When you use the ethereumetl export_traces command you can specify the --genesis-traces and --daofork-traces options to include genesis and daofork traces in the respective blocks. These types of traces are generated by ethereumetl, not Parity, which is why the flags are off by default.
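For example, something like this should include them (a sketch; the block range and provider URI are just illustrative):

    # Genesis traces are written into block 0 and daofork traces into block 1920000.
    python -m ethereumetl export_traces --start-block 0 --end-block 1920000 \
        --provider-uri http://127.0.0.1:8545 \
        --genesis-traces --daofork-traces \
        --output traces.csv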

On the other hand, the trace count in Google BigQuery is a bit larger than what I get (by no more than 1%). Although the gap is small, I still want to know why. After comparing the counts of the different trace types, I found the gap is in traces with trace_type = call. Could this be caused by the Parity version (mine is 2.5.13)?

That's interesting. Could you provide a few examples? 1% difference is quite a lot.

jiangxjcn commented 4 years ago

@medvedev1088 Thanks for your answer. Fortunately, I don't think I'll need the genesis and daofork traces. (Even if they are needed, I can get them within a day.)

Using this SQL in Google BigQuery: select count(*) from `bigquery-public-data.crypto_ethereum.traces` where block_number >= 2220001 and block_number <= 2240000; I get 244592, but the number is 244418 in the file I exported with ethereumetl (using Parity 2.5.13). The gap is 174.

Using this SQL: select count(*) from `bigquery-public-data.crypto_ethereum.traces` where block_number >= 1000001 and block_number <= 1020000; I get 132395, but the number is 132308 in the file I exported with ethereumetl (using Parity 2.5.13). The gap is 87.

After adding a filter on trace_address: select count(*) from `bigquery-public-data.crypto_ethereum.traces` where block_number >= 1000001 and block_number <= 1020000 and trace_address != ""; I get 38626, but the number is 38485 in the file I exported with ethereumetl. The gap is 141.

It's very odd.

medvedev1088 commented 4 years ago

You could try uploading your CSV to a BigQuery table, then querying the trace_ids that are in the crypto_ethereum.traces table but not in your CSV, using trace_id not in (select trace_id from `your_table`). This will help pinpoint the issue. You can also filter by block_timestamp to reduce costs.
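Something along these lines, for example (assuming your upload lands in a table called your_dataset.my_traces with a trace_id column):

    -- Traces present in the public dataset but missing from the uploaded CSV.
    -- Adding a block_timestamp filter on top of the block range narrows the partition scan and cuts cost.
    SELECT t.trace_id
    FROM `bigquery-public-data.crypto_ethereum.traces` t
    WHERE t.block_number BETWEEN 1000001 AND 1020000
      AND t.trace_id NOT IN (SELECT trace_id FROM `your_dataset.my_traces`)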

jiangxjcn commented 4 years ago

OK, I will give it a try.

medvedev1088 commented 4 years ago

Thanks!

jiangxjcn commented 4 years ago

@medvedev1088 It seems that to keep using Google BigQuery I need to apply for a credit card. Since I'm still a student, that may take several weeks. I'll contact you after I find the difference.

And another question: I used the extract_contracts --traces traces.csv --output contracts command. The contracts seem to be parsed correctly (they have block_number), but during parsing many evmdasm.disassembler [ERROR] - invalid instruction: PUSH XX messages were reported. I want to know why.

jiangxjcn commented 4 years ago

@medvedev1088 Apart from the difference in traces between geth and Parity, are there any other differences between the data exported from geth and from Parity (both using ethereumetl)? To get the trace data I ran a new Parity node, but the geth client seems to take up too much space. I was wondering whether I could use ethereumetl with Parity alone to get all the data, instead of using Parity for traces and geth for everything else at the same time.

medvedev1088 commented 4 years ago

are there any other differences between the data exported from geth and from Parity (both using ethereumetl)

The only difference we found is in traces. We also parse contract data and tokens from traces, so those may differ too.
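In case it helps, both are derived downstream from the traces file, roughly like this (a sketch with placeholder file names):

    # Extract contract creations from the traces, then token metadata from the contracts.
    python -m ethereumetl extract_contracts --traces traces.csv --output contracts.csv
    python -m ethereumetl extract_tokens --contracts contracts.csv --provider-uri http://127.0.0.1:8545 --output tokens.csv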

What Parity options do you use when running the node?

jiangxjcn commented 4 years ago

I set parity with a toml file:

    [parity]
    mode = "active"
    base_path = "/mnt/parity"

    [network]
    warp = false

    [rpc]
    port = 9999
    interface = "local"
    server_threads = 2

    [footprint]
    tracing = "on"
    pruning = "fast"
    pruning_history = 30000
    cache_size = 128000

    [misc]
    logging = "sync=debug"
    log_file = "/mnt/parity.log"
    color = true

medvedev1088 commented 4 years ago

@jiangxjcn Thanks! How long did it take to sync the node with this config and how much disk space does it use?

jiangxjcn commented 4 years ago

@medvedev1088 It has taken about 15 days so far. I began syncing on 2020/4/22 with Parity 2.5.13. Now the sync height is around block 9.4M, and it's expected to take another three days to catch up to the newest block. (It seems that Parity 2.7.2 has a problem, which many users have reported, and it keeps getting stuck while syncing. Try not to use that version.)

The total disk Parity uses now is 980GB. According to my original estimate it should take less than 500GB, which is a bit odd. Usually an archive node uses around 4TB of disk and a full node around 260GB. Maybe it's because I gave Parity too much memory (128GB) and many temporary files are being stored. I will reduce the Parity memory to 32GB after it catches up to the newest block; perhaps disk usage will be more accurate then.

io-raid commented 3 years ago

I see that you need Parity to ETL full traces from the Ethereum chain. However, is it possible to use OpenEthereum 3.1.0 instead?

jiangxjcn commented 3 years ago

I see that you need Parity to ETL full traces from the Ethereum chain. However, is it possible to use OpenEthereum 3.1.0 instead?

I did not update Parity to a 3.x version. You can give it a try.

io-raid commented 3 years ago

@jiangxjcn I've heard that OpenEthereum 3.1.0 is a backport of Parity 2.5.13, so I will give it a try. Is your Parity 2.5.13 node still successfully synced? No problems with consensus or any other bugs?

jiangxjcn commented 3 years ago

@jiangxjcn I've heard that OpenEthereum 3.1.0 is a backport of Parity 2.5.13, so I will give it a try. Is your Parity 2.5.13 node still successfully synced? No problems with consensus or any other bugs?

I stopped the syncing process last month due to lack of disk space. There were no problems before I stopped it (it was synced to the newest block).