ethereum / cthaeh

A standalone application which serves the Ethereum JSON-RPC log filtering APIs
MIT License

Explore more efficient data loading. #8

Open pipermerriam opened 4 years ago

pipermerriam commented 4 years ago

What was wrong?

The ORM data model is set up with the following loose constraints.

  1. A Header has a nullable foreign key to its parent
  2. A Block must point to a header
  3. A Transaction optionally points to a block
  4. A Receipt must point to a transaction
  5. A Log must point to a receipt

Currently, to import a block we build and bulk save this entire hierarchy for a single block. Blocks are imported sequentially and cannot be imported concurrently, because the foreign key constraint requires the parent header to already exist.

However, since a Header can have a null parent and a Transaction can have a null block, we should be able to add a level of concurrency to make data loading more efficient.
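To make the constraints above concrete, here is a minimal sketch using the standard library `sqlite3` module rather than the project's actual ORM. The table and column names (`header`, `block`, `tx`, `receipt`, `log`, and their keys) are assumptions chosen for illustration and are not taken from the cthaeh schema.

```python
import sqlite3

# Hypothetical schema mirroring the five constraints listed above.
SCHEMA = """
CREATE TABLE header (
    hash TEXT PRIMARY KEY NOT NULL,
    parent_hash TEXT REFERENCES header (hash)            -- nullable
);
CREATE TABLE block (
    header_hash TEXT PRIMARY KEY NOT NULL REFERENCES header (hash)
);
CREATE TABLE tx (
    hash TEXT PRIMARY KEY NOT NULL,
    block_header_hash TEXT REFERENCES block (header_hash) -- nullable
);
CREATE TABLE receipt (
    transaction_hash TEXT PRIMARY KEY NOT NULL REFERENCES tx (hash)
);
CREATE TABLE log (
    id INTEGER PRIMARY KEY,
    receipt_hash TEXT NOT NULL REFERENCES receipt (transaction_hash)
);
"""


def make_db():
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, sql, params=()):
    """Return True if the statement succeeds, False on a constraint violation."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

In this sketch a header with a null parent and a transaction with a null block can be inserted in any order, while a header that names a parent which has not been loaded yet is rejected. That rejection is what forces the sequential import described above.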

How can it be fixed?

We should be able to adjust our pipeline such that:

  1. We load the "Transaction < Receipt < Log" sets concurrently
  2. We load the "Header < Block" data concurrently with all headers having a null parent pointer.
  3. We link the "Block < Transaction" pairs concurrently (once both sides have been loaded)
  4. We link each "Header" to its parent sequentially once all of the above have been executed.
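The four steps above could be sketched roughly as follows. This is an illustration only: it uses an in-memory `Store` guarded by a lock as a stand-in for the database, and `RawBlock`, `Store`, and all function names are hypothetical. Receipts and logs are omitted for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from threading import Lock
from typing import Dict, List, Optional


@dataclass
class RawBlock:
    """Hypothetical input record holding only the hashes needed for linking."""
    hash: str
    parent_hash: str
    transaction_hashes: List[str]


@dataclass
class Store:
    """In-memory stand-in for the header and transaction tables."""
    header_parent: Dict[str, Optional[str]] = field(default_factory=dict)
    transaction_block: Dict[str, Optional[str]] = field(default_factory=dict)
    lock: Lock = field(default_factory=Lock)


def load_transactions(store: Store, block: RawBlock) -> None:
    # Step 1: load transactions (receipts and logs omitted) with a null block.
    with store.lock:
        for tx_hash in block.transaction_hashes:
            store.transaction_block[tx_hash] = None


def load_header(store: Store, block: RawBlock) -> None:
    # Step 2: load the header with a null parent pointer.
    with store.lock:
        store.header_parent[block.hash] = None


def link_transactions(store: Store, block: RawBlock) -> None:
    # Step 3: point each transaction at its block once both sides exist.
    with store.lock:
        for tx_hash in block.transaction_hashes:
            store.transaction_block[tx_hash] = block.hash


def link_parent(store: Store, block: RawBlock) -> None:
    # Step 4: link to the parent only if the parent header has been loaded.
    if block.parent_hash in store.header_parent:
        store.header_parent[block.hash] = block.parent_hash


def import_blocks(store: Store, blocks: List[RawBlock], workers: int = 4) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Steps 1 and 2 have no ordering requirement, so they share one batch.
        loads = [pool.submit(load_transactions, store, b) for b in blocks]
        loads += [pool.submit(load_header, store, b) for b in blocks]
        for future in loads:
            future.result()  # wait, and re-raise any worker exception
        # Step 3 starts only after every load has finished.
        links = [pool.submit(link_transactions, store, b) for b in blocks]
        for future in links:
            future.result()
    # Step 4 runs sequentially, in block order.
    for block in blocks:
        link_parent(store, block)
```

Whether this ordering is actually faster against a real database depends on how the writes are batched and on lock contention, which is what the benchmarks would need to show.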

Before doing this we need some benchmarks in place to measure performance. I would suggest we benchmark against a wide range of real mainnet blocks.
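A benchmark harness for this could start out as small as the sketch below. It assumes a hypothetical `import_block` callable that imports one block by number and returns how many database rows it wrote; no such function is confirmed to exist in cthaeh under that name.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class BenchmarkResult:
    blocks: int
    rows: int
    seconds: float

    @property
    def blocks_per_second(self) -> float:
        return self.blocks / self.seconds if self.seconds > 0 else float("inf")

    @property
    def rows_per_second(self) -> float:
        return self.rows / self.seconds if self.seconds > 0 else float("inf")


def benchmark_import(
    import_block: Callable[[int], int],  # hypothetical: returns rows written
    block_numbers: Iterable[int],
) -> BenchmarkResult:
    """Time an import over the given block numbers and total the rows written."""
    blocks = 0
    rows = 0
    start = time.perf_counter()
    for number in block_numbers:
        rows += import_block(number)
        blocks += 1
    elapsed = time.perf_counter() - start
    return BenchmarkResult(blocks=blocks, rows=rows, seconds=elapsed)
```

Running it over several ranges of mainnet block numbers, for example early near-empty blocks and later busy ones, would give the spread of measurements suggested above.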

pipermerriam commented 4 years ago

As a rough baseline, I am currently importing the chain at around block 4 million. At this height I'm seeing around 4 blocks per second and 700 rows per second (a row being a single database row, counted across the full block data).