medvedev1088 opened this issue:
Another optimisation idea is to separate parsing into 2 steps:

1. Filter the logs into a `filtered_logs` table: `select * from logs where contract_address in (<all_contract_addresses_from_table_definitions>)`.
2. Run the per-event parse queries against `filtered_logs` instead of the full `logs` table (see the sketch below).
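A rough sketch of step 2, assuming the public `bigquery-public-data.crypto_ethereum.logs` column layout; the project/dataset names and the event topic hash (ERC-20 Transfer) are only placeholders:

```sql
-- Step 2 sketch: parse a single event type from the pre-filtered table instead
-- of scanning the full logs table. Column names follow the public
-- bigquery-public-data.crypto_ethereum.logs schema; the project/dataset names
-- and the topic hash are illustrative.
SELECT
  block_timestamp,
  block_number,
  transaction_hash,
  log_index,
  address AS contract_address,
  topics,
  data
FROM `my_project.my_dataset.filtered_logs`
WHERE topics[SAFE_OFFSET(0)] = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef'  -- ERC-20 Transfer
```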
We are going to encounter this on other chains that support smart contracts in general, and on multiple networks that support EVM-compatible contracts specifically. Here are a few examples:
I propose that we abstract this pattern above Ethereum.
Related issues, with emphasis on stream processing:

- https://github.com/blockchain-etl/ethereum-etl/issues/216
- https://github.com/blockchain-etl/ethereum-etl/issues/215

Another related issue, for indexing: https://github.com/blockchain-etl/ethereum-etl-airflow/issues/28
Here the proposal is to cluster/partition the results in BQ to reduce IO.
I suggest we take this a step further to generally support all contracts for which we have parsing capability (i.e. those for which we have an ABI).
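As a sketch of what that clustered/partitioned intermediate table could look like (project/dataset names and the address list are placeholders, not part of the proposal):

```sql
-- Sketch: materialize the filtered logs into a table partitioned by day and
-- clustered by contract address, so per-event parse queries prune both
-- partitions and clusters. Names and addresses are placeholders.
CREATE TABLE `my_project.my_dataset.filtered_logs`
PARTITION BY DATE(block_timestamp)
CLUSTER BY address
AS
SELECT *
FROM `bigquery-public-data.crypto_ethereum.logs`
WHERE address IN (
  '0x...',  -- every contract_address that appears in a table definition / ABI
  '0x...'
);
```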
Specifically, at transaction processing time in the stream:

- publish to sharded topics such as `contract_events_0x00` / `topic_events_0x00`, or to a single `all_events` topic with an additional, possibly composite, key composed of (contract byte and/or event byte);
- streaming insert into a new `events` table using `all_events`. This table contains a JSON blob field that is not further optimized for storage, but for which we can leverage BQ native JSON parsing functions to select data of interest with JSONPath (see: How to parse JSON in Standard SQL BigQuery?);
- derive optimized tables/views from `events` as needed, with partitioning/clustering configured via the `OPTIONS` clause.

The overall effect of this design:
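To illustrate the JSON-blob and `OPTIONS` steps above, here is a rough sketch; the `events` schema, JSON paths, and table names are assumptions for illustration, not an agreed design:

```sql
-- Sketch: derive an optimized, queryable table from the generic `events` table
-- by extracting fields from its JSON blob with BigQuery's native JSON functions.
-- The events schema, JSON paths and names below are hypothetical.
CREATE OR REPLACE TABLE `my_project.my_dataset.transfer_events`
PARTITION BY DATE(block_timestamp)
CLUSTER BY contract_address
OPTIONS (description = 'Transfer events derived from the raw events table')
AS
SELECT
  block_timestamp,
  contract_address,
  JSON_EXTRACT_SCALAR(event_json, '$.from')  AS from_address,
  JSON_EXTRACT_SCALAR(event_json, '$.to')    AS to_address,
  JSON_EXTRACT_SCALAR(event_json, '$.value') AS value
FROM `my_project.my_dataset.events`
WHERE event_name = 'Transfer';
```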
Regarding this:

> streaming insert into a new `events` table using `all_events`. This table contains a JSON blob field that is not further optimized for storage, but for which we can leverage BQ native JSON parsing functions to select data of interest with JSONPath, see: How to parse JSON in Standard SQL BigQuery?

we need to think about how we process historical data for newly added ABIs, so that this data appears in the `events` table in BQ. E.g. have a separate process to reprocess past partitions whenever a new ABI is added to the ABI repository.
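One possible shape for that backfill, sketched against the public dataset; the address and cut-off date are placeholders:

```sql
-- Sketch: when a new ABI is added to the repository, backfill only the historical
-- partitions for that contract and feed the matching logs back through parsing.
-- The address and cut-off date are placeholders.
SELECT block_timestamp, block_number, transaction_hash, log_index, address, topics, data
FROM `bigquery-public-data.crypto_ethereum.logs`
WHERE address = '0x...'                      -- contract from the newly added ABI
  AND DATE(block_timestamp) < '2020-05-21';  -- partitions written before the ABI existed
```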
@sinahab @epheph @jieyilong FYI we are designing a generic solution for ABI event generation in this issue and also blockchain-etl/ethereum-etl#216
Awesome, will take a look. Thanks!
-- Jieyi Long, CTO, Theta Labs, Inc (https://www.ThetaLabs.org)
Disclosure: I'm very new to this project... so pardon me if I missed or misunderstood things :)
Here are my thoughts, in no particular order, on the proposal by @allenday above. I focused more heavily on the use case I'm trying to solve for Origin Protocol since that's the one I'm the most familiar with.
Right now we have ~500 events that we parse, and every day ~500 MB of log data is generated. Each parse query scans the full daily logs, so ~250 GB is processed in BigQuery daily, ~7.5 TB per month, which at BigQuery's $5/TB on-demand price comes to ~$37 per month.
With 1500 events we'll spend roughly three times that, ~$110 per month.
An alternative to parsing logs in BigQuery is to export the logs as JSON files to GCS, download them locally in Airflow, filter all events in a dataset at once, and then load the results into BigQuery (loading is free). There is a PoC for how to parse logs in Python here: https://github.com/blockchain-etl/ethereum-etl/commit/6710e6b894789d83ce3c28db79b4ec5bf2ec3845.
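For reference, the export step could be done with something like BigQuery's `EXPORT DATA` statement (a sketch; the bucket, dataset, and date are placeholders). The export query still scans the logs table once, but all events are then parsed from the same export instead of one scan per event table:

```sql
-- Sketch of the export step: dump one day of raw logs to GCS as newline-delimited
-- JSON, to be parsed by an Airflow task and loaded back via a (free) load job.
-- Bucket name, dataset and date are placeholders.
EXPORT DATA OPTIONS (
  uri = 'gs://my-etl-bucket/logs/2020-05-20/*.json',
  format = 'JSON',
  overwrite = true
) AS
SELECT *
FROM `bigquery-public-data.crypto_ethereum.logs`
WHERE DATE(block_timestamp) = '2020-05-20';
```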