cowprotocol / dune-bridge


App Hash Fetching for Block Range #37

Closed bh2smith closed 1 year ago

bh2smith commented 1 year ago

This PR introduces a DuneFetcher and two methods for fetching App Data via Dune. The goal here is to fetch and dump unique app hashes for a block range that have not yet been seen.

The main logic is in main.py

Each time the script is run (say, as a cron job) it will:

  1. get the BlockRange(from, to) over which it should query Dune for new app hashes. block_from is read from storage as the latest_synced_block (recorded in step 3 below) and block_to is the latest block indexed by Dune in the table containing app hashes (GPv2_call_settle).
  2. query Dune for new app hashes in that block range
  3. write these hashes to a file and record BlockRange.block_to as the last synced block.
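The three steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: names like `load_last_synced_block` and the file layout are assumptions, and the two fetch functions stand in for the real DuneFetcher methods.

```python
import json
from dataclasses import dataclass
from pathlib import Path

STORAGE = Path("out")  # assumed storage directory
SYNC_FILE = STORAGE / "latest_synced_block.json"


@dataclass
class BlockRange:
    block_from: int
    block_to: int


def load_last_synced_block(default: int = 0) -> int:
    # Step 1a: block_from comes from storage (the previous run's block_to).
    if SYNC_FILE.exists():
        return json.loads(SYNC_FILE.read_text())["latest_synced_block"]
    return default


def sync(fetch_latest_dune_block, fetch_new_hashes) -> None:
    STORAGE.mkdir(exist_ok=True)
    # Step 1: build the block range to query.
    block_range = BlockRange(
        block_from=load_last_synced_block(),
        block_to=fetch_latest_dune_block(),
    )
    # Step 2: query Dune for unseen app hashes in that range.
    hashes = fetch_new_hashes(block_range)
    # Step 3: write the hashes and record block_to as the new sync point.
    out_file = STORAGE / f"app_hashes_{block_range.block_to}.json"
    out_file.write_text(json.dumps(hashes))
    SYNC_FILE.write_text(
        json.dumps({"latest_synced_block": block_range.block_to})
    )
```

On a subsequent cron run, `load_last_synced_block` picks up where the previous run left off, so only the new range is queried.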

Example

After running this once, you will see something like these two files written to storage:

[Screenshot 2022-11-16 at 11:08:51 — the two files written to storage]

Not many new app hashes come in, but when they do, you would expect to see a new app_hashes file created containing only the new entries.

Test Plan

You can run it yourself with python -m pysrc.main, but I will add a unit test demonstrating how the pieces work together.

josojo commented 1 year ago

Why have you decided to go for a block_range introduction? Could you motivate it a little bit?

What are the advantages over the run-loop proposal?

The problem I am seeing is that appData does not necessarily belong to blocks. The time delay between appData creation and tx mining could be several minutes. This is why we used to pull the newest app_data from the API and start downloading it right away. Happy to have a sync chat...

bh2smith commented 1 year ago

Thanks @josojo for your feedback and questions. We synced about this offchain, and it seemed you were content with the approach taken here. A lot was discussed that I can't capture here, so I will just briefly address the questions from above.

Why have you decided to go for a block_range introduction?

Because this is an immutable index on the content. It is uniquely determined by the time appData first appears in GPv2_call_settle. Previously we also fetched this from the orderbook, but that serves no purpose, since the affiliate data (such as trading volume) associated with appContent (i.e. the referral address) will not be available until long after the trades have been finalized on chain (Dune V1 has a ~30 minute time delay for prices, Dune V2 more than 1 hour).

What are the advantages over the run-loop proposal?

A run-loop index does not persist between cron jobs (and cron jobs don't have a run loop anyway), whereas a block number can be persisted between two distinct executions and is more meaningful than an incremental index.

The end goal here is that the API (Rust) service - the one with the run loop - will no longer be responsible for fetching or indexing app content. Instead it will rely on content fetched from the Dune API via a query by appHash, cached with expiry logic. Because of this, there will no longer be a "run loop" for the appData sync.
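The caching-with-expiry idea mentioned above could look something like the sketch below. This is purely illustrative: the class and function names are assumptions, and `fetch_from_dune` stands in for whatever query-by-appHash call the service would actually make.

```python
import time


class ExpiringCache:
    """A tiny TTL cache keyed by appHash (illustrative, not the real service)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, app_hash: str):
        entry = self._store.get(app_hash)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            # Entry expired: evict it so the caller re-fetches from Dune.
            del self._store[app_hash]
            return None
        return value

    def put(self, app_hash: str, content: object) -> None:
        self._store[app_hash] = (time.monotonic(), content)


def get_app_content(cache: ExpiringCache, app_hash: str, fetch_from_dune):
    # Serve from cache when fresh; otherwise hit the Dune API and cache it.
    cached = cache.get(app_hash)
    if cached is not None:
        return cached
    content = fetch_from_dune(app_hash)
    cache.put(app_hash, content)
    return content
```

With this shape, the Rust service needs no background sync of its own: app content is resolved lazily per appHash, and staleness is bounded by the TTL.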

bh2smith commented 1 year ago

I agree these things could be moved, but I'd like to see them in good shape before spending more time going back and forth updating the dune client. I don't believe either of these two "features" is finalized. Is it alright if we keep them here for now?

  1. QueryData: Actually, there are a few good reasons not to move this now.
    • It stores "filename" info that isn't actually relevant to the client package, since there is currently no way to upsert queries (I was just including this in case it becomes a feature later).
    • It is also not yet settled whether QueryData should contain both V1 and V2 queries (as we have used it in solver rewards).

So in essence, the QueryData doesn't really belong in the client package just yet or at least I don't believe we have landed on a finalized version of this particular data structure or its functionality.

  2. Async Handling: I am kinda waiting on this other guy to get back to me about how to use the existing AsyncDuneClient, cf. https://github.com/cowprotocol/dune-client/pull/31#issuecomment-1316045313 and https://github.com/cowprotocol/dune-client/issues/38

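For concreteness, the QueryData structure discussed in point 1 might look like the sketch below. Every field name here is an assumption about a not-yet-finalized design, not the PR's actual definition; in particular, the filepath field and the V1/V2 split are exactly the open questions raised above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryData:
    """Illustrative shape only: bundles a query with local metadata."""

    name: str
    # Local SQL file path; kept in case query upserting becomes a
    # dune-client feature, but not meaningful to the client package today.
    filepath: str
    # Whether both engine versions belong here is still unsettled.
    v1_query_id: Optional[int] = None  # Dune V1 query (solver-rewards style)
    v2_query_id: Optional[int] = None  # Dune V2 query


# Hypothetical instance with a placeholder query ID.
APP_HASHES = QueryData(
    name="App Hashes",
    filepath="app_hashes.sql",
    v2_query_id=1234,
)
```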
bh2smith commented 1 year ago

But it also comes with some assumptions: no reorgs in Dune, Dune uploads the data atomically per block, and Dune provides data for block n only after providing the info for all previous blocks.

I think it's safe to assume some of these things for now as we develop. Hopefully the code is structured in a way that would allow us to come back and fix things if our assumptions turn out not to be true (this can all be easily identified with our Query Monitoring Service).

Perhaps you could boost my questions to Dune and add some of these additional ones to our conversation with them. I will relay them to the folks who are helping with the sync.

bh2smith commented 1 year ago

Still looking for an approval on this one...