0xB10C / project-ideas

project ideas I might work on if a day had more than 24h
MIT License

Historical ephemeral network data collection #7

Open 0xB10C opened 1 month ago

0xB10C commented 1 month ago

I'm often asked about historical data. For example:

It would be good to have a tool (could be a simple, timer-based script that just dumps and compresses the getrawmempool/getblocktemplate RPC response of a default-config node) and a storage place for this. The goal should be to make the data accessible to the public and easy to find.
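A minimal sketch of such a timer-based collector, assuming `bitcoin-cli` is on PATH and can reach a default-config node (the function and file-naming scheme here are illustrative, not an existing tool):

```python
import gzip
import json
import subprocess
import time
from pathlib import Path

def fetch_rpc(method, *params):
    """Query the local node via bitcoin-cli (assumes it is on PATH and can reach the node)."""
    out = subprocess.run(["bitcoin-cli", method, *params],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def dump_snapshot(payload, name, outdir="."):
    """Write one RPC response as gzipped JSON, named by method plus UTC timestamp."""
    ts = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    path = Path(outdir) / f"{name}-{ts}.json.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f, separators=(",", ":"))
    return path

# Typical timer invocation (needs a running node):
#   dump_snapshot(fetch_rpc("getrawmempool", "true"), "getrawmempool")
#   dump_snapshot(fetch_rpc("getblocktemplate", '{"rules": ["segwit"]}'), "getblocktemplate")
```

Running this from a systemd timer or cron keeps the collector itself stateless; making the resulting files public is then just a matter of syncing the output directory somewhere findable.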

While some people might have collected historical data, it's scattered, not public, and not findable. I can only try to establish a contact between parties, which often fails.

A data-sink for ephemeral bitcoin network data.

0xB10C commented 1 month ago

IMO, the real problem here is the storage cost. My gut feeling is that mempool snapshots and block templates can grow the storage by gigabytes per week (depending on query intervals; I haven't tested this).
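The back-of-envelope arithmetic is easy to parameterize; the snapshot size in the example is a made-up placeholder, not a measurement:

```python
def weekly_storage_gb(snapshot_mb, interval_minutes):
    """Storage added per week for one snapshot type at a fixed query interval."""
    snapshots_per_week = 7 * 24 * 60 / interval_minutes
    return snapshot_mb * snapshots_per_week / 1024

# e.g. a hypothetical 10 MB compressed mempool dump every 10 minutes:
# weekly_storage_gb(10, 10) -> ~9.8 GB per week
```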

josibake commented 1 month ago

I believe we've spoken about this in the past. It's something I started on a while back, using bmon to collect logs, storing them in GCS, and exposing them via BigQuery, while also building aggregated views. Unfortunately, bmon is no longer running, so I put the project on hold until I could a) run bmon myself or b) find a suitable replacement.

Going forward

I've had this idea, affectionately named b-monster, which is to set up an event stream (e.g., Kafka, or something homegrown if Kafka is overkill) that would allow streaming bitcoind logs plus any other data format (e.g., USDT tracepoint data) as publishers. The event stream is the "source of truth" and could be archived to cheap cloud storage as needed, but would also allow consumers to process the event stream into more useful data products. One example is the historical mempool dataset I've worked on in the past, which aims to recreate historic views of what the mempool would have looked like at any given block.

The idea of having a large message bus at the center is that individuals could "join" as publishers and have their logs/data sent to the event stream. This would allow capturing ephemeral data from the perspective of many nodes, which is important for things like mempool events.
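The publish/subscribe shape of this design can be sketched with a toy in-process bus; in practice Kafka (or similar) would play this role, and the class, topic, and node names here are purely illustrative:

```python
import time
from collections import defaultdict

class EventBus:
    """Toy in-process stand-in for the central event stream."""
    def __init__(self):
        self.log = []                      # append-only "source of truth", archivable as-is
        self.consumers = defaultdict(list)  # topic -> list of consumer callbacks

    def publish(self, topic, payload, node_id):
        """Any node can join as a publisher; events are tagged with their origin."""
        event = {"topic": topic, "node": node_id, "ts": time.time(), "payload": payload}
        self.log.append(event)
        for fn in self.consumers[topic]:
            fn(event)

    def subscribe(self, topic, fn):
        """Consumers process the stream into derived data products."""
        self.consumers[topic].append(fn)
```

Because every event carries its originating node, a consumer can reconstruct per-node views — which is the point for mempool events, where different nodes genuinely see different things.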

Was planning to start work on this early 2025, let me know if you have any interest in collaborating!

0xB10C commented 1 month ago

This indeed sounds interesting and solves the same problem, yes. Thanks for commenting.

My personal setup would probably be a bit simpler (as I don't want to maintain a big "monster" :wink:): probably just a script running at an interval that dumps the getrawmempool/getblocktemplate JSONs somewhere for archival. Less granular, but probably cheaper in time and money. Both setups provide value IMO.
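The simpler setup could be as small as a crontab fragment; the paths and the 10-minute interval below are hypothetical, and `bitcoin-cli` is assumed to be on PATH with access to the node (note that `%` must be escaped as `\%` in crontab entries):

```shell
# Every 10 minutes: dump the verbose mempool and a block template, gzip-compressed.
*/10 * * * * bitcoin-cli getrawmempool true | gzip > /data/mempool/mempool-$(date -u +\%Y\%m\%dT\%H\%M\%SZ).json.gz
*/10 * * * * bitcoin-cli getblocktemplate '{"rules": ["segwit"]}' | gzip > /data/gbt/gbt-$(date -u +\%Y\%m\%dT\%H\%M\%SZ).json.gz
```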

In general, it might be good to invest some time thinking and talking about better Bitcoin Core interfaces for something like this, e.g. extending the ZMQ interface, (ab)using the multiprocess interface, ...
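For context on what the existing ZMQ interface already offers: a node started with `-zmqpubsequence=tcp://127.0.0.1:28332` publishes compact mempool/block events that a pyzmq SUB socket can consume. Below is a sketch of a parser for the `sequence` message body, following the format described in Bitcoin Core's doc/zmq.md; the pyzmq socket setup is omitted to keep the snippet self-contained:

```python
import struct

# Labels used by the `sequence` ZMQ topic (per doc/zmq.md):
# C/D = block connected/disconnected, A/R = tx added to / removed from mempool.
MEMPOOL_LABELS = {b"A", b"R"}

def parse_sequence_body(body: bytes):
    """Parse a `sequence` notification body into (label, hash_hex, mempool_seq)."""
    h, label = body[:32], body[32:33]
    # The hash arrives in little-endian; reverse it to the usual RPC display order.
    hash_hex = h[::-1].hex()
    seq = None
    if label in MEMPOOL_LABELS:
        # Mempool events carry an 8-byte little-endian sequence number.
        (seq,) = struct.unpack("<Q", body[33:41])
    return label.decode(), hash_hex, seq
```

This only yields hashes and ordering, not full transaction data, which is exactly why a richer interface (or a `getrawmempool`-style dump alongside it) would still be needed for historical archival.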