ethpandaops / xatu-data

MIT License

Xatu data

This dataset contains a wealth of information about the Ethereum network, including detailed data on beacon chain events, mempool activity, and canonical chain events. Read more in our announcement post.

This work is licensed under CC BY 4.0

[!IMPORTANT]
Join the Xatu Data Telegram group to stay up to date: https://t.me/+JanoQFu_nO8yNzQ1

Table of contents

Available data

| Dataset Name | Schema | Description | Prefix | EthPandaOps Clickhouse | Public Parquet Files |
| --- | --- | --- | --- | --- | --- |
| Beacon API Event Stream | Schema | Events derived from the Beacon API event stream | `beacon_api_` | ✔ | ✔ |
| Execution Layer P2P | Schema | Events from the execution layer p2p network | `mempool_` | ✔ | ✔ |
| Canonical Beacon | Schema | Events derived from the finalized beacon chain | `canonical_beacon_` | ✔ | ✔ |
| Canonical Execution | Schema | Data extracted from the execution layer | `canonical_execution_` | ✔ | ✔ |
| Consensus Layer P2P | Schema | Events from the consensus layer p2p network | `libp2p_` | ✔ | ✔ |
| MEV Relay | Schema | Events derived from MEV relays | `mev_relay_` | ✔ | ✔ |

Note: Public parquet files are available to everyone. Access to EthPandaOps Clickhouse is restricted. If you need access please reach out to us at ethpandaops at ethereum.org.

Check out the visual representation of the extraction process.

Schema

For a detailed description of the data schema, please refer to the Schema Documentation.

Working with the data

Public data is available in the form of Apache Parquet files. You can use any tool that supports the Apache Parquet format to query the data.
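The public files follow a predictable URL layout, so you can construct the path for any table and date yourself. The helper below is our own illustrative sketch (the `daily_partition_url` name is not part of Xatu); the layout matches the example URLs used later in this document:

```python
from datetime import date

# Base location of the public Xatu parquet files for mainnet.
BASE_URL = "https://data.ethpandaops.io/xatu/mainnet/databases/default"

def daily_partition_url(table: str, day: date) -> str:
    """Build the URL of a daily parquet partition for a given table.

    Daily partitions are laid out as <table>/<year>/<month>/<day>.parquet,
    with no zero-padding on the month or day.
    """
    return f"{BASE_URL}/{table}/{day.year}/{day.month}/{day.day}.parquet"

url = daily_partition_url("beacon_api_eth_v1_events_block", date(2024, 3, 20))
print(url)
```

Any Parquet-capable tool (pandas, DuckDB, `clickhouse local`, and so on) can then be pointed at the resulting URL.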

If you have access to EthPandaOps Clickhouse you can query the data directly. Skip ahead to Using EthPandaOps Clickhouse.

Getting started

There are a few ways to get started with the data. First, install the dependencies:

  1. Install docker
  2. Verify the installation by running the following command:
      docker version

Choose your data access method

There are three options to get started with the data, all of them using Clickhouse.

Running your own Clickhouse

Running your own Clickhouse cluster is recommended for most use cases. This process will walk you through the steps of setting up a cluster with the Xatu Clickhouse migrations and importing the data straight from the public parquet files.

Using EthPandaOps Clickhouse

The EthPandaOps Clickhouse cluster already has the data loaded and the schema migrations applied. You can query the data directly. If you need access please reach out to us at ethpandaops at ethereum.org. Access is limited.
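If you have credentials, a thin Python client works well for ad-hoc queries. The sketch below uses the `clickhouse-connect` library; the host, username, and password are placeholders, and the column names are taken from the Xatu schema — check the Schema Documentation before relying on them:

```python
# Sketch: querying EthPandaOps Clickhouse with the clickhouse-connect
# library (pip install clickhouse-connect). Host and credentials below
# are placeholders -- use the details you received when granted access.
QUERY = """
SELECT count(*) AS blocks, meta_consensus_implementation
FROM beacon_api_eth_v1_events_block
WHERE slot_start_date_time >= now() - INTERVAL 1 DAY
GROUP BY meta_consensus_implementation
"""

def run_query():
    import clickhouse_connect
    client = clickhouse_connect.get_client(
        host="clickhouse.example.com",  # placeholder host
        username="your-user",           # placeholder credentials
        password="your-password",
        secure=True,
    )
    return client.query(QUERY).result_rows

# Call run_query() once your credentials are configured, e.g.:
#   for row in run_query():
#       print(row)
```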

Querying public parquet files

Querying the public parquet files directly is a great way to get started with the data, but we recommend against it for large queries or queries you'll run repeatedly.

Examples:

```bash
docker run --rm -it clickhouse/clickhouse-server clickhouse local --query="
  SELECT
    count(*), meta_consensus_implementation
  FROM url('https://data.ethpandaops.io/xatu/mainnet/databases/default/beacon_api_eth_v1_events_block/2024/3/20.parquet', 'Parquet')
  GROUP BY meta_consensus_implementation
  FORMAT Pretty
"
```

```bash
docker run --rm -it clickhouse/clickhouse-server clickhouse local --query="
  SELECT
      count(*),
      extra_data_string
  FROM url('https://data.ethpandaops.io/xatu/mainnet/databases/default/canonical_execution_block/1000/{20000..20010}000.parquet', 'Parquet')
  WHERE
      block_number BETWEEN 20000000 AND 20010000
  GROUP BY extra_data_string
  ORDER BY count(*) DESC
  LIMIT 5
  FORMAT Pretty
"
```
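The `canonical_execution_*` tables are partitioned by block-number ranges of 1,000 blocks rather than by day. The helper below is our own illustrative sketch that expands a block range into the matching chunk files, mirroring the `{20000..20010}000` brace expansion in the query above:

```python
BASE_URL = "https://data.ethpandaops.io/xatu/mainnet/databases/default"
CHUNK_SIZE = 1000  # canonical_execution_* tables are chunked per 1,000 blocks

def chunk_urls(table: str, first_block: int, last_block: int) -> list[str]:
    """List the parquet chunk files covering an inclusive block range."""
    first_chunk = first_block // CHUNK_SIZE * CHUNK_SIZE
    last_chunk = last_block // CHUNK_SIZE * CHUNK_SIZE
    return [
        f"{BASE_URL}/{table}/{CHUNK_SIZE}/{start}.parquet"
        for start in range(first_chunk, last_chunk + 1, CHUNK_SIZE)
    ]

urls = chunk_urls("canonical_execution_block", 20_000_000, 20_010_000)
print(len(urls))  # 11 chunk files cover blocks 20,000,000..20,010,000
```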

Examples

Once your Clickhouse server is set up and the data is imported, you can query the data.

Queries

Jupyter Notebooks

There are examples for Parquet and for Clickhouse with SQLAlchemy in the examples/parquet and examples/clickhouse directories respectively.

Contribute to Xatu data

We're excited to announce that we are opening up the Xatu data collection pipeline to the Ethereum community! This initiative enables community members to contribute valuable data to the Xatu dataset.

As discussions about a potential increase in the maximum blob count continue, we hope to shed light on the perspective of Ethereum's most crucial participants: home stakers.


Data Collection

Overview

Data is collected by running a Beacon node and the xatu sentry sidecar. The data is then sent to a pipeline that we run, which further anonymizes and redacts the data.

```mermaid
graph TD
    A1[Home Staker 1] --> B1[Beacon Node]
    A2[You!] --> B2[Beacon Node]
    A3[Home Staker 3] --> B3[Beacon Node]
    B1 --> X1[Xatu Sentry]
    B2 --> X2[Xatu Sentry]
    B3 --> X3[Xatu Sentry]
    C[EthPandaOps]
    C --> D[Data Pipeline]

    D --> E[Public Parquet Files]

    X1 --> C
    X2 --> C
    X3 --> C

    subgraph "Data Collection"
        A1
        A2
        A3
        B1
        B2
        B3
        X1
        X2
        X3
    end

    subgraph " "
        C
        D
    end

    subgraph " "
        E
    end
    linkStyle 0 stroke:#f66,stroke-width:2px;
    linkStyle 1 stroke:#f66,stroke-width:2px;
    linkStyle 2 stroke:#f66,stroke-width:2px;
    linkStyle 3 stroke:#f66,stroke-width:2px;
    linkStyle 4 stroke:#f66,stroke-width:2px;
    linkStyle 5 stroke:#f66,stroke-width:2px;
    linkStyle 6 stroke:#f66,stroke-width:2px;
    linkStyle 7 stroke:#f66,stroke-width:2px;
    linkStyle 8 stroke:#f66,stroke-width:2px;
    linkStyle 9 stroke:#f66,stroke-width:2px;
    linkStyle 10 stroke:#f66,stroke-width:2px;
```

Events Collected

The following events will be collected:

Metadata

The following additional metadata is sent with every event:

Client Metadata
```yaml
clock_drift: '2' # Clock drift of the host machine
ethereum:
    consensus:
        implementation: lighthouse # Beacon node implementation
        version: Lighthouse/v5.3.0-d6ba8c3/x86_64-linux # Beacon node version
    network:
        id: '11155111' # Ethereum network ID
        name: sepolia # Ethereum network name
id: 98df53c0-3de0-477c-a7c9-4ea9b17981c3 # Session ID. Resets on restart
implementation: Xatu
module_name: SENTRY
name: b538bfd92sdv3 # Name of the sentry. Hash of the Beacon Node's node ID.
os: linux # Operating system of the host running sentry
version: v0.0.202-3645eb8 # Xatu version
```
Server Metadata

Once we receive the event, we do some additional processing to derive the server metadata. The metadata added to the event is configurable per user, allowing users to disclose only the data they're comfortable with. Geo location data is very useful for understanding how data propagates through the network, but it is not required.

```yaml
server:
  client:
    geo:
      # OPTIONAL FIELDS
      ## Data about ISP
      autonomous_system_number: 24940 # Autonomous system number of the client
      autonomous_system_organization: "Hetzner Online GmbH" # Organization associated with the autonomous system

      ## Data about location
      city: "Helsinki" # City where the client is located
      continent_code: "EU" # Continent code of the client's location
      country: "Finland" # Country where the client is located
      country_code: "FI" # Country code of the client's location

      ### ALWAYS REDACTED
      latitude: REDACTED # Latitude coordinate of the client's location
      longitude: REDACTED # Longitude coordinate of the client's location
    group: "asn-city" # Group the client belongs to
    user: "simplefrog47" # Pseudo username that sent the event
    # ALWAYS REDACTED
    ip: "REDACTED" # IP address of the client that sent the event
  event:
    received_date_time: "2024-10-04T03:00:48.533351629Z" # Timestamp when the event was received
```


Privacy groups

Privacy is a top priority for us. We have created privacy groups to allow users to only disclose data they're comfortable with.

No additional Geo/ASN data

```yaml
autonomous_system_number: REDACTED # REDACTED
autonomous_system_organization: REDACTED # REDACTED
city: "REDACTED" # REDACTED
country: "REDACTED" # REDACTED
country_code: "REDACTED" # REDACTED
continent_code: "REDACTED"
```

With ASN data

Share geo location down to the city level

```yaml
autonomous_system_number: 24940
autonomous_system_organization: "Hetzner Online GmbH"
city: "Helsinki"
continent_code: "EU"
country: "Finland"
country_code: "FI"
```

Share geo location down to the country level

```yaml
autonomous_system_number: 24940
autonomous_system_organization: "Hetzner Online GmbH"
continent_code: "EU"
country: "Finland"
country_code: "FI"
city: "REDACTED" # REDACTED
```

Share geo location down to the continent level

```yaml
autonomous_system_number: 24940
autonomous_system_organization: "Hetzner Online GmbH"
continent_code: "EU"
city: "REDACTED" # REDACTED
country: "REDACTED" # REDACTED
country_code: "REDACTED" # REDACTED
```

Share no geo location data

```yaml
autonomous_system_number: 24940
autonomous_system_organization: "Hetzner Online GmbH"
continent_code: "REDACTED" # REDACTED
city: "REDACTED" # REDACTED
country: "REDACTED" # REDACTED
country_code: "REDACTED" # REDACTED
```

Without ASN data

Share geo location down to the city level without ASN

```yaml
city: "Helsinki"
continent_code: "EU"
country: "Finland"
country_code: "FI"
autonomous_system_number: REDACTED # REDACTED
autonomous_system_organization: REDACTED # REDACTED
```

Share geo location down to the country level without ASN

```yaml
continent_code: "EU"
country: "Finland"
country_code: "FI"
autonomous_system_number: REDACTED # REDACTED
autonomous_system_organization: REDACTED # REDACTED
city: "REDACTED" # REDACTED
```

Share geo location down to the continent level without ASN

```yaml
continent_code: "EU"
autonomous_system_number: REDACTED # REDACTED
autonomous_system_organization: REDACTED # REDACTED
city: "REDACTED" # REDACTED
country: "REDACTED" # REDACTED
country_code: "REDACTED" # REDACTED
```
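The privacy groups above boil down to a small redaction rule: strip the ASN fields when the group excludes them, then redact every location field finer than the chosen granularity. The function below is our own illustrative reimplementation, not the pipeline's actual code; the `"<asn|noasn>-<level>"` group naming is an assumption extrapolated from the `group: "asn-city"` value in the server metadata example:

```python
# Illustrative sketch of privacy-group redaction -- not the actual
# pipeline code. Group names are assumed to follow the shape
# "<asn|noasn>-<none|continent|country|city>".
ASN_FIELDS = ["autonomous_system_number", "autonomous_system_organization"]
# Location fields kept at each granularity, from coarsest to finest.
LOCATION_LEVELS = {
    "none": [],
    "continent": ["continent_code"],
    "country": ["continent_code", "country", "country_code"],
    "city": ["continent_code", "country", "country_code", "city"],
}

def apply_privacy_group(geo: dict, group: str) -> dict:
    """Redact geo/ASN fields according to a privacy group like 'asn-city'."""
    asn_part, level = group.split("-")
    keep = set(LOCATION_LEVELS[level])
    if asn_part == "asn":
        keep.update(ASN_FIELDS)
    return {k: (v if k in keep else "REDACTED") for k, v in geo.items()}

geo = {
    "autonomous_system_number": 24940,
    "autonomous_system_organization": "Hetzner Online GmbH",
    "city": "Helsinki",
    "continent_code": "EU",
    "country": "Finland",
    "country_code": "FI",
}
print(apply_privacy_group(geo, "asn-country"))
```

Note that fields like latitude, longitude, and IP address are always redacted upstream, regardless of group.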

Get Started

Contributing to the Xatu dataset is currently restricted to known community members. We have plans to open this up to the public in the future, but for now, we want to ensure that the data remains high quality and relevant to the home staker community (read: we need to make sure our pipeline can handle the increased load 😂)

If you'd like to contribute to the Xatu dataset, please apply for access here

Once you've been granted access, you'll receive instructions on how exactly to run xatu sentry and start contributing to the dataset.

Docker

If you're already running a beacon node, running xatu sentry is as simple as running a docker container on your node. For example:

```bash
# Replace --beacon-node-url with your beacon node URL and
# --output-authorization with your output authorization key.
docker run -d \
  --name xatu-sentry \
  --restart unless-stopped \
  --cpus="0.5" \
  --memory="1g" \
  --read-only \
  ethpandaops/xatu:latest sentry \
  --preset ethpandaops \
  --beacon-node-url=http://localhost:5052 \
  --output-authorization=REDACTED
```

Rocketpool

If you're running a Rocketpool node, you can contribute to the Xatu dataset by running xatu sentry with the following command:

```bash
docker run -d \
  --name xatu-sentry \
  --restart unless-stopped \
  --cpus="0.5" \
  --memory="1g" \
  --read-only \
  --network=rocketpool_net \
  ethpandaops/xatu:latest sentry \
  --preset ethpandaops \
  --beacon-node-url=http://eth2:5052 \
  --output-authorization="REDACTED" # Replace with your output authorization key
```

Binary

You can download the binary from our GitHub Releases page or use the install script.

Once you have the xatu binary, you can run it with the following command:

```bash
# Replace --beacon-node-url with your beacon node URL and
# --output-authorization with your output authorization key.
xatu sentry \
  --preset ethpandaops \
  --beacon-node-url=http://localhost:5052 \
  --output-authorization=REDACTED
```

License

Maintainers

Sam - @samcmau

Andrew - @savid