Using BigQuery for big block data analysis

staheri14 commented 7 months ago

Problem

During a test involving 100 nodes, we accumulate approximately 100 GB of trace data. Analyzing this data typically requires either downloading it locally to run queries or selectively fetching the subsets relevant to the analysis. Both approaches are constrained by the limits of an individual machine, such as CPU capacity and disk space. To avoid this, we propose using BigQuery. This would let us keep the data in the cloud and query it there, without downloading it or relying on the limited resources of individual devices.

Acceptance Criteria

This task involves two primary objectives:

Identifying all the requirements necessary for interfacing with our data using BigQuery.
Modifying or extending the tracing push configuration to enable direct data upload to BigQuery.

staheri14 commented 6 months ago

I'd like to share some updates:

BigQuery supports the JSON data format, which is compatible with our new locally traced logs format.
I've successfully loaded a traced log file into the BigQuery console.
BigQuery can create a table from a JSON file using either a user-supplied schema or automatic schema detection. I opted for the latter, and it worked seamlessly. Here is a sample schema extracted from mempool_tx.json.
We can execute SQL-like queries in BigQuery. For example, here's a query and its outcome:
```
SELECT * FROM `knuu-422119.node_logs.mempool_tx` LIMIT 1000
```
BigQuery also supports creating notebooks, similar to Jupyter notebooks. I plan to explore this feature further and will provide more details soon.
All operations can be performed via the web UI, eliminating the need to download data locally.
Additionally, BigQuery can connect to external resources, such as an Amazon S3 bucket, to load data. However, I opted to upload the data manually.
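
For reference, the console upload with schema auto-detection described above could also be scripted. The following is only a rough sketch, not something I have run against our setup: it assumes the `google-cloud-bigquery` Python client is installed, that Application Default Credentials are configured, and that the traced log file is newline-delimited JSON (one object per line), which is the JSON layout BigQuery load jobs accept. The helper names `to_ndjson` and `load_ndjson` are made up for illustration.

```python
import json
from pathlib import Path


def to_ndjson(records):
    """Serialize an iterable of dicts as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


def load_ndjson(path, table_id):
    """Load a newline-delimited JSON file into a BigQuery table, letting BigQuery detect the schema."""
    # Imported lazily so to_ndjson() stays usable without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes Application Default Credentials
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # same choice as the auto-detect option in the console
    )
    with Path(path).open("rb") as f:
        job = client.load_table_from_file(f, table_id, job_config=job_config)
    job.result()  # wait for the load job; raises if it failed
    return client.get_table(table_id).num_rows
```

A call would look like `load_ndjson("mempool_tx.json", "knuu-422119.node_logs.mempool_tx")`, reusing the table from the query above. `to_ndjson` is only needed if a log file is stored as a single JSON array and has to be converted first.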

I'll keep you updated as I explore more features.

staheri14 commented 6 months ago

After further investigation, I discovered that to use Jupyter notebooks with BigQuery, we need to use Vertex AI Workbench. This managed service provided by Google Cloud offers the following capabilities that suit our use case:

A Jupyter-based IDE that is pre-configured with popular data science and machine learning libraries.
Seamless integration with other Google Cloud services, including BigQuery.
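
As a rough idea of what a notebook cell could look like, here is a minimal sketch. It assumes the `google-cloud-bigquery` client plus pandas are available in the notebook environment and that the notebook's credentials can read the dataset. `build_query` and `run_query` are hypothetical helper names, and the table ID check is just a conservative pattern I chose, not BigQuery's full naming rules.

```python
import re

# Conservative pattern for project.dataset.table (an assumption, not BigQuery's full rules).
_TABLE_ID = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_-]+$")


def build_query(table_id, limit=1000):
    """Build a simple SELECT over one table, rejecting unexpected table IDs."""
    if not _TABLE_ID.match(table_id):
        raise ValueError(f"unexpected table ID: {table_id!r}")
    return f"SELECT * FROM `{table_id}` LIMIT {int(limit)}"


def run_query(sql):
    """Run a query in BigQuery and return the result as a pandas DataFrame."""
    # Imported lazily so build_query() works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes the notebook's default credentials
    return client.query(sql).to_dataframe()
```

In a notebook, `df = run_query(build_query("knuu-422119.node_logs.mempool_tx"))` would fetch only the query result into memory, so the full dataset stays in BigQuery.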

celestiaorg / celestia-app

Using BigQuery for big block data analysis #3389
