dewi-alliance / dewi-etl

Publicly available Helium blockchain data - https://etl.dewi.org
MIT License

Create regular public database snapshots #3

Closed · jamiew closed this issue 1 year ago

jamiew commented 3 years ago

Discussed with spillere --

For fast ETL spinup we might actually need the full badgerdb too for the ETL, rather than just a simple ledger snapshot. I forget.

dansku commented 3 years ago

It will be great to share the results here of how long it takes to build the snapshot and how big the files are. We also need to find a solution for where to host them and build the scripts to automate everything.

Just to leave it here: my current blockchain-etl database is 35 GB.

jamiew commented 2 years ago

@dansku did this:

http://etl-snapshots.dewi.org/

mfalkvidd commented 2 years ago

Has DeWi considered using the compressed output format of pg_dump for the ETL dumps? That would avoid the extra step of zipping the dump (on DeWi's side) and unzipping it (on the restore side), and would save time and disk space when restoring.

dansku commented 2 years ago

Has DeWi considered using the compressed output format of pg_dump for the ETL dumps? That would avoid the extra step of zipping the dump (on DeWi's side) and unzipping it (on the restore side), and would save time and disk space when restoring.

could you provide a dump/restore command to test?

mfalkvidd commented 2 years ago

Sorry, I can't. I've spent two weeks trying to follow various guides and advice on Discord, but my ETL instance is still far from functional.

The pg_dump man page mentions the -Z flag for compression though, so adding compression should be as easy as adding the -Z flag.
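
Untested on my end, but based on the man page something along these lines ought to work (host, user, and database names below are placeholders):

# Custom-format dump; -Fc compresses by default and -Z sets the compression level
pg_dump -Fc -Z 9 -h ETL_HOST -U etl_user -d etl_db -f etl.dump

# Restore directly from the compressed dump, no separate unzip step
pg_restore -h localhost -U etl_user -d etl_db --no-owner etl.dump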

KnightAR commented 2 years ago

@dansku did this:

http://etl-snapshots.dewi.org/

The site is giving a 500 Internal Server Error. :(

davetapley commented 1 year ago

Found this via needing to do https://github.com/davetapley/helium-tax/issues/83 ahead of the new tax year in the USA. I see https://etl-snapshots.dewi.org/ is still going, but with a 1 TB file 😱

Just an idea: I've been using https://duckdb.org/ and its ability to use Parquet files for another project (migrating from Postgres for OLAP work), and the file sizes are staggeringly smaller.

It gets even more mind-bending when you discover DuckDB can reference non-local Parquet files (e.g. on GCS) and, thanks to the format, can run queries against them without pulling the whole file 🤯
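
For example, something like this (hypothetical object path, and it assumes DuckDB's httpfs extension) only fetches the Parquet footer plus the row groups that can match the filter:

INSTALL httpfs;
LOAD httpfs;
-- hypothetical GCS path and arbitrary filter, for illustration only
SELECT block, price
FROM read_parquet('https://storage.googleapis.com/etl-snapshots/oracle_prices.parquet')
WHERE block > 1000000;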

I see the ETL page @dansku built uses https://storage.googleapis.com (assuming that's GCS?), so it might not be too difficult to generate parquet files there instead.

Happy to do some work on it, if someone can point me to where http://etl-snapshots.dewi.org/ comes from?

jamiew commented 1 year ago

spillere would know more but I believe they are just Postgres dumps from the DeWi ETL server, which is private.

If you were able to provide a script or similar to dump to Parquet, we could probably run it on your behalf, but we can't provide direct access to the machine AFAIK.

dansku commented 1 year ago

Indeed we can do that @davetapley

davetapley commented 1 year ago

Awesome! It's a bit hard to test without access to the DeWi ETL server, but I can outline a script here and 🤞🏻 it'll be pretty easy to get running.

I just need oracle price history for the purposes of https://github.com/davetapley/helium-tax/issues/83, so perhaps start with the oracle_prices table and add others as/when required?


Script:

  1. Download https://github.com/duckdb/duckdb/releases/download/v0.6.1/duckdb_cli-linux-amd64.zip
  2. Unzip and get duckdb binary
  3. Save SQL below into parquet_dump.sql (and set DeWi ETL server creds)
  4. Invoke with ./duckdb < parquet_dump.sql
  5. Copy exported parquet_dump folder to http://etl-snapshots.dewi.org/

parquet_dump.sql:
INSTALL postgres;
LOAD postgres;
-- DeWi ETL server creds here:
CALL postgres_attach('host=HOST dbname=DBNAME user=USER password=PASS');

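-- Stage just the columns we need in a local DuckDB table; EXPORT DATABASE below then writes it out as Parquet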
CREATE TABLE prices (block BIGINT, price BIGINT);
INSERT INTO prices (block, price) SELECT block, price FROM oracle_prices;
EXPORT DATABASE 'parquet_dump' (FORMAT PARQUET);

There's probably a way to go straight from Postgres to Parquet, but I figure it'd require a Postgres extension, and I know this approach works and only requires the duckdb binary, which is nice.
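
For a single table there's also a shorter path inside DuckDB (untested here) that skips the staging table and EXPORT DATABASE and writes one Parquet file directly:

-- after the same INSTALL / LOAD / postgres_attach as above
COPY (SELECT block, price FROM oracle_prices) TO 'oracle_prices.parquet' (FORMAT PARQUET);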

dansku commented 1 year ago

oracle_prices.csv

this data is quite small

davetapley commented 1 year ago

@dansku nice. Is there any way we can get that on http://etl-snapshots.dewi.org/?

Also: want to try the Parquet dump (as shown above) for rewards?
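
If it helps, extending the dump should just be one more COPY per table after the postgres_attach call above, e.g. (assuming the ETL table is literally named rewards; untested):

COPY (SELECT * FROM rewards) TO 'rewards.parquet' (FORMAT PARQUET);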

davetapley commented 1 year ago

@dansku could you dump that .csv one more time so I have all of 2022? 🙏🏻

Quite a lot of people are asking for it over on:

which is blocked on:

dansku commented 1 year ago

Hey @davetapley, adding the oracle price dump again: oracle_prices.csv

you can also do tax reporting with hotspotty.net