apibara / dna

Apibara is the fastest platform to build production-grade indexers that connect onchain data to web2 services.
https://www.apibara.com/

Write Parquet datasets to remote storage #315

Closed: fracek closed this issue 6 months ago

fracek commented 8 months ago

Is your feature request related to a problem? Please describe.

When running the parquet sink in the cloud, it's annoying to have to use persistent disks and manually upload the parquet files to S3. We should have a way to automatically upload parquet files as they're produced.

Describe the solution you'd like

If the user specifies an --output-dir that starts with s3://, write to that S3 bucket + subdirectory. If the output dir doesn't have any prefix or the prefix is file://, write to file (current behaviour).
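A rough sketch of how this prefix detection could look; OutputLocation and parse_output_dir are names made up here for illustration, not part of the existing sink:

use std::path::PathBuf;

/// Where the sink should write its parquet files.
enum OutputLocation {
    /// An S3 bucket plus an optional key prefix.
    S3 { bucket: String, prefix: String },
    /// A local directory (current behaviour).
    File(PathBuf),
}

/// Decide the backend from the --output-dir value.
fn parse_output_dir(output_dir: &str) -> OutputLocation {
    if let Some(rest) = output_dir.strip_prefix("s3://") {
        // Split "bucket/sub/dir" into the bucket name and the key prefix.
        let (bucket, prefix) = match rest.split_once('/') {
            Some((bucket, prefix)) => (bucket.to_string(), prefix.to_string()),
            None => (rest.to_string(), String::new()),
        };
        OutputLocation::S3 { bucket, prefix }
    } else {
        // No scheme, or an explicit file://, means the local filesystem.
        let path = output_dir.strip_prefix("file://").unwrap_or(output_dir);
        OutputLocation::File(PathBuf::from(path))
    }
}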

Additional context

It's probably enough to change write_batch to serialize the data into a BytesMut buffer, then either write the bytes to a file or upload them to S3.
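For example, the serialization step could look something like this (a sketch using the arrow and parquet crates; serialize_batch is a hypothetical helper, and a plain Vec<u8> stands in for BytesMut for brevity):

use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

// Serialize a record batch to parquet bytes in memory, instead of writing
// straight to a file on disk.
fn serialize_batch(batch: &RecordBatch) -> parquet::errors::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?;
    // The caller can now write `buffer` to a local file or upload it to S3.
    Ok(buffer)
}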

We need a trait like the following:

pub trait DatasetWriter {
  async fn write_parquet(&mut self, filepath: impl Into<String>, data: &[u8]) -> Result<()>;
}

where filepath is the full path relative to the writer root (the path or bucket specified by the user) and data is the serialized content (the output of the parquet writer in the current implementation).
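A hypothetical sketch of two backends for that trait, assuming a recent Rust with async fn in traits, tokio for file IO, the aws-sdk-s3 crate, and anyhow::Result as the trait's error type; FileDatasetWriter and S3DatasetWriter are illustrative names, and the actual implementation may differ:

use std::path::PathBuf;

use anyhow::Result;

// Local filesystem backend (current behaviour).
pub struct FileDatasetWriter {
    root: PathBuf,
}

impl DatasetWriter for FileDatasetWriter {
    async fn write_parquet(&mut self, filepath: impl Into<String>, data: &[u8]) -> Result<()> {
        let path = self.root.join(filepath.into());
        if let Some(parent) = path.parent() {
            tokio::fs::create_dir_all(parent).await?;
        }
        tokio::fs::write(path, data).await?;
        Ok(())
    }
}

// S3 backend: uploads the serialized parquet file under the configured prefix.
pub struct S3DatasetWriter {
    client: aws_sdk_s3::Client,
    bucket: String,
    prefix: String,
}

impl DatasetWriter for S3DatasetWriter {
    async fn write_parquet(&mut self, filepath: impl Into<String>, data: &[u8]) -> Result<()> {
        let key = format!("{}/{}", self.prefix.trim_end_matches('/'), filepath.into());
        self.client
            .put_object()
            .bucket(&self.bucket)
            .key(key)
            .body(aws_sdk_s3::primitives::ByteStream::from(data.to_vec()))
            .send()
            .await?;
        Ok(())
    }
}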

bigherc18 commented 7 months ago

This is already done in #320