Is your feature request related to a problem? Please describe.
When running the parquet sink in the cloud, it's annoying to have to use persistent disks and manually upload the parquet files to S3. We should have a way to automatically upload parquet files as they're produced.
Describe the solution you'd like
If the user specifies an --output-dir that starts with s3://, write to that S3 bucket + subdirectory. If the output dir doesn't have any prefix or the prefix is file://, write to file (current behaviour).
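The prefix dispatch could look something like this minimal sketch (the enum and function names here are illustrative assumptions, not existing code):

```rust
// Illustrative sketch: classify an --output-dir value by its scheme prefix.
// `OutputTarget` and `parse_output_dir` are hypothetical names.
#[derive(Debug)]
enum OutputTarget {
    /// An s3:// URL: the remainder is the bucket plus optional subdirectory.
    S3 { bucket_and_prefix: String },
    /// No prefix, or an explicit file:// prefix: local files (current behaviour).
    Local { path: String },
}

fn parse_output_dir(dir: &str) -> OutputTarget {
    if let Some(rest) = dir.strip_prefix("s3://") {
        OutputTarget::S3 { bucket_and_prefix: rest.to_string() }
    } else {
        let path = dir.strip_prefix("file://").unwrap_or(dir);
        OutputTarget::Local { path: path.to_string() }
    }
}
```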
Additional context
It's probably enough to change write_batch to serialize data to BytesMut, then write the bytes to a file or upload them to S3. We need a trait that takes a filepath and data, where filepath is the full path relative to the writer root (the path or bucket specified by the user) and data is the serialized content (the output of writer in the current implementation).
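A minimal sketch of such a trait, with the current file-writing behaviour as one implementation (all names here are assumptions, not the project's actual API):

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

// Hypothetical sketch: `BatchWriter` and `put` are illustrative names.
trait BatchWriter {
    /// `filepath` is the full path relative to the writer root (the local
    /// directory or S3 bucket the user specified); `data` is the serialized
    /// parquet bytes (the output of `writer` in the current implementation).
    fn put(&mut self, filepath: &str, data: &[u8]) -> io::Result<()>;
}

/// Current behaviour: write the serialized bytes to the local filesystem.
struct LocalWriter {
    root: PathBuf,
}

impl BatchWriter for LocalWriter {
    fn put(&mut self, filepath: &str, data: &[u8]) -> io::Result<()> {
        let dest = self.root.join(filepath);
        if let Some(parent) = dest.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(dest, data)
    }
}
```

An S3-backed implementation would issue a PutObject call (e.g. via the AWS SDK) with the same filepath as the object key, so the sink code only ever talks to the trait.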