bklaubos commented 4 years ago

Summary

Currently, Argo stores whatever artifact the end user produces by tarring it and persisting it to a data store of choice, such as S3. This produces an inefficient file format that can't be used effectively by distributed SQL engines like PrestoDB/PrestoSQL. Argo's serialization/deserialization mechanism to/from the workflow artifact storage could take care of reading/writing in a default configured file format.

Use Cases

We use the AWS and Azure cloud providers.
In each cloud provider, we have several Argo clusters.
We run computation experiments across all clusters and all cloud providers.
We need to be able to aggregate, transform, and query results across them.

Reference

Presto Hive Connectors
Big Data Formats Demystified

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

alexec commented 4 years ago

What is "non-standard" about tar files? Do you not want tar? CSV? Something else?

bklaubos commented 4 years ago

What is "non-standard" about tar files? Do you not want tar? CSV? Something else?

CSV, JSON, or any "flat" file can be tarred up; that is not the issue. The issue is that writing, reading, merging, and querying such artifacts gets quite slow. A few might not hurt, but a single workflow that produces a few thousand artifacts, each from a few to 500MB, is going to take a hit. Storing and reading them in better Hadoop file formats would allow faster processing. Most ETL and BI tools need a SQL data source. Transforming CSV, JSON, AVRO, ORC, or Parquet into a SQL stream is the job of a platform like PrestoDB/PrestoSQL. See Processing parquet files in Golang.
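To make the cost concrete, here is a minimal sketch of answering one aggregate question over tarred CSV artifacts. It assumes the artifacts are gzip-compressed tar archives containing CSV files; the helper names `make_tgz` and `sum_column` and the sample column `score` are hypothetical and not part of Argo. The point it illustrates is that every archive has to be decompressed and every row fully parsed even when only one column is needed, which is the work a columnar format such as ORC or Parquet is designed to avoid.

```python
import csv
import io
import tarfile


def make_tgz(files):
    """Build an in-memory .tgz from {name: text}, standing in for a stored artifact."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def sum_column(artifacts, column):
    """Sum one numeric column across many tarred CSV artifacts.

    Each archive is decompressed in full and each CSV row is parsed in full,
    although only a single column contributes to the result.
    Returns (total, number_of_rows_read).
    """
    total = 0.0
    rows = 0
    for blob in artifacts:
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
            for member in tar:
                # Skip directories and anything that is not a CSV file.
                if not member.isfile() or not member.name.endswith(".csv"):
                    continue
                raw = tar.extractfile(member)
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    total += float(row[column])
                    rows += 1
    return total, rows


if __name__ == "__main__":
    # Two tiny hypothetical artifacts, as if produced on two different clusters.
    a = make_tgz({"run1.csv": "experiment,score\nA,1.5\nB,2.5\n"})
    b = make_tgz({"run2.csv": "experiment,score\nC,4.0\n"})
    print(sum_column([a, b], "score"))
```

With a few thousand artifacts of up to 500MB each, this loop reads every byte of every archive for each query, whereas a SQL engine over a columnar format can read only the columns a query touches.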

alexec commented 4 years ago

Would you like to PoC this?

bklaubos commented 4 years ago

Let me think about it.

argoproj / argo-workflows

Support for Hadoop file format types when reading/writing artifacts: ORC, AVRO, Parquet #4163
