Open bklaubos opened 4 years ago
What is "non-standard" about tar files? Do you not want tar? CSV? Something else?
CSV, JSON, or any "flat" file can be tarred up; that is not the issue. The issue is that writing/reading them, merging artifacts, and querying them gets quite slow. A few files might not hurt, but a single workflow that produces a few thousand artifacts, each from a few MB up to 500 MB, is going to take a hit. Storing and reading in better Hadoop file formats would allow faster processing. Most ETL and BI tools need a SQL data source, and transforming CSV, JSON, Avro, ORC, or Parquet into a SQL stream is the job of a platform like PrestoDB/PrestoSQL. See Processing parquet files in Golang.
Would you like to PoC this?
Let me think about it.
Summary
Currently, Argo stores whatever artifact the end user produces by tarring it and persisting it to a storage backend of choice, such as S3. This yields an inefficient file format that cannot be used effectively by distributed SQL engines like PrestoDB/PrestoSQL. Argo's serialization/deserialization mechanism to/from the workflow artifact storage could take care of reading/writing a configurable default file format.
Use Cases
Reference
Presto Hive Connector
Big Data Formats Demystified
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.