**lgo** opened this issue 4 years ago
Single jar vs. multiple external jars is something that keeps coming up once in a while. The biggest challenge is dealing with the various versions of execution frameworks (Spark, Hadoop), file systems (HDFS, S3, GCS, ADLS), and data formats (JSON, CSV, Parquet, Avro, Protobuf), etc.

Creating one jar for each combination is not scalable. That's the reason we added the concept of plugins and allowed users to include only the plugins needed.

One other alternative is to provide a standalone tool that takes plugin names as input and creates an uber jar. WDYT?
While trying to set up ingestion, I ran across the `-propertyFile` and `-values` arguments in `LaunchDataIngestionJobCommand`. I didn't see anything in the docs about templating the job spec, and only later found one reference describing it in https://github.com/pinot-contrib/pinot-docs/blob/eb9a8a07687bfe78b022ba0825123fd43e316795/operators/cli.md. This would be helpful to document, and it would also answer questions such as what the format of the `propertyFile` is.

My particular use case where this is great is a setup where ingestion jobs (via Spark or Hadoop) are only distributed as a single JAR (compiled with deps, rather than distributed with external JARs), and hooking up external file dependencies is a pain. For this, I'd ideally like to (1) bundle a basic configuration file as a resource in the JAR or as a separate distribution, and (2) provide any overrides at run-time via parameters (e.g. by a scheduler application).
(Separately, it looks like using JAR resources for arguments like that isn't supported 🤔 I'll have to look a bit more into whether or not that's necessary. Is that normally a sane thing for Java applications to support?)
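As a workaround for the JAR-resource limitation, one option (not Pinot-specific; the class and resource names here are illustrative) is to copy the bundled resource to a temp file at startup and pass that path to the flag that expects a real file:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ResourceToFile {
    // Copies a classpath resource (e.g. a default job spec bundled into the
    // uber jar) to a temp file, so its path can be handed to a CLI argument
    // like -jobSpecFile or -propertyFile that expects a filesystem path.
    static Path materializeResource(String resourceName) throws IOException {
        try (InputStream in = ResourceToFile.class.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("resource not found: " + resourceName);
            }
            Path tmp = Files.createTempFile("bundled-config-", ".tmp");
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }
}
```

This keeps the "single JAR plus run-time overrides" workflow working without the CLI itself having to understand classpath URLs.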