GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

FlinkCluster sidecar / BeamJob container : mapped drive handling for input vs output parameters #304

Closed vp999 closed 4 years ago

vp999 commented 4 years ago

I am trying to use a common FlinkCluster and run different types of jobs on the same cluster (one job at a time). Here is the setup:

1) FlinkCluster yaml for the cluster: the Data volume is mapped to etlresearchfileshare (screenshot attached).

2) job.yaml for the job: the Data volume is mapped to etlresearchfileshare/demo2 (screenshot attached). See the sketch after this list for a rough reconstruction of both mounts.
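For readers without access to the screenshots, the two mounts probably look roughly like the following. This is a minimal sketch, not the reporter's exact manifests: the secret name, the /mnt/data mount path, and the use of subPath: demo2 are assumptions.

```yaml
# FlinkCluster yaml (cluster level): the Data volume is the root of the
# etlresearchfileshare Azure Files share, mounted into the cluster/sidecar
# containers.
spec:
  taskManager:
    volumes:
      - name: data
        azureFile:
          secretName: azure-storage-secret   # hypothetical secret name
          shareName: etlresearchfileshare
    volumeMounts:
      - name: data
        mountPath: /mnt/data                 # assumed mount path
---
# job.yaml (job level): the same share, but scoped to the demo2 subdirectory
# via subPath, mounted at the same path inside the BeamJob container.
spec:
  job:
    volumes:
      - name: data
        azureFile:
          secretName: azure-storage-secret
          shareName: etlresearchfileshare
    volumeMounts:
      - name: data
        mountPath: /mnt/data
        subPath: demo2                       # only the job container sees .../demo2
```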

Parameters sent to the job: input and output (screenshot attached).
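The parameters are presumably passed through the job spec along these lines; /mnt/data is the assumed in-container mount path from the sketch above, not the exact value from the screenshot:

```yaml
spec:
  job:
    args:
      # Both paths are intended to resolve under the job's mounted Data volume.
      - --input=/mnt/data/input
      - --output=/mnt/data/output
```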

Issue: I have observed that the input parameter of the job is correctly evaluated to etlresearchfileshare/demo2/input, but the output parameter is evaluated against the volume mounted in the FlinkCluster, i.e. etlresearchfileshare/output. I can see the output file being created in etlresearchfileshare/output instead of etlresearchfileshare/demo2/output (screenshot attached).

There seems to be a discrepancy in how the input and output parameters are handled. Please note that there is no etlresearchfileshare/input folder, so the input parameter is correctly evaluated per the job yaml (i.e. etlresearchfileshare/demo2/input), but the output parameter is evaluated per the cluster yaml (i.e. etlresearchfileshare/output/).

functicons commented 4 years ago

Yep, the --input and --output parameters are interpreted by the Beam WordCount job; they are transparent to the operator. Usually in a prod environment, you want to use remote storage for both (e.g., HDFS, GCS, S3, Azure Blob Storage, etc.).
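For example, pointing the job at object storage removes the dependency on how volumes happen to be mounted in each container. The bucket name and paths below are placeholders:

```yaml
spec:
  job:
    args:
      # Placeholder URIs: any filesystem Beam supports works here, provided the
      # matching filesystem connector is on the job's classpath.
      - --input=gs://my-bucket/demo2/input/*
      - --output=gs://my-bucket/demo2/output/result
```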

vp999 commented 4 years ago

I used Azure File storage here. The problem is different: there are two containers, one for the sidecar and the other for the job, and each container has the drive mounted on it. I observed that for the output parameter the sidecar container's drive is used, while for the input parameter the job container's drive is used. It is a kind of mismatch in handling.

functicons commented 4 years ago

IIUC, Azure File storage is not Azure Blob storage; the former is mounted as a local file system while the latter is a distributed file system. Usually we want to use distributed file systems in prod.