AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0
185 stars · 94 forks

Is it possible to get count of records for data sources? #401

Open abhineet13 opened 2 years ago

abhineet13 commented 2 years ago

Background [Optional]

Similar to the Spark UI, which shows the record count for each stage, I would like to understand whether record counts can be made available in Spline lineage.

Question

Hi, is it possible to get the record count for data sources, and before/after transformations like filter and join?

wajda commented 2 years ago

Execution metrics are only available in the physical plan, while Spline's main focus is the logical one. There is no direct correlation between the two, so we cannot project physical-plan metrics onto the logical operators. The agent does collect some high-level read and write metrics, but they are associated with the job execution event as a whole, not with individual operators. The most we can do is preserve the original layout of those metrics as represented in the Spark physical plan (currently they are all combined).

{
  "_created": 1640809837062,
  "durationNs": 1000845670,
  "execPlanDetails": {...},
  "extra": {
    "appId": "local-1640809828533",
    "readMetrics": { // <-------------- Combined read metrics
      "numOutputRows": 3
    },
    "writeMetrics": { // <-------------- Combined write metrics
      "numFiles": 3,
      "numOutputBytes": 2301,
      "numOutputRows": 3,
      "numParts": 0
    }
  },
  "labels": {...},
  "timestamp": 1640809836954
}
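To illustrate how those job-level metrics could be consumed, here is a minimal Python sketch that pulls the combined row counts out of an execution-event payload shaped like the example above. The field names (`extra.readMetrics.numOutputRows`, `extra.writeMetrics.numOutputRows`) are taken from that example; how you receive the event (e.g. via a custom lineage dispatcher or by querying the Spline gateway) depends on your setup and is assumed here.

```python
import json

# A trimmed execution-event payload, mirroring the example above
# ("execPlanDetails" and "labels" omitted for brevity).
event_json = """
{
  "_created": 1640809837062,
  "durationNs": 1000845670,
  "extra": {
    "appId": "local-1640809828533",
    "readMetrics": {"numOutputRows": 3},
    "writeMetrics": {
      "numFiles": 3,
      "numOutputBytes": 2301,
      "numOutputRows": 3,
      "numParts": 0
    }
  },
  "timestamp": 1640809836954
}
"""

def combined_row_counts(event: dict) -> tuple:
    """Return (rows read, rows written) from the job-level metrics.

    These are the combined metrics for the whole execution event,
    not per-operator counts.
    """
    extra = event.get("extra", {})
    rows_read = extra.get("readMetrics", {}).get("numOutputRows", 0)
    rows_written = extra.get("writeMetrics", {}).get("numOutputRows", 0)
    return rows_read, rows_written

event = json.loads(event_json)
print(combined_row_counts(event))  # -> (3, 3)
```

Note that because the metrics hang off the event as a whole, a pipeline with multiple reads would see them merged; there is no way to attribute the counts back to a specific source or operator from this payload alone.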
abhineet13 commented 2 years ago

Thanks, combined read/write metrics will be helpful.

wajda commented 2 years ago

The combined ones are already there.