Closed by @aokolnychyi 10 months ago
I know @raptond has done some thinking. Could you share your thoughts?
cc @mehtaashish23 @rominparekh
Looks like @fbocse and @rdblue discussed metrics library integration on this thread [1], where this was the last thing suggested:
I'm not sure what I would want from DropWizard metrics. Most of the things we want to time happen just a few times in a job and are specific to a table.
For example, we want to know how long a particular query takes to plan. That is dependent on how large the table is and what filters were applied. That's why we've added a way to register listeners that can log those scan events for later analysis.
I think I would continue with this approach rather than adding a metrics library. The events that we want to time have to be grouped by table and need to be gathered from many runs of a job or a query. So it makes more sense to improve the events that are generated and the data those events contain.
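The listener approach described above can be sketched with a minimal event registry. Note that the names below (`ScanEvents`, `emit`) are illustrative stand-ins, not Iceberg's actual `Listeners`/`ScanEvent` classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of the listener pattern discussed above: scan events carry
// per-table context (table name, snapshot, planning time) so they can be
// logged and analyzed later, rather than pushed into a metrics library.
public class ScanEvents {
  public record ScanEvent(String tableName, long snapshotId, long planningMillis) {}

  private static final List<Consumer<ScanEvent>> listeners = new ArrayList<>();

  // Callers register a listener once; every emitted event is fanned out to it.
  public static void register(Consumer<ScanEvent> listener) {
    listeners.add(listener);
  }

  public static void emit(ScanEvent event) {
    for (Consumer<ScanEvent> l : listeners) {
      l.accept(event);
    }
  }

  public static void main(String[] args) {
    // Log scan events for later per-table analysis.
    register(e -> System.out.println(
        e.tableName() + " planned in " + e.planningMillis() + " ms"));
    emit(new ScanEvent("db.events", 42L, 137L));
  }
}
```

The point of the design is visible here: the event carries enough context (table, snapshot) that aggregation across many job runs can happen downstream, which a global metrics registry would not give you.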
@aokolnychyi @raptond Do we feel this approach addresses this or are there gaps?
Seems like we still need a way to emit non-timing metrics like #snapshots, #planfiles, #plantasks, #manifests, and partition distribution across manifests.
@prodeezy, when do you want to emit those metrics? At commit time?
This seems a bit related to the CreateSnapshotEvent that @rdsr added recently. That could be updated to include commit timing information.
The other fields, like number of snapshots and number of manifests, seem like something you'd get live from the metadata tables. I'm not sure it makes sense to turn them into metrics, although we could keep track of them in snapshot summaries the way we do total number of records and total number of files. Then the CreateSnapshotEvent would send those metrics to listeners.
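Tracking such totals in snapshot summaries, as suggested, amounts to carrying a small string map forward with each commit. A sketch follows; `total-records` and `total-data-files` are real Iceberg summary keys, while `total-manifests` and the `nextSummary` helper are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: maintain running totals in a snapshot summary map, the way Iceberg
// tracks total-records / total-data-files, extended with a hypothetical
// total-manifests key so listeners get metadata size for free at commit time.
public class SnapshotSummaries {
  public static Map<String, String> nextSummary(
      Map<String, String> previous, long addedRecords, long addedFiles, long manifests) {
    long prevRecords = Long.parseLong(previous.getOrDefault("total-records", "0"));
    long prevFiles = Long.parseLong(previous.getOrDefault("total-data-files", "0"));

    Map<String, String> summary = new HashMap<>();
    summary.put("total-records", Long.toString(prevRecords + addedRecords));
    summary.put("total-data-files", Long.toString(prevFiles + addedFiles));
    summary.put("total-manifests", Long.toString(manifests)); // hypothetical key
    return summary;
  }

  public static void main(String[] args) {
    Map<String, String> s1 = nextSummary(Map.of(), 100, 4, 2);
    Map<String, String> s2 = nextSummary(s1, 50, 1, 3);
    System.out.println(s2.get("total-records")); // prints 150
  }
}
```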
For #planfiles and #plantasks, what do you mean? For a given scan, the number of files and tasks? If so, then I think we should update ScanEvent to send them.
For #planfiles and #plantasks, what do you mean? For a given scan, the number of files and tasks? If so, then I think we should update ScanEvent to send them.
Yes, and that makes sense to me. I'll look into bubbling up more stats. We have some more metrics we want to bubble up, but from what I can see, either CreateSnapshotEvent or ScanEvent should cover them (if they don't already).
Having said that, I'll let @aokolnychyi and @raptond chime in as well, as they have done some work around metrics reporting on their end and might have additional thoughts. I'd also like to know what stats other folks are maintaining on their end (if outside Iceberg); maybe we can bring those into the SnapshotSummary / ScanEvent.
Posting the conversation from the last sync:
Gautam: How are others monitoring dataset health? Discussing additional metrics about datasets, like number of manifests, number of data files, etc.
Ryan: What is the use case?
Gautam: We want to know when tables are misconfigured or need attention. For example, when commits take too long due to retries, or when scan planning takes a long time. Are current stats enough?
Ryan: That would be useful; we don't currently do much to keep track of those. I think the current approach of emitting events to listeners is good, but we will need to add more data to those events, like how long a commit or planning run took. We can also start tracking the total number of metadata files in snapshot summaries to emit with the events for context.
Edgar: Should we add similar listeners to the catalogs?
Ryan: That's a good idea, so you could know how long it took to create or drop a table.
So it seems like we want to stick with registering listeners. If one wants, they can use a Codahale metrics client to report/publish any metrics. Does this sound good, @raptond? In that regard, we need some additional metrics to be bubbled up in the snapshot summary, scan events, etc. I'll create an issue for that.
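Bridging listeners to a metrics client boils down to a listener that translates events into named counters/timers. To stay self-contained, the sketch below uses a hand-rolled registry in place of Codahale's `MetricRegistry` (which provides the same counter/timer primitives plus reporters); the metric names are made up:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Stand-in for a Codahale MetricRegistry: an Iceberg listener body would call
// inc() when an event arrives, and a reporter would periodically publish the
// counters to whatever monitoring backend is in use.
public class ListenerMetrics {
  private static final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  public static void inc(String name) {
    counters.computeIfAbsent(name, k -> new LongAdder()).increment();
  }

  public static long count(String name) {
    LongAdder adder = counters.get(name);
    return adder == null ? 0 : adder.sum();
  }

  public static void main(String[] args) {
    // e.g. called from a ScanEvent listener, keyed per table:
    inc("iceberg.scans.db.events");
    inc("iceberg.scans.db.events");
    System.out.println(count("iceberg.scans.db.events")); // prints 2
  }
}
```

This keeps the metrics-library choice out of Iceberg itself, which is the conclusion the thread reaches: Iceberg emits events, and each deployment decides how to publish them.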
We have the ScanMetrics API now. Closing.
I think it makes sense to consider using a metrics lib to measure things like commit time, job planning time, time to connect to the metastore, etc. Codahale seems to be a very popular option for this.
There are a couple of design decisions we need to make:
1. Spark has its own MetricsSystem with its own sources and sinks. One option is to create only a custom metrics source for Spark and use the existing logic to register it. The main question is whether this approach is flexible enough and will work with arbitrary query engines. As opposed to this, Iceberg can have its own MetricsSystem and its own way to register and report metrics.
2. Do we want to measure the time spent in planFiles/planTasks? If yes, we will have to modify the iterators. Alternatively, we can say that metrics should be collected at the data source level. However, this restricts us in what metrics we can actually collect.
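The "modify the iterators" option from point 2 can be sketched generically: wrap the planning iterable so time spent producing each task is accumulated as a metric. This is an illustrative wrapper, not Iceberg's actual CloseableIterable machinery:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a wrapper iterable that accumulates the time spent inside
// hasNext()/next(), so lazy planning work is measured where it happens.
public class TimedIterable<T> implements Iterable<T> {
  private final Iterable<T> wrapped;
  private final AtomicLong elapsedNanos = new AtomicLong();

  public TimedIterable(Iterable<T> wrapped) {
    this.wrapped = wrapped;
  }

  public long elapsedNanos() {
    return elapsedNanos.get();
  }

  @Override
  public Iterator<T> iterator() {
    Iterator<T> inner = wrapped.iterator();
    return new Iterator<T>() {
      @Override
      public boolean hasNext() {
        long start = System.nanoTime();
        try {
          return inner.hasNext();
        } finally {
          elapsedNanos.addAndGet(System.nanoTime() - start);
        }
      }

      @Override
      public T next() {
        long start = System.nanoTime();
        try {
          return inner.next();
        } finally {
          elapsedNanos.addAndGet(System.nanoTime() - start);
        }
      }
    };
  }

  public static void main(String[] args) {
    TimedIterable<Integer> tasks = new TimedIterable<>(List.of(1, 2, 3));
    int count = 0;
    for (int task : tasks) {
      count++;
    }
    System.out.println(count + " tasks, " + tasks.elapsedNanos() + " ns spent planning");
  }
}
```

This illustrates why the iterators must be modified: because planning is lazy, timing the call that returns the iterable would miss nearly all of the work.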