apache / inlong

Apache InLong - a one-stop, full-scenario integration framework for massive data
https://inlong.apache.org/
Apache License 2.0

[Feature][Sort] Enhanced source metric instrumentation for InLong Sort Flink Connector #11129

Closed PeterZh6 closed 1 month ago

PeterZh6 commented 2 months ago

Description

Parent Issue: [Feature][Umbrella] Tencent Rhino-bird: Sort metric monitoring and reporting #10961

Description: This feature focuses on SourceMetric only. It introduces enhanced metric instrumentation to improve observability within the InLong Sort Flink Connector, specifically for the Postgres-CDC connector. The newly added metrics in org.apache.inlong.sort.base.metric.SourceExactlyMetric cover deserialization processes, snapshot states, and checkpoint completion.

Key Metric Categories:

  1. Serialization/Deserialization Metrics:

    • Success/Error Counters: Track successful and failed deserialization attempts (numDeserializeSuccess, numDeserializeError).
    • Latency Gauges: Measure the time taken for both serialization and deserialization (deserializeTimeLag, serializeTimeLag).

  2. SnapshotState Metrics:

    • Creation/Error Counters: Monitor the number of snapshots created and errors encountered during snapshot operations (numSnapshotCreate, numSnapshotError).

  3. NotifyComplete Metrics:

    • Completed Snapshots Counter: Track the number of completed checkpoints (numCompletedSnapshots).
    • Snapshot-to-Checkpoint Latency: Record the time between snapshot creation and checkpoint completion (snapshotToCheckpointTimeLag).
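
To make the three categories concrete, here is a minimal sketch of how such a metric set could be grouped in plain Java. The class `SourceMetricSketch` and its `export()` method are hypothetical and written only for illustration; they are not the actual `SourceExactlyMetric` API, and the real class registers its counters and gauges with Flink's metric system rather than holding bare fields. Only the metric names are taken from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the metric set described above (not the real SourceExactlyMetric). */
class SourceMetricSketch {
    // 1. Serialization/deserialization: counters plus last-observed latencies (ms)
    final AtomicLong numDeserializeSuccess = new AtomicLong();
    final AtomicLong numDeserializeError = new AtomicLong();
    volatile long deserializeTimeLag;
    volatile long serializeTimeLag;

    // 2. SnapshotState: snapshots created and snapshot failures
    final AtomicLong numSnapshotCreate = new AtomicLong();
    final AtomicLong numSnapshotError = new AtomicLong();

    // 3. NotifyComplete: completed checkpoints and snapshot-to-checkpoint latency (ms)
    final AtomicLong numCompletedSnapshots = new AtomicLong();
    volatile long snapshotToCheckpointTimeLag;

    /** Returns a point-in-time view of all values, keyed by metric name. */
    Map<String, Long> export() {
        Map<String, Long> view = new LinkedHashMap<>();
        view.put("numDeserializeSuccess", numDeserializeSuccess.get());
        view.put("numDeserializeError", numDeserializeError.get());
        view.put("deserializeTimeLag", deserializeTimeLag);
        view.put("serializeTimeLag", serializeTimeLag);
        view.put("numSnapshotCreate", numSnapshotCreate.get());
        view.put("numSnapshotError", numSnapshotError.get());
        view.put("numCompletedSnapshots", numCompletedSnapshots.get());
        view.put("snapshotToCheckpointTimeLag", snapshotToCheckpointTimeLag);
        return view;
    }
}
```

Counters only ever increase, while the lag fields hold the most recent observation, which mirrors the counter/gauge split in the list above.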

Implementation Details:

The metrics are integrated into the Postgres-CDC connector (located in inlong-sort/sort-flink/sort-flink-v1.15/sort-connectors/postgres-cdc) and can be adapted for use in other connectors. Specific changes are made in key methods like deserialize(), snapshotState(), and notifyCheckpointComplete() to gather detailed performance and error data.
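
The hook points named above can be sketched roughly as follows. This is an assumption-laden illustration, not the connector's code: `InstrumentedSourceSketch`, the injected `clock`, and the `decoder` parameter are all hypothetical, and the method signatures are simplified so the example runs without Flink on the classpath. The snapshot error counter and serialization lag are omitted for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Illustrative only: mimics the three instrumented hooks with plain Java types. */
class InstrumentedSourceSketch {
    final AtomicLong numDeserializeSuccess = new AtomicLong();
    final AtomicLong numDeserializeError = new AtomicLong();
    final AtomicLong numSnapshotCreate = new AtomicLong();
    final AtomicLong numCompletedSnapshots = new AtomicLong();
    volatile long deserializeTimeLag;          // ms spent in the last successful decode
    volatile long snapshotToCheckpointTimeLag; // ms from snapshot to checkpoint completion

    private final LongSupplier clock;          // hypothetical time source, injectable for tests
    private final Map<Long, Long> snapshotStartMs = new ConcurrentHashMap<>();

    InstrumentedSourceSketch(LongSupplier clock) {
        this.clock = clock;
    }

    /** Wraps a decoder call, counting the outcome and timing successful decodes. */
    <T> T deserialize(byte[] raw, Function<byte[], T> decoder) {
        long start = clock.getAsLong();
        try {
            T record = decoder.apply(raw);
            numDeserializeSuccess.incrementAndGet();
            deserializeTimeLag = clock.getAsLong() - start;
            return record;
        } catch (RuntimeException e) {
            numDeserializeError.incrementAndGet();
            throw e; // the failure is counted, then surfaced unchanged
        }
    }

    /** Remembers when the snapshot for this checkpoint was taken. */
    void snapshotState(long checkpointId) {
        snapshotStartMs.put(checkpointId, clock.getAsLong());
        numSnapshotCreate.incrementAndGet();
    }

    /** Computes the snapshot-to-checkpoint lag if the snapshot time is known. */
    void notifyCheckpointComplete(long checkpointId) {
        Long start = snapshotStartMs.remove(checkpointId);
        if (start != null) {
            snapshotToCheckpointTimeLag = clock.getAsLong() - start;
        }
        numCompletedSnapshots.incrementAndGet();
    }
}
```

Keying the snapshot timestamps by checkpoint ID is one way to pair each completion notification with its own snapshot when several checkpoints are in flight; whether the actual connector does this is not stated in this issue.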

This feature improves monitoring of the connector by exposing serialization/deserialization success rates and latency, snapshot creation and failure counts, and the time each checkpoint takes to complete.

Use case

No response


github-actions[bot] commented 2 months ago

Hello @PeterZh6, thank you for opening your first issue in InLong 🧡 We will respond as soon as possible ⏳ If this is a bug report, please provide screenshots or error logs for us to reproduce your issue, so we can do our best to fix it. If you have any questions in the meantime, you can also ask us on the InLong Discussions 🔍