commercetools / mongodbatlas_exporter

MongoDB Atlas exporter for Prometheus
MIT License
13 stars 8 forks source link

ProcessRegisterer #14

Closed Freyert closed 2 years ago

Freyert commented 2 years ago

This would close #12.

This has already eliminate the issues with "not_found/not_registered" metrics such as OPLOG_LAG which only appears on secondaries.

I also tried it out on sharded clusters and saw that it correctly scraped mongos instances without any "not_found/not_registered" issues. (Since the original implementation tried to find all metrics on all hosts it would make a lot of errors on hosts like mongos that only have a subset of metrics).

👀 It is also MUCH faster at scraping. This could become an issue since we have to do an API call per process/disk.

Remaining items:

Freyert commented 2 years ago

I think this is at a good place to focus on testing.

Rate limiting and dynamically removing nodes should be deferred to other PRs.

I would also like to work on having this be a NET reduction in code. So if I can drop 176 lines or more that'd be good.

Freyert commented 2 years ago

Reworked the tests to be a little bit easier to extend.

getExpectedDescs ingests test inputs for Describe functions and outputs the expected descriptions list. You can see one of these inputs in the variable commontestExpectedDescs which is just a [][]string. This removes the need for very large functions like https://github.com/commercetools/mongodbatlas_exporter/pull/14/files#diff-2995123d27fef611eb35258c56656158bb6c0431ac07ed2b0be0fb72828ab4d5L53-L78.

There is now a TestProcessDescribe which tests the additional metrics coming from Processes and Disks that were not tested by TestDescribe originally. TestProcessDescribe makes use of most of the functionality from TestDescribe but modifies the expected descriptions accordingly.


TestDescribe passes now, but I have a missing metric in TestProcessDescribe:

--- FAIL: TestProcessDescribe (0.00s)
    /Users/fultonbyrne/Desktop/mongodbatlas_exporter/collector/common_test.go:138: actual missing fqname mongodbatlas_processes_stats_query_executor_scanned_ratio
FAIL
FAIL    mongodbatlas_exporter/collector 0.153s
FAIL

The tests have captured some mistakes I was making with labels that lead me to change BuildPromMetrics to have no caller as it was always using Base.PromConstLabels instead of the Caller.PromConstLabels I would want. This has been fixed, but I'm just pointing out the tests are important.

Freyert commented 2 years ago

Tests are passing. Test inputs are still fairly confusing. Main issue is it's difficult to tell if I've changed labels between this PR and the main branch.

Will scrutinize current metrics for a compare and contrast.

Most of the code add is probably from all the reflect I'm using to make the test outputs easy to understand. Perhaps ginkgo and gomega have better diff outputs?

Freyert commented 2 years ago

labels should be consistent now and I added tests to make it more clear in the future.

Freyert commented 2 years ago

OK, problem with the process info label. It gets registered with the replica's TYPE when the collector is registered. So this never changes. So we wouldn't see if the member transitioned between primary/secondary.

My vote is to look for a numeric version of that metric and just remove the label if we can find it. There isn't a numeric version of the metric. We'd have to know all the enumerations of TYPE and assign values to them.

https://docs.atlas.mongodb.com/reference/api/processes-get-one/

REPLICA_PRIMARY REPLICA_SECONDARY RECOVERING SHARD_MONGOS SHARD_CONFIG SHARD_STANDALONE SHARD_PRIMARY SHARD_SECONDARY NO_DATA

We could also change the InfoGauge to an InfoGaugeVec and have the type be a variable label, but that's not pretty either 😞 . I think that would be the least pretty and the enumeration may be the most pretty.

Freyert commented 2 years ago

We are collecting http error codes from the reigsterer: https://github.com/commercetools/mongodbatlas_exporter/pull/14/files#diff-db5c03474e9176ff2c39071291d84e444a5da02804126e6bdc417a1fafba95b2R54

It's the only reason so far that HTTPError is actually needed. I'm hoping there's a better way to instrument an HTTP Client.

I've created an issue to follow up on this PR with a more efficient and global way of instrumenting a count of HTTP status codes from the client. https://github.com/commercetools/mongodbatlas_exporter/issues/16

Don't want to do it here as this is already a very big and complex PR.