fluxninja / aperture

Rate limiting, caching, and request prioritization for modern workloads
https://docs.fluxninja.com
Apache License 2.0

Ensure correctness of OLAP telemetry for multi-extractor Classifiers #564

Open tanveergill opened 2 years ago

tanveergill commented 2 years ago

krdln commented 2 years ago

Related: #534

krdln commented 2 years ago

After discussion with @DariaKunoichi – some more things to polish regarding classifier error-handling:

  1. Categorize errors into different kinds and perhaps treat them slightly differently

    1. context timeout – we should just return early, without any attempt to log, etc. Perhaps bump a stats counter?
    2. errors caused by "invalid input" (e.g. tried to extract a header that is missing)
    3. errors caused by a problem with the Rego itself (not sure if we can distinguish these from item 2)
    4. "internal errors" like https://github.com/fluxninja/aperture/blob/8228d34912ddafcd9b3725dae814485a9189b271/pkg/policies/dataplane/resources/classifier/classifier.go#L109 – they signify a breakage of some internal invariant and are caused by neither the policy nor the traffic; users should report an issue.

    Right now we treat all of them the same way ("log and add to checkresponse"), which is not ideal.

  2. Double-check whether a multi-extractor classifier can handle a "partial map", e.g. some extractors succeeded but some failed. If it can't, that's kinda sad, as an error in one extractor could basically "disable" the other extractors.