apache / skywalking

APM, Application Performance Monitoring System
https://skywalking.apache.org/
Apache License 2.0
23.64k stars 6.49k forks source link

[Feature] Add metric for analyzing cache-related span like Virtual Database #9556

Closed pg-yang closed 1 year ago

pg-yang commented 1 year ago

Search before asking

Description

Virtual Cache will analysis span which is cache-related . It will generate SLA, response time or some else metric In a word , it's similar to Virtual Database Reference virtual-database skywalking demo

Works should to be done :

Agent (focus on Java agent)

Backend

Use case

No response

Related issues

No response

Are you willing to submit a PR?

Code of Conduct

pg-yang commented 1 year ago

Maybe , we should expand the meaning of Database . Virtual Database should contains cache instead of creating Virtual Cache

wu-sheng commented 1 year ago

I think they are different, basically. Cache is much faster than database.

pg-yang commented 1 year ago

They are used for saving data , and contain statement too . We provide slow threshold config for various DB

agent-analyzer:
  selector: ${SW_AGENT_ANALYZER:default}
  default:
    slowDBAccessThreshold: ${SW_SLOW_DB_THRESHOLD:default:200,mongodb:100,redis:20} 

We just add redis:20

pg-yang commented 1 year ago

And we could append dbType to service name like Mysql:192.168.1.1:3306 or Redis:192.168.1.1:6379, or add column named dbType to UI

wu-sheng commented 1 year ago

Yes, I don't think this is tech issue. I just talked from design perspective.

pg-yang commented 1 year ago

design perspective

It depends how you define Database . Design comes from Definition .

wu-sheng commented 1 year ago

Let's say in this way, cache should be related to more metrics, such as cache hint is an important metric, database doesn't care about hint rate.

pg-yang commented 1 year ago

Hint ? means cache hit ? Data store system can response a result , whether the result is empty or not .From client-server communication protocol perspective , they are all successful . If we want to get hit rate , we should decode response , It's better if the response have hit-state . However, we cannot guarantee that every cache system protocol has hit-state Finally , Hit also is appropriate for Database (such as Mysql) .

wu-sheng commented 1 year ago

You just described it from a general side, I mean for layer, and for apm, we should understand the service. In every official architecture, cache service and database service are two things. This is why we have cache and service span type from many years ago. You should not say they arevsame just because they have similar metric.

pg-yang commented 1 year ago

OK , I 'll follow current rule . For a system the concept should be consistent at least

wu-sheng commented 1 year ago

And another thing is, that some caches are not remote services. The cache could be local memory.

pg-yang commented 1 year ago

Should we analysis local span for local memory cache such as guava-cache ?

wu-sheng commented 1 year ago

Yes, we should, and we should provide a format rule to difference local and remote caches.

pg-yang commented 1 year ago

The new config look like this :

agent-analyzer:
     #......
    virtualService:
      database:
        slowAccessThreshold: ${SW_VIRTUAL_DB_SLOW_THRESHOLD:default:200,mongodb:100}
        maxSlowStatementLength: ${SW_VIRTUAL_DB_SLOW_MAX_LENGTH:default:200,mongodb:100}
      cache:
        slowAccessThreshold: ${SW_VIRTUAL_CACHE_SLOW_THRESHOLD:default:80,redis:30}
        maxSlowStatementLength: ${SW_VIRTUAL_CACHE_SLOW_MAX_LENGTH:default:200,redis:100}
        enableLocal: false # This  term  define  whether  analysis  local cache such  as  Guave-Cache ,Ecache 

Do you have any suggestion?

wu-sheng commented 1 year ago

Does the kernal support this kind of structure? I remember only 3 levels supported.

pg-yang commented 1 year ago

Yes , only 3 levels supported , I'll reorganize this config ,

pg-yang commented 1 year ago
agent-analyzer:
  selector: ${SW_AGENT_ANALYZER:default}
  default:
    # .....
    databaseSlowAccessThreshold: ${SW_VIRTUAL_DB_SLOW_THRESHOLD:default:200,mongodb:100}
    databaseMaxSlowStatementLength: ${SW_VIRTUAL_DB_SLOW_MAX_LENGTH:default:200,mongodb:100}
    cacheSlowAccessThreshold: ${SW_VIRTUAL_CACHE_SLOW_THRESHOLD:default:80,redis:30}
    cacheMaxSlowStatementLength: ${SW_VIRTUAL_CACHE_SLOW_MAX_LENGTH:default:200,redis:100}
    analysisLocalCache: false # This  term  define  whether  analysis  local cache such  as  Guave-Cache ,Ecache

BTW , I want to modify the kernel for supporting more levels in future

wu-sheng commented 1 year ago

BTW , I want to modify the kernel for supporting more levels in future

The reason we don't recommend more levels is that, nested YAML file is very hard to be controlled through system env variables, which are the key parts of how helm/operator(cloud-native) components control the OAP behaviors in the setup stage.

wu-sheng commented 1 year ago

cacheMaxSlowStatementLength: ${SW_VIRTUAL_CACHE_SLOW_MAX_LENGTH:default:200,redis:100}

Is this necessary? What kinds of statements we could get? Could you provide an example? Such as in Redis case, if this statement doesn't include key, then mostly, it is pointless.

And for cache, put usually is much slower than read. Threshold should be separated for read/write or general operation.

pg-yang commented 1 year ago

if this statement doesn't include key

Yes, sometimes we couldn't get the key from cache plugin , we could show slow command in UI at least . It's better than noting . jedis-4.x -plugin :

- {key: db.statement, value: HSET a}
- {key: db.statement, value: SET key}
- {key: db.statement, value: XADD abc}
- {key: db.statement, value: XREAD}    # only command , no key !!

Threshold should be separated for read/write or general operation.

Agree

wu-sheng commented 1 year ago

Yes, sometimes we couldn't get the key from cache plugin , we could show slow command in UI at least . It's better than noting .

Not really. If we show meaningless information, we cost the resources(CPU, storage, network, a.k.a. user's money) for nothing. In your case key: db.statement, value: XREAD should not be sampled as a slow statement, no matter how slow it is. Because in the APM use case, ppl(ops team) could be nothing to get rid of it. We have provided the latency(avg and percentile) to show the access speed.

Also, besides read and write slow access threshold, metrics should be separated too.

pg-yang commented 1 year ago

We can't distinguish between reading and writing unless we modify all cache plugin for uniform read/write flag Also , We can't know if the statement include key .

wu-sheng commented 1 year ago

OK, then the value of these metrics would be lower than we could be.

pg-yang commented 1 year ago

Now , we could distinguish write/read operation . But we could't guarantee every span contains cache.key and cache.op. Next question is which span should be analyzed ? I think : When the span contains op , we generate response time , cpm ,sla , etc for write/read operation respectively When the span contains key , we generate TopN , we could't generate Write-TopN ,Read-TopN , as the op maybe not exist . If we want Write-TopN ,Read-TopN , we should analysis the complete span (which contains op and key ) .Anyway it's meaningful that show the slow keys . Right ?

wu-sheng commented 1 year ago

Slow read and write are meaningful. We should use op and key as the samplrd statement.

We could only analysis this when relative tags existing

pg-yang commented 1 year ago

make sense to me

pg-yang commented 1 year ago

I don't know how to test with UI , there is any doc ?

wu-sheng commented 1 year ago

Do you have virtual cache layer in the backend? Once you added, you could use customer dashboard to set up dashboards for the new layer.

Docs are here. https://skywalking.apache.org/docs/main/next/en/ui/readme/

Also this PR is a good place to learn for new dashboard setup steps, https://github.com/apache/skywalking/pull/9432

pg-yang commented 1 year ago

I have already created following json config in backend

In UI
added to general.ts:

      {
        path: "/cache",
        name: "VirtualCache",
        meta: {
          title: "VirtualCache",
          exact: true,
        },
        component: () =>
          import(/* webpackChunkName: "layer" */ "@/views/Layer.vue"),
      },
      {
        path: "/cache/tab/:activeTabIndex",
        name: "VirtualCacheActiveTabIndex",
        meta: {
          notShow: true,
        },
        component: () =>
          import(/* webpackChunkName: "layer" */ "@/views/Layer.vue"),
      }

added to data.ts

export const RoutesMap: { [key: string]: string } = {
  VirtualCache: "VIRTUAL_CACHE",
  VirtualCacheActiveTabIndex: "VIRTUAL_CACHE"
}

But Please set a root dashboard for VIRTUAL_CACHE when I visit page

pg-yang commented 1 year ago

image

wu-sheng commented 1 year ago

The dashboard of virtual-cache-root should be set as root at the dashboard management page. I am not sure how you wrote these JSONs, generally, they are exported from the dashboard management page too.

pg-yang commented 1 year ago

Because of browser cache. It works now.

image

wu-sheng commented 1 year ago

Do you set the key as a or key on purpose? After we have this in the codebase, we need to update showcase. So, we need to set keys in the demo services clearly, as well as e2e

wu-sheng commented 1 year ago

Is read/write success related to error status only?

pg-yang commented 1 year ago

Is read/write success related to error status only?

Yes

cache_write_sla = from(VirtualCacheWrite.*).percent(status == true);
cache_read_sla = from(VirtualCacheRead.*).percent(status == true);

Do you set the key as a or key on purpose? After we have this in the codebase, we need to update showcase. So, we need to set keys in the demo services clearly, as well as e2e

The showcase crash now , I would provide clearly test data after writing unit test .

wu-sheng commented 1 year ago

Showcase could run locally on KinD and DockerCompose. demo is just another host.

wu-sheng commented 1 year ago

Showcase could run locally on KinD and DockerCompose. demo is just another host.

pg-yang commented 1 year ago

Showcase means skywalking-showcase repo? And should I provide e2e-test? I have consulted some e2e-test in OAP . I think my e2e-test should integrate with agent , but I could't found similar test .May I misunderstand the e2e in OAP

wu-sheng commented 1 year ago

e2e has cases with agents. What do you mean?

wu-sheng commented 1 year ago

Showcase could be done after e2e. Because e2e is checking features, because we are confident to merge. Yes, showcase is the repo.

pg-yang commented 1 year ago

e2e has cases with agents. What do you mean?

I'm confused that I couldn't found java agent options in backend project . Now, I think I have found the correct entrance about e2e .

  1. The env JAVA_TOOL_OPTIONS is set to docker image ghcr.io/apache/skywalking-java/skywalking-java
  2. I should update e2e-service-consumer , e2e-service-provider to add redis test code
  3. Then create e2e.yaml (maybe need to extend base-compose.yml) to trigger,verify my test
    Right ?
wu-sheng commented 1 year ago

You may not need to change <3>, instead, you should adjust metrics verification and service list in existing e2e. Once <2> is done, the analysis would be done automatically.

wu-sheng commented 1 year ago

You could use an in-memory cache, if a redis service is why you thought you should change base-compose.yml

pg-yang commented 1 year ago
➜  skywalking git:(feature/vitrual-cache) ✗  /Users/HOME/GolandProjects/skywalking-infra-e2e/bin/darwin/e2e run -c test/e2e-v2/cases/meter/e2e.yaml
INFO load the e2e config successfully
Pulling oap (skywalking/oap:latest)...
The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.

Continue with the new image? [yN]pull access denied for skywalking/oap, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
ERROR Local Docker compose exited abnormally whilst running docker-compose: [--env-file /Users/HOME/ideaProjects/skywalking/test/e2e-v2/script/env up -d]. exit status 1

I couldn't find skywalking/oap in docker hub . I don't understand where the image would download

wu-sheng commented 1 year ago

It is not on docker hub, they are here, https://github.com/orgs/apache/packages?repo_name=skywalking

pg-yang commented 1 year ago

It looks like download image from docker hub . I just execute above shell . May I loss some steps

wu-sheng commented 1 year ago

Do you look for this? https://github.com/apache/skywalking/blob/master/test/e2e-v2/script/docker-compose/base-compose.yml#L58

pg-yang commented 1 year ago

No , I mean the image is defined as skywalking/oap:latest in https://github.com/apache/skywalking/blob/master/test/e2e-v2/script/docker-compose/base-compose.yml#L20

services:
  oap:
    image: skywalking/oap:latest
    expose:
      - 11800
      - 12800
      - 10051
      - 5005

It should download from docker hub unless some shell (or setup) specify a docker repo . So I guess I loss some shell/setup .

wu-sheng commented 1 year ago

Latest is your local build based on the latest code.

https://github.com/apache/skywalking/blob/b3c7658a0eb5a5acf7b27c8e3fc921589accf922/.github/workflows/skywalking.yaml#L177

pg-yang commented 1 year ago

You could use an in-memory cache, if a redis service is why you thought you should change base-compose.yml

in-memory cache (e.g. guava-cache, ehcache) plugin is optional . I would add redis docker image to base-compose.yml. It's OK ?