apache / skywalking

APM, Application Performance Monitoring System
https://skywalking.apache.org/
Apache License 2.0
23.7k stars 6.5k forks source link

[Feature] Add metric for analyzing cache-related span like Virtual Database #9556

Closed pg-yang closed 1 year ago

pg-yang commented 2 years ago

Search before asking

Description

Virtual Cache will analysis span which is cache-related . It will generate SLA, response time or some else metric In a word , it's similar to Virtual Database Reference virtual-database skywalking demo

Works should to be done :

Agent (focus on Java agent)

Backend

Use case

No response

Related issues

No response

Are you willing to submit a PR?

Code of Conduct

pg-yang commented 2 years ago

It's better that updating docker-compose.yml in e2e-test case rather than base-compose.yml

wu-sheng commented 2 years ago

Please don't do that. e2e only has 3U8G for the whole env, another server(redis) would impact the stability.

An optional plugin is able to activate. Is there any issue to do so? Showcase also includes optional plugin activation, such as Spring Gateway.

pg-yang commented 2 years ago

OK, I would study how to activate optional plugin in e2e

pg-yang commented 2 years ago

I wrote TopNCacheReadCommand.java like TopNDatabaseStatement , It extend TopN too
BTW TopN extend Record.

https://github.com/apache/skywalking/blob/96faed0185aceefdc30fcd2fb2d9291ff2959ca2/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/manual/database/TopNDatabaseStatement.java#L37

I use this command to get topN record

swctl --display yaml --base-url=http://127.0.0.1:12800/graphql metrics top   --start=-60m --name=top_n_cache_read_command  --service-id=R3VhdmFDYWNoZS1sb2NhbA==.0

top_n_cache_read_command is index(table)name , defined in TopNCacheReadCommand And OAP cause exception

Caused by: org.h2.jdbc.JdbcSQLSyntaxErrorException: Column "ENTITY_ID" not found; SQL statement:
select * from (select avg(latency) result,entity_id from top_n_cache_read_command where time_bucket >= ? and time_bucket <= ? and service_id=? group by entity_id)  as T order by result desc limit 5 [42122-212]

I have confirm TopN class don't define ENTITY_ID , But Metrics class defined . I could'n find other example about TopN metrics . So , I'm not sure the error is caused by swctl command or my code

The complete error stack in OAP

Caused by: org.h2.jdbc.JdbcSQLSyntaxErrorException: Column "ENTITY_ID" not found; SQL statement:
select * from (select avg(latency) result,entity_id from top_n_cache_read_command where time_bucket >= ? and time_bucket <= ? and service_id=? group by entity_id)  as T order by result desc limit 5 [42122-212]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:502) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:477) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.message.DbException.get(DbException.java:223) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.message.DbException.get(DbException.java:199) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.expression.ExpressionColumn.getColumnException(ExpressionColumn.java:244) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.expression.ExpressionColumn.optimizeOther(ExpressionColumn.java:226) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.expression.ExpressionColumn.optimize(ExpressionColumn.java:213) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.query.Select.prepareExpressions(Select.java:1170) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.table.DerivedTable.<init>(DerivedTable.java:41) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.query.Query.toTable(Query.java:1004) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.readDerivedTableWithCorrelation(Parser.java:1936) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.readTablePrimary(Parser.java:1816) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.readTableReference(Parser.java:2359) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parseSelectFromPart(Parser.java:2810) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parseSelect(Parser.java:2917) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parseQueryPrimary(Parser.java:2800) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parseQueryTerm(Parser.java:2667) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parseQueryExpressionBody(Parser.java:2646) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parseQueryExpressionBodyAndEndOfQuery(Parser.java:2639) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parseQueryExpression(Parser.java:2632) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parseQuery(Parser.java:2593) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parsePrepared(Parser.java:724) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parse(Parser.java:691) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.parse(Parser.java:661) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.command.Parser.prepareCommand(Parser.java:568) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.engine.SessionLocal.prepareLocal(SessionLocal.java:631) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.engine.SessionLocal.prepareCommand(SessionLocal.java:554) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1116) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.jdbc.JdbcPreparedStatement.<init>(JdbcPreparedStatement.java:92) ~[h2-2.1.212.jar:2.1.212]
    at org.h2.jdbc.JdbcConnection.prepareStatement(JdbcConnection.java:288) ~[h2-2.1.212.jar:2.1.212]
    at com.zaxxer.hikari.pool.ProxyConnection.prepareStatement(ProxyConnection.java:318) ~[HikariCP-3.1.0.jar:?]
    at com.zaxxer.hikari.pool.HikariProxyConnection.prepareStatement(HikariProxyConnection.java) ~[HikariCP-3.1.0.jar:?]
    at org.apache.skywalking.oap.server.library.client.jdbc.hikaricp.JDBCHikariCPClient.executeQuery(JDBCHikariCPClient.java:119) ~[classes/:?]
    at org.apache.skywalking.oap.server.storage.plugin.jdbc.h2.dao.H2AggregationQueryDAO.sortMetrics(H2AggregationQueryDAO.java:70) ~[classes/:?]
    at org.apache.skywalking.oap.server.core.query.AggregationQueryService.sortMetrics(AggregationQueryService.java:63) ~[classes/:?]
    at org.apache.skywalking.oap.query.graphql.resolver.MetricsQuery.sortMetrics(MetricsQuery.java:151) ~[classes/:?]
    at sun.reflect.GeneratedMethodAccessor95.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_321]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_321]
    at graphql.kickstart.tools.resolver.MethodFieldResolverDataFetcher.get(MethodFieldResolver.kt:280) ~[graphql-java-tools-12.0.2.jar:?]

The http info

.b...B&.POST /graphql HTTP/1.1
Host: 127.0.0.1:12800
User-Agent: Go-http-client/1.1
Content-Length: 1241
Accept: application/json; charset=utf-8
Content-Type: application/json; charset=utf-8
Accept-Encoding: gzip

{"query":"# Licensed to Apache Software Foundation (ASF) under one or more contributor\n# license agreements. See the NOTICE file distributed with\n# this work for additional information regarding copyright\n# ownership. Apache Software Foundation (ASF) licenses this file to you under\n# the Apache License, Version 2.0 (the \"License\"); you may\n# not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing,\n# software distributed under the License is distributed on an\n# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, either express or implied.  See the License for the\n# specific language governing permissions and limitations\n# under the License.\n\nquery ($condition:TopNCondition!, $duration: Duration!) {\n    result: sortMetrics(condition: $condition, duration: $duration) {\n        name\n        value\n    }\n}\n","variables":{"condition":{"name":"top_n_cache_read_command","parentService":"GuavaCache-local","normal":false,"scope":"Service","topN":5,"order":"DES"},"duration":{"start":"2022-09-13 1853","end":"2022-09-13 1923","step":"MINUTE"}}}

19:53:18.719900 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52, bad cksum 0 (->3cc2)!)
    localhost.12800 > localhost.51771: Flags [.], cksum 0xfe28 (incorrect -> 0xe9c6), seq 221, ack 2608, win 6338, options [nop,nop,TS val 2269259431 ecr 3479306895], length 0
E..4..@.@...........2..;.u....XH.....(.....
.B&..b..
19:53:18.738493 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 200, bad cksum 0 (->3c2e)!)
    localhost.12800 > localhost.51771: Flags [P.], cksum 0xfebc (incorrect -> 0xe882), seq 221:369, ack 2608, win 6338, options [nop,nop,TS val 2269259450 ecr 3479306895], length 148
E.....@.@...........2..;.u....XH...........
.B&..b..HTTP/1.1 200 OK
content-type: application/json; charset=utf-8
content-length: 196
server: Armeria/1.18.0
date: Tue, 13 Sep 2022 11:53:18 GMT
pg-yang commented 2 years ago

UI invoke readSampledRecords method in OAP . it's different with swctl

wu-sheng commented 2 years ago

Have you checked the UI about what the slow SQL statement actually requests?

There are several concepts you missed.

  1. TopN is not a metric type, it is a record.
  2. Sorted metrics, which you are using, is not sampled record. The latter is what you should use.
  3. Slow statement is nearly impossible to test in the e2e cases. Take a look at TopNWorker mechanism, it is reported perioedly, which is controlled through SW_CORE_TOPN_REPORT_PERIOD.(10min default) You most likely face timeout if you want to test this in the e2e actually. We prefer you do manually testing locally.
pg-yang commented 2 years ago

Got it . my shell is incorrect . I misunderstood the subcommand top

swctl --display yaml --base-url=http://127.0.0.1:12800/graphql metrics -h

  top                                 query the top <n> entities sorted by the specified metrics

This top don't mean TopN record. emmm

pg-yang commented 2 years ago

If set SW_CORE_TOPN_REPORT_PERIOD small enough , Doesn't e2e work correctly ?

pg-yang commented 2 years ago

Actually , In order to test TopN . I modified the code about SW_CORE_TOPN_REPORT_PERIOD on my PC .

wu-sheng commented 2 years ago

If set SW_CORE_TOPN_REPORT_PERIOD small enough , Doesn't e2e work correctly ?

It could be, just be careful about the settings, if you generate too many traffics, and this period is too small, the storage could be overload easily.

pg-yang commented 2 years ago

I have finished the most work of this issue , I think .

Did you prepare virtual-cache doc ?

https://github.com/apache/skywalking-java/pull/303#issuecomment-1240112282

wu-sheng commented 2 years ago

Did you prepare virtual-cache doc ? https://github.com/apache/skywalking-java/pull/303#issuecomment-1240112282

What do you mean I prepare? Do you want me to write that up?

pg-yang commented 2 years ago

Sorry , I thought you want prepare it . I would rename virtual-database.md to virtual-service.md , and add content for virtual cache . rename slow-db-statement.md to slow-virtual-service-statement , add virtual cache content too

wu-sheng commented 2 years ago

Sorry , I thought you want prepare it .

If you face any trouble to write one, I would help. If not, I think it should be good to be written by you. After all, this is your PR :)

pg-yang commented 2 years ago

No trouble , I misunderstood your meaning . Haha .

wu-sheng commented 2 years ago

BTW, I think CLI doesn't support querying the sampled record. You and @yswdqz (Slow SQL - MySQL server statistic https://github.com/apache/skywalking/pull/9575) noticed this issue today.

@yswdqz told me he may want to try this, but he is not familiar with golang much. How about your golang skill? @pg-yang

pg-yang commented 2 years ago

not familiar with golang either . I read https://github.com/apache/skywalking-infra-e2e simply . The code is not a lot. I debugged this project for my test case today .

wu-sheng commented 2 years ago

not familiar with golang either . I read https://github.com/apache/skywalking-infra-e2e simply . The code is not a lot. I debugged this project for my test case today .

swctl is actually from https://github.com/apache/skywalking-cli. infra-e2e is just a scheduler to push the tests forward step by step. SkyWalking has a lot of sub-projects, sorry.

wu-sheng commented 2 years ago

@mrproliu @kezhenxu94 If you have time to help on the CLI part, it would be easier.

pg-yang commented 2 years ago

SkyWalking has a lot of sub-projects

I realized it .

e2e for trigger , verify ,etc skywalking-cli for interaction with OAP .

I would like support this feature , but i'm golang newer too . It may take some time to familiarize these projects

mrproliu commented 2 years ago

@mrproliu @kezhenxu94 If you have time to help on the CLI part, it would be easier.

Sure, I can handle it. I will add metrics sorted and metrics sampled-record commands to split the queries.

wu-sheng commented 2 years ago

@pg-yang @yswdqz https://github.com/apache/skywalking-cli/pull/163 This has been done. You could update the commit ID of the swctl.

pg-yang commented 1 year ago

Ask for help . my e2e test case failure .

error message

 WARNING  failed to verify case[swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics sampled-record  --name=top_n_cache_read_command    --service-id=R3VhdmFDYWNoZS1sb2NhbA==.0 |  yq e 'to_entries | with(.[] ; .value=(.value | to_entries))' -
          ], retried 30 time(s):
 DETAILS  failed to verify the output: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics sampled-record  --name=top_n_cache_read_command    --service-id=R3VhdmFDYWNoZS1sb2NhbA==.0 |  yq e 'to_entries | with(.[] ; .value=(.value | to_entries))' -
          , error:
          failed to execute template: template: test:21:23: executing "test" at <.value>: wrong type for value; expected string; got interface {}

 SUMMARY  10 passed
          1 failed
          8 skipped

actual data

- key: 0
  value:
    - key: name
      value: getIfPresent user_1
    - key: id
      value: ""
    - key: value
      value: "5"
    - key: refid
      value: null
- key: 1
  value:
    - key: name
      value: getIfPresent user_1
    - key: id
      value: ""
    - key: value
      value: "1"
    - key: refid
      value: null
- key: 2
  value:
    - key: name
      value: getIfPresent user_1
    - key: id
      value: ""
    - key: value
      value: "1"
    - key: refid
      value: null
- key: 3
  value:
    - key: name
      value: getIfPresent user_1
    - key: id
      value: ""
    - key: value
      value: "1"
    - key: refid
      value: null
- key: 4
  value:
    - key: name
      value: getIfPresent user_1
    - key: id
      value: ""
    - key: value
      value: "0"
    - key: refid
      value: null

expect data :

{{- contains . }}
- key: 0
  value:
  {{- contains .value }}
  - key: name
    value: {{ notEmpty .value}}
  - key: id
    value:
  - key: value
    value: {{ notEmpty .value}}
  - key: refid
    value: null
  {{- end }}
- key: 1
  value:
  {{- contains .value }}
  - key: name
    value: {{ notEmpty .value}}
  - key: id
    value:
  - key: value
    value: {{ notEmpty .value}}
  - key: refid
    value: null
  {{- end }}
- key: 2
  value:
  {{- contains .value }}
  - key: name
    value: {{ notEmpty .value}}
  - key: id
    value:
  - key: value
    value: {{ notEmpty .value}}
  - key: refid
    value: null
  {{- end }}
- key: 3
  value:
  {{- contains .value }}
  - key: name
    value: {{ notEmpty .value}}
  - key: id
    value:
  - key: value
    value: {{ notEmpty .value}}
  - key: refid
    value: null
  {{- end }}
- key: 4
  value:
  {{- contains .value }}
  - key: name
    value: {{ notEmpty .value}}
  - key: id
    value:
  - key: value
    value: {{ notEmpty .value}}
  - key: refid
    value: null
  {{- end }}
{{- end }}

e2e golang stack

image

pg-yang commented 1 year ago

this command is used for querying actual data

 swctl --display yaml --base-url=http://127.0.0.1:53083/graphql metrics sampled-record --name=top_n_cache_read_command  --service-id=R3VhdmFDYWNoZS1sb2NhbA==.0  | yq e 'to_entries | with(.[] ; .value=(.value | to_entries))' -
pg-yang commented 1 year ago

Actually , I don't know the grammar such as {{- contains .value }} , {{- contains . }} ,etc . Could you tell me what's the grammar ?

wu-sheng commented 1 year ago

Check this as a good example, https://github.com/apache/skywalking/pull/9575. And e2e tool has a bug fixed.

pg-yang commented 1 year ago

I pulled e2e , and rebuild it . It's right now . Thanks .

pg-yang commented 1 year ago

The similar error caused by latest e2e.

map[interface{}]interface{}{
 -                  string("codesignature"):         string("test.apache.skywalking.e2e.profile.ProfileController.createAuthor:-1"),
 +                  string("codesignature"):         string("java.lang.Thread.sleep:-2"),
                    string("count"):                 int(1),
 -                  string("duration"):              string("<wanted gt 0, but was <interface {} Value>>"),
 +                  string("duration"):              int(0),
                    string("durationchildexcluded"): int(0),
                    string("id"):                    string("66"),
                    string("parentid"):              string("65"),
                },

https://github.com/pg-yang/skywalking/actions/runs/3065040570/jobs/4948863749

wu-sheng commented 1 year ago

I think profiling has no relative to your change?

pg-yang commented 1 year ago

yes , it's not related with my code . but I update e2e and java agent commit id to lasted

wu-sheng commented 1 year ago

It should be passed after rerun. e2e tool has been upgraded in master branch due to latest merged PR, all passed.

wu-sheng commented 1 year ago

The last step has been submitted, https://github.com/apache/skywalking-showcase/pull/77.

pg-yang commented 1 year ago

Final work . It feels like a long time

pg-yang commented 1 year ago

Other PR also failed , on similar error like

 -      string("logs"):                 nil,
 +      string("logs"):                 []interface{}{},
wu-sheng commented 1 year ago

Which PR? That error usually means logs are not collected. Maybe some tests are not stable.

wu-sheng commented 1 year ago

Final work . It feels like a long time

It wouldn't be easy to add a core feature at level one menu. Thanks for your patience.

pg-yang commented 1 year ago

Which PR? That error usually means logs are not collected. Maybe some tests are not stable.

https://github.com/apache/skywalking/pull/9632 CI was rerun by you or anyone else after it failed one time . Right ?

wu-sheng commented 1 year ago

Yes, I did.

pg-yang commented 1 year ago

There was two PR need to rerun at least , so it should be common error , maybe because some tests are not stable . Anyway , it's uncomfortable experience

wu-sheng commented 1 year ago

I don't know whether there is an error, but considering GHA provides 3u8g resources(max), keeping in stable is very hard.