eclipse-vertx / vertx-sql-client

High performance reactive SQL Client written in Java
Apache License 2.0
892 stars 201 forks source link

Inconsistent Tracing Results with Different Vert.x Routing Approaches #1171

Closed Olamshin closed 2 years ago

Olamshin commented 2 years ago

Version

vert.x 4.2.5 vertx-opentelemetry 4.2.5

Context

I am seeing an issue that is an amalgam of at least three libraries. I will start here.

When I wire routes in vert.x in two different ways, I get different numbers of spans(opentelemetry) exported. The REST call to the reproducer should make two database calls to postgres, a COUNT and a SELECT ALL. Three spans should be generated by OpenTelemetry; the COUNT, SELECT ALL and the GET. I am getting [SELECT ALL, GET] or [SELECT ALL, GET, COUNT] depending on how the route is wired. When GET http://localhost:8888/api/v1/bookz is called, two spans below are generated. These outputs are logged to the console for ease.

Mar 31, 2022 1:03:56 PM io.vertx.ext.web.handler.impl.LoggerHandlerImpl
INFO: 0:0:0:0:0:0:0:1 - - [Thu, 31 Mar 2022 18:03:56 GMT] "GET /api/v1/bookz HTTP/1.1" 200 6118 "-" "PostmanRuntime/7.29.0"
Mar 31, 2022 1:03:59 PM io.opentelemetry.exporter.logging.LoggingSpanExporter export
INFO: 'Query' : dade453425d22eee5c9b7a5fa423d280 444cbbf1ba3ceca0 CLIENT [tracer: io.vertx:] AttributesMap{data={span.kind=client, db.statement=SELECT * FROM books LIMIT $1 OFFSET $2, db.user=postgres, db.type=sql, db.instance=books, peer.address=localhost:5432}, capacity=128, totalAddedValues=6}
Mar 31, 2022 1:03:59 PM io.opentelemetry.exporter.logging.LoggingSpanExporter export
INFO: 'GET' : dade453425d22eee5c9b7a5fa423d280 18db58073b1e2209 SERVER [tracer: io.vertx:] AttributesMap{data={http.status_code=200, http.method=GET, http.url=http://localhost:8888/api/v1/bookz}, capacity=128, totalAddedValues=3}

When GET http://localhost:8888/api/v1/books is called, three spans are generated.

Mar 31, 2022 1:16:45 PM io.vertx.ext.web.handler.impl.LoggerHandlerImpl
INFO: 0:0:0:0:0:0:0:1 - - [Thu, 31 Mar 2022 18:16:45 GMT] "GET /api/v1/books HTTP/1.1" 200 6118 "-" "PostmanRuntime/7.29.0"
Mar 31, 2022 1:16:49 PM io.opentelemetry.exporter.logging.LoggingSpanExporter export
INFO: 'Query' : 9bca2bfa2bffa8bef0453900e63970fa dcd2332751985d82 CLIENT [tracer: io.vertx:] AttributesMap{data={span.kind=client, db.statement=SELECT * FROM books LIMIT $1 OFFSET $2, db.user=postgres, db.type=sql, db.instance=books, peer.address=localhost:5432}, capacity=128, totalAddedValues=6}
Mar 31, 2022 1:16:49 PM io.opentelemetry.exporter.logging.LoggingSpanExporter export
INFO: 'GET' : 9bca2bfa2bffa8bef0453900e63970fa 810f91bedc2d2d7c SERVER [tracer: io.vertx:] AttributesMap{data={http.method=GET, http.url=http://localhost:8888/api/v1/books}, capacity=128, totalAddedValues=2}
Mar 31, 2022 1:16:49 PM io.opentelemetry.exporter.logging.LoggingSpanExporter export
INFO: 'Query' : 9bca2bfa2bffa8bef0453900e63970fa 5558f0098e060141 CLIENT [tracer: io.vertx:] AttributesMap{data={span.kind=client, db.statement=SELECT COUNT(*) AS total FROM books, db.user=postgres, db.type=sql, http.status_code=200, db.instance=books, peer.address=localhost:5432}, capacity=128, totalAddedValues=7}

My investigation so far shows that the call chain is incorrect when two spans are exported instead of three. A span has a scope i.e. a beginning and an end. The span for the GET should encapsulate the spans for the COUNT & SELECT ALL. The issue is that the span for GET is ended before the span for COUNT is concluded. The span for COUNT is discarded since its parent span is already concluded. According to code in QueryResultBuilder, the handler for the QueryResultBuilder is executed before metrics and tracing is terminated. I believe that due to the way /bookz is routed, RoutingContext.response.end is being called before the span for COUNT is concluded.

Please advise as I am new to vert.x, I am not sure how to correct the call chain to return three spans while preserving the RoutingContext.response.end call.

Do you have a reproducer?

Steps to reproduce

  1. run docker-compose up to start containers for the database
  2. mvn package -DskipTests to build the package
  3. java -jar target/vertx-4-reactive-rest-api-0.1-SNAPSHOT-fat.jar to start the service
  4. call GET http://localhost:8888/api/v1/books
  5. observe console output
  6. call GET http://localhost:8888/api/v1/bookz
  7. observe console output
vietj commented 2 years ago

thanks for your investigation

vietj commented 2 years ago

I tried the reproducer and I don't get the same output as yours

Apr 01, 2022 10:37:21 AM io.vertx.ext.web.handler.impl.LoggerHandlerImpl
INFO: 127.0.0.1 - - [Fri, 1 Apr 2022 08:37:21 GMT] "GET /api/v1/books HTTP/1.1" 200 6118 "-" "curl/7.64.1"
Apr 01, 2022 10:37:42 AM io.vertx.ext.web.handler.impl.LoggerHandlerImpl
INFO: 127.0.0.1 - - [Fri, 1 Apr 2022 08:37:42 GMT] "GET /api/v1/bookz HTTP/1.1" 200 6118 "-" "curl/7.64.1"
Olamshin commented 2 years ago

My bad. I have corrected the reproducer, please pull the latest master and try again. I did not set the entrypoint to the Main method.

julianladisch commented 2 years ago

Calling the replyHandler after the transaction has been closed fixes the issue for me.

    public void readAll(String p, String l, Handler<AsyncResult<BookGetAllResponse>> replyHandler) {
        dbClient.withTransaction(
                        connection -> {
                            final int page = QueryUtils.getPage(p);
                            final int limit = QueryUtils.getLimit(l);
                            final int offset = QueryUtils.getOffset(page, limit);

                            return bookRepository.count(connection)
                                    .compose(total ->
                                            bookRepository.selectAll(connection, limit, offset)
                                                    .map(result -> {
                                                        final List<BookGetByIdResponse> books = result.stream()
                                                                .map(BookGetByIdResponse::new)
                                                                .collect(Collectors.toList());

                                                        return new BookGetAllResponse(total, limit, page, books);
                                                    })
                                    );
                        })
        .onComplete(replyHandler::handle);
    }
Olamshin commented 2 years ago

@vietj can you confirm that the reproducer works for you now?

vietj commented 2 years ago

I haven't gotten back to this issue yet

vietj commented 2 years ago

have you tried with an actual open telemetry server to see the final result ?

I believe that the current behaviour is not un-usual and open telemetry server are handling this case and merge result when they are provided out of order.

vietj commented 2 years ago

I think there might be a bug in vertx-otel actually

vietj commented 2 years ago

I did add my own logger and I can read:

Query id=fab790ed6a74356a parentId=6c3ec15165a135b1 attrs=AttributesMap{data={db.user=postgres, db.type=sql, db.instance=books, span.kind=client, db.statement=SELECT * FROM books LIMIT $1 OFFSET $2, peer.address=localhost:5432}, capacity=128, totalAddedValues=6}
GET id=2245993aa362a05d parentId=0000000000000000 attrs=AttributesMap{data={http.url=http://localhost:8888/api/v1/books, http.method=GET}, capacity=128, totalAddedValues=2}
Query id=6c3ec15165a135b1 parentId=2245993aa362a05d attrs=AttributesMap{data={db.user=postgres, db.type=sql, http.status_code=200, db.instance=books, span.kind=client, db.statement=SELECT COUNT(*) AS total FROM books, peer.address=localhost:5432}, capacity=128, totalAddedValues=7}

and I think this is wrong, the Query should all refer to the GET as parent and instead one of the Query refers to the other Query.

I think it's a bug in vertx-otel, I'll try to provide a fix soon.

vietj commented 2 years ago

https://github.com/eclipse-vertx/vertx-tracing/issues/41

vietj commented 2 years ago

I closed to invalid in favour of https://github.com/eclipse-vertx/vertx-tracing/issues/41

Olamshin commented 2 years ago

@vietj Thank you! Using the snapshot of vertx-otel fixed this issue and some other context propagation issues i couldn't define properly.