an-tao / trantor

a non-blocking I/O tcp network lib based on c++14/17
Other
398 stars 134 forks source link

Trantor async API changes #313

Closed tripleslash closed 10 months ago

tripleslash commented 10 months ago
tripleslash commented 10 months ago

@an-tao I wasn't sure if this was the proper way to prevent trantor from closing the connection in case we're sending an async stream

an-tao commented 10 months ago

@tripleslash Thanks for your patch, but IMHO, The kick-off feature should not be disabled internally, instead, users should ensure that it is not disconnected by setting the idle_connection_timeout configuration option to a reasonable value. I've added the boolean flag in my PR according your advice.

tripleslash commented 10 months ago

@an-tao isnt it missing an extendLife Call in that case? The asycstream send call currently doesnt do that anywhere

tripleslash commented 10 months ago

@an-tao but i still disagree. For example i added a Log streaming endpoint with this feature but in Release Mode the log can be very quiet for several minutes and i cant guarantee any time for when new output would happen. Maybe we could add a config parameter that disables kickoff for async streams?

an-tao commented 10 months ago

We can change the API to:

virtual AsyncStreamPtr sendAsyncStream(bool disableKickoff=false) = 0;

and if the flag is true, we should enable kickoff again when the stream is closed. After mine is merged you could create a new PR for this.

tripleslash commented 10 months ago

Not sure how reenabling the kickoff works what im doing right now is removing the entry and Timing wheel ptr, how can I reset them?

An Tao @.***> schrieb am Fr., 22. Dez. 2023, 08:12:

We can change the API to:

virtual AsyncStreamPtr sendAsyncStream(bool disableKickoff=false) = 0;

and if the flag is true, we should enable kickoff again when the stream is closed. After mine is merged you could create a new PR for this.

— Reply to this email directly, view it on GitHub https://github.com/an-tao/trantor/pull/313#issuecomment-1867323920, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABATEP3H3HCM6UOCC3HIVFDYKUXG3AVCNFSM6AAAAABA6QAKSWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRXGMZDGOJSGA . You are receiving this because you were mentioned.Message ID: @.***>

tripleslash commented 10 months ago

@an-tao i changed your log message to a single line and made it warning level - this should make it easier to tell when many requests are happening at the same time (and I suppose something went wrong if the buffer still has unsent content when the tcp connection is closed)

tripleslash commented 10 months ago

@an-tao isnt it missing an extendLife Call in that case? The asycstream send call currently doesnt do that anywhere

And what about that case? In case we use disableKickoff = false, then the AsyncStream send() method should cause an extendLife call somewhere down the chain?

tripleslash commented 10 months ago

@an-tao Can you look over my latest commit and check if I missed something when reenabling this? I tried to implement the disableKickoff parameter feature as you said

an-tao commented 10 months ago

@an-tao Can you look over my latest commit and check if I missed something when reenabling this? I tried to implement the disableKickoff parameter feature as you said

LGTM And I changed the position calling extendLife(), you could rebase it and remove the high layer calling.

tripleslash commented 10 months ago

@an-tao extendLife is only called from within the loop, right? So what is the convention here? Shouldn't it call timingWheel->insertEntryInloop then?

an-tao commented 10 months ago

@an-tao extendLife is only called from within the loop, right? So what is the convention here? Shouldn't it call timingWheel->insertEntryInloop then?

Yes, It's increase the reference count of the kickoff-entry. I put it in three position: in the read Callback,in the writeRow method, and after calling the sendfile() syscall. new edit: After some investigation, I thought extendLife() might be called from a different loop to the timing wheel.

an-tao commented 10 months ago

Let's catch the logs when TLS hanging happens. @tripleslash

tripleslash commented 10 months ago

Let's catch the logs when TLS hanging happens. @tripleslash

I'll try to do some more testing later

tripleslash commented 10 months ago

@an-tao

20231223 13:40:52.998560 UTC 390 DEBUG [~TcpConnectionImpl] write node list size: 1 first node is file? 1 first node is stream? 0 first node is async? 0 first node size: 501866 buffered TLS data size: 12058 - TcpConnectionImpl.cc:108

This is what the log prints when the error occurs

an-tao commented 10 months ago

@an-tao

20231223 13:40:52.998560 UTC 390 DEBUG [~TcpConnectionImpl] write node list size: 1 first node is file? 1 first node is stream? 0 first node is async? 0 first node size: 501866 buffered TLS data size: 12058 - TcpConnectionImpl.cc:108

This is what the log prints when the error occurs

This means there is some data in the tls provider buffer but the connection doesn't enable writing or the provider can't send the buffered data. please find related logs before the error. such as:

image

image

tripleslash commented 10 months ago

The line I've sent you is the only thing that is being logged to drogon.log, meaning those line don't appear anywhere in the log

an-tao commented 10 months ago

The line I've sent you is the only thing that is being logged to drogon.log, meaning those line don't appear anywhere in the log

Can you find any clues in the logs that might be relevant? any errors reported by the connection or tls provider?

an-tao commented 10 months ago

What's the idle timeout did you set? is it a mistake kick-off? How long did it take to send this large size response? Is it possible that it is an underlying network problem?

tripleslash commented 10 months ago

The line I've sent you is the only thing that is being logged to drogon.log, meaning those line don't appear anywhere in the log

Can you find any clues in the logs that might be relevant? any errors reported by the connection or tls provider?

Nothing unfortunately. I will compile this later with TRACE logging on to see if that logs any useful information

What's the idle timeout did you set? is it a mistake kick-off? How long did it take to send this large size response? Is it possible that it is an underlying network problem?

It is not very large, the file is only 395 kb in size. It is part of a javascript bundle file and when my angular UI loads many such javascript bundle files are loaded in parallel by the browser (Chrome Desktop) and sometimes one of the requests fails with the socket being closed before all data is sent....

Are you handling all possible socket error codes like EAGAIN, EWOULDBLOCK, ENOBUFS, EINTR (maybe)

tripleslash commented 10 months ago

What's the idle timeout did you set? is it a mistake kick-off? How long did it take to send this large size response? Is it possible that it is an underlying network problem?

I left it at the default, which I believe is something like 30 seconds ? The file loads usually in 65 milliseconds. Is it possible this is a threading problem? I havent been able to reproduce this when I just fetch the file one by one. The error only happens when the Browser loads several files (> 30 javascript files at once)

an-tao commented 10 months ago

What's the idle timeout did you set? is it a mistake kick-off? How long did it take to send this large size response? Is it possible that it is an underlying network problem?

I left it at the default, which I believe is something like 30 seconds ? The file loads usually in 65 milliseconds

Is your OS 64-bit or 32-bit? Please compile drogon and your application in debug mode and set the log level to TRACE to get more logs.

an-tao commented 10 months ago

What's the idle timeout did you set? is it a mistake kick-off? How long did it take to send this large size response? Is it possible that it is an underlying network problem?

I left it at the default, which I believe is something like 30 seconds ? The file loads usually in 65 milliseconds. Is it possible this is a threading problem? I havent been able to reproduce this when I just fetch the file one by one. The error only happens when the Browser loads several files (> 30 javascript files at once)

It's too late here, I'll try to use wrk tomorrow to download files over https over multiple connections, I hope I can reproduce the issue locally.

an-tao commented 10 months ago

@tripleslash I use the press comand of drogon_ctl to download a file in 100 connections over https, the server works fine in my local environment.

tripleslash commented 10 months ago

@an-tao are you connecting from localhost? To reproduce this you should forward the HTTPS port in your router and connect from outside your local network

an-tao commented 10 months ago

@an-tao are you connecting from localhost? To reproduce this you should forward the HTTPS port in your router and connect from outside your local network

Ok, I'll try this way.

an-tao commented 10 months ago

@an-tao are you connecting from localhost? To reproduce this you should forward the HTTPS port in your router and connect from outside your local network

I tested the downloading over internet, it also works fine. logs

tripleslash commented 10 months ago

Not sure why I am consistently reproducing this bug then. I am on OpenSSL 3.1.4

an-tao commented 10 months ago

Not sure why I am consistently reproducing this bug then. I am on OpenSSL 3.1.4

Can you use the same command line to test your server side?

an-tao commented 10 months ago

@tripleslash Happy new year! Have u made some new tests for the TLS hanging issue?