Closed tripleslash closed 10 months ago
@an-tao I wasn't sure if this was the proper way to prevent trantor from closing the connection in case we're sending an async stream
@tripleslash Thanks for your patch, but IMHO, The kick-off feature should not be disabled internally, instead, users should ensure that it is not disconnected by setting the idle_connection_timeout configuration option to a reasonable value. I've added the boolean flag in my PR according your advice.
@an-tao isnt it missing an extendLife Call in that case? The asycstream send call currently doesnt do that anywhere
@an-tao but i still disagree. For example i added a Log streaming endpoint with this feature but in Release Mode the log can be very quiet for several minutes and i cant guarantee any time for when new output would happen. Maybe we could add a config parameter that disables kickoff for async streams?
We can change the API to:
virtual AsyncStreamPtr sendAsyncStream(bool disableKickoff=false) = 0;
and if the flag is true, we should enable kickoff again when the stream is closed. After mine is merged you could create a new PR for this.
Not sure how reenabling the kickoff works what im doing right now is removing the entry and Timing wheel ptr, how can I reset them?
An Tao @.***> schrieb am Fr., 22. Dez. 2023, 08:12:
We can change the API to:
virtual AsyncStreamPtr sendAsyncStream(bool disableKickoff=false) = 0;
and if the flag is true, we should enable kickoff again when the stream is closed. After mine is merged you could create a new PR for this.
— Reply to this email directly, view it on GitHub https://github.com/an-tao/trantor/pull/313#issuecomment-1867323920, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABATEP3H3HCM6UOCC3HIVFDYKUXG3AVCNFSM6AAAAABA6QAKSWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRXGMZDGOJSGA . You are receiving this because you were mentioned.Message ID: @.***>
@an-tao i changed your log message to a single line and made it warning level - this should make it easier to tell when many requests are happening at the same time (and I suppose something went wrong if the buffer still has unsent content when the tcp connection is closed)
@an-tao isnt it missing an extendLife Call in that case? The asycstream send call currently doesnt do that anywhere
And what about that case? In case we use disableKickoff = false, then the AsyncStream send() method should cause an extendLife call somewhere down the chain?
@an-tao Can you look over my latest commit and check if I missed something when reenabling this? I tried to implement the disableKickoff parameter feature as you said
@an-tao Can you look over my latest commit and check if I missed something when reenabling this? I tried to implement the disableKickoff parameter feature as you said
LGTM And I changed the position calling extendLife(), you could rebase it and remove the high layer calling.
@an-tao extendLife is only called from within the loop, right? So what is the convention here? Shouldn't it call timingWheel->insertEntryInloop then?
@an-tao extendLife is only called from within the loop, right? So what is the convention here? Shouldn't it call timingWheel->insertEntryInloop then?
Yes, It's increase the reference count of the kickoff-entry. I put it in three position: in the read Callback,in the writeRow method, and after calling the sendfile() syscall.
new edit: After some investigation, I thought extendLife()
might be called from a different loop to the timing wheel.
Let's catch the logs when TLS hanging happens. @tripleslash
Let's catch the logs when TLS hanging happens. @tripleslash
I'll try to do some more testing later
@an-tao
20231223 13:40:52.998560 UTC 390 DEBUG [~TcpConnectionImpl] write node list size: 1 first node is file? 1 first node is stream? 0 first node is async? 0 first node size: 501866 buffered TLS data size: 12058 - TcpConnectionImpl.cc:108
This is what the log prints when the error occurs
@an-tao
20231223 13:40:52.998560 UTC 390 DEBUG [~TcpConnectionImpl] write node list size: 1 first node is file? 1 first node is stream? 0 first node is async? 0 first node size: 501866 buffered TLS data size: 12058 - TcpConnectionImpl.cc:108
This is what the log prints when the error occurs
This means there is some data in the tls provider buffer but the connection doesn't enable writing or the provider can't send the buffered data. please find related logs before the error. such as:
The line I've sent you is the only thing that is being logged to drogon.log, meaning those line don't appear anywhere in the log
The line I've sent you is the only thing that is being logged to drogon.log, meaning those line don't appear anywhere in the log
Can you find any clues in the logs that might be relevant? any errors reported by the connection or tls provider?
What's the idle timeout did you set? is it a mistake kick-off
?
How long did it take to send this large size response? Is it possible that it is an underlying network problem?
The line I've sent you is the only thing that is being logged to drogon.log, meaning those line don't appear anywhere in the log
Can you find any clues in the logs that might be relevant? any errors reported by the connection or tls provider?
Nothing unfortunately. I will compile this later with TRACE logging on to see if that logs any useful information
What's the idle timeout did you set? is it a mistake
kick-off
? How long did it take to send this large size response? Is it possible that it is an underlying network problem?
It is not very large, the file is only 395 kb in size. It is part of a javascript bundle file and when my angular UI loads many such javascript bundle files are loaded in parallel by the browser (Chrome Desktop) and sometimes one of the requests fails with the socket being closed before all data is sent....
Are you handling all possible socket error codes like EAGAIN, EWOULDBLOCK, ENOBUFS, EINTR (maybe)
What's the idle timeout did you set? is it a mistake
kick-off
? How long did it take to send this large size response? Is it possible that it is an underlying network problem?
I left it at the default, which I believe is something like 30 seconds ? The file loads usually in 65 milliseconds. Is it possible this is a threading problem? I havent been able to reproduce this when I just fetch the file one by one. The error only happens when the Browser loads several files (> 30 javascript files at once)
What's the idle timeout did you set? is it a mistake
kick-off
? How long did it take to send this large size response? Is it possible that it is an underlying network problem?I left it at the default, which I believe is something like 30 seconds ? The file loads usually in 65 milliseconds
Is your OS 64-bit or 32-bit? Please compile drogon and your application in debug mode and set the log level to TRACE to get more logs.
What's the idle timeout did you set? is it a mistake
kick-off
? How long did it take to send this large size response? Is it possible that it is an underlying network problem?I left it at the default, which I believe is something like 30 seconds ? The file loads usually in 65 milliseconds. Is it possible this is a threading problem? I havent been able to reproduce this when I just fetch the file one by one. The error only happens when the Browser loads several files (> 30 javascript files at once)
It's too late here, I'll try to use wrk tomorrow to download files over https over multiple connections, I hope I can reproduce the issue locally.
@tripleslash I use the press comand of drogon_ctl to download a file in 100 connections over https, the server works fine in my local environment.
@an-tao are you connecting from localhost? To reproduce this you should forward the HTTPS port in your router and connect from outside your local network
@an-tao are you connecting from localhost? To reproduce this you should forward the HTTPS port in your router and connect from outside your local network
Ok, I'll try this way.
@an-tao are you connecting from localhost? To reproduce this you should forward the HTTPS port in your router and connect from outside your local network
I tested the downloading over internet, it also works fine.
Not sure why I am consistently reproducing this bug then. I am on OpenSSL 3.1.4
Not sure why I am consistently reproducing this bug then. I am on OpenSSL 3.1.4
Can you use the same command line to test your server side?
@tripleslash Happy new year! Have u made some new tests for the TLS hanging issue?