[Bug] websocket::connection::read does not interleave with websocket::connection::send

0x00002a commented 3 years ago

This is mostly for tracking. I'm currently working on a fix and have a test which repros it (ish, see below)

Currently send's are queued up and executed in order by the websocket connection, making it much easier for users since they do not need to worry about manually syncing them. However, read is not interleaved or indeed queued at all, which is somewhat suprisingly behaviour (to me anyway on encountering it). The result of read not being properly interleaved is that it is possible to call send and then read but have the read block waiting for messages before the send ever goes through (which can result in both peers trying to read and the connection shutting down).

Due to the nature of the bug (relies on send not being dispatched before the read call is hit), its a pain to reproduce, and the test I've created only reproduces it some of the time.

0x00002a commented 3 years ago

I'm not 100% on this. It seems it is valid for there to be an async read/write pair active at the same time (according to the comment linked below). What I said above about the read coming in before the write is true, but if there can be a read and write active it can't be the issue. ~~Also having both peers reading is perfectly fine, at least in my usage of it (I'm guessing beast sends keep alive stuff automatically or we set it somewhere?).~~ nvm, it will timeout unless pingpong messages are sent. Not sure if this should be added to connection as well?

It is still a problem to call read multiple times before the last one finished whereas send can be called as many times as needed and will queue. I think the main advantage to queueing reads as well as sends is that it makes it easier on the user, they wouldn't have to worry about ensuring nobody calls read while one is active. My "fix" for this currently also queues connect, disconnect, and accept as well meaning the user wouldn't have to worry about trying to send messages before being connected or accepting (since the user may use connection outside of the controller context on the clientside and on server side it is waiting for accept).

If this is still valid, and I've interpreted it right, it might be better to have two queues, one for reads and one for writes. But that increases the complexity by quite a bit and I can't think of a use case for it.

Thoughts?

0x00002a commented 3 years ago

I vote for adding this (the queueing of all everything that is). I also suggest that if we don't add it, we should remove it from send as otherwise its just confusing. imo for now having the same read and write queue seems to work, and it can be improved upon later if there's a case where the added complexity is worth it.

We might want to also offer some sort of "ejection" mechanism, where the user can get access to the websocket::stream (via an rvalue method) so if they really want to bypass all the "overhead" of features such as this they can. If we did this though stream would certainly need some improvement, like supporting asio executors (so the user could use coroutines with it for example).

Tectu commented 3 years ago

I'm sorry, I will get back to you on this shortly - I'm not ignoring it!

Tectu commented 3 years ago

Sorry for not getting back on this sooner. I really needed to implement multipart/form-data support 🤦

In general I think having a command queue is a good design choice as for some of the reasons you outlined/mentioned. However, there are some caveats in my opinion. Currently I am mainly thinking of disconnect(): I'm not sure whether it's a good idea to force the disconnect action to pass through the queue. disconnect() should in my opinion close the connection immediately and empty the queue. But then again, that leads to thinking that there are situations where a user might want to queue multiple send/write actions and then disconnect AFTER everything was written. Should this behavior be achieved by having the disconnect action go through the queue and then manually bypassing the queue if disconnect should happen immediately or the other way around?

Doesn't seem like a simple design choice to me :p

But I do heavily agree that we should have consistency between read & write operations. Either both have a queue or neither (the queue might be the same for both read & write operations (along with the remaining actions/operations).

0x00002a commented 3 years ago

Currently I am mainly thinking of disconnect(): I'm not sure whether it's a good idea to force the disconnect action to pass through the queue

Didn't think of that one, good point.

imo disconnect should be queued because it makes sense with the rest of the interface, and another method like force_disconnect should be added which does the unqueued version. I'm not sure if we should clear the queue, I think it will essentially clear itself by erroring out with error::closed which might be more intuitive behaviour.

Also, it is ok to call async_close while other async operations are active:

Like a regular Boost.Asio socket, a stream is not thread safe. Callers are responsible for synchronizing operations on the socket using an implicit or explicit strand, as per the Asio documentation. The websocket stream asynchronous interface supports one of each of the following operations to be active at the same time: async_read or async_read_some async_write or async_write_some async_ping or async_pong async_close

(beast docs)

Tectu commented 3 years ago

Having disconnect() go through the queue and having an additional force_disconnect() which bypasses the queue sounds like a good idea to me.

I take it that we must not have more than one async_close() taking place at the same time but I think in this case we have to leave it up to the user not to call force_disconnect() while there is already a disconnect() operation in the queue (as timing might be bad and they end up issuing async_close() simultaneously.

0x00002a commented 3 years ago

I've got the branch fix-ws-interleaved-io which needs some work and updating based on this discussion, but I can work on it when I get back or possibly this week.

I've been experimenting with coroutines recently and I think I can make a "queued actions" container, which would greatly simplify adding queues for each action type (read, write, close, etc). Currently this is impeded by the fact that the async callback has to tell the next action to execute but only after its completely done, which is simply not possible to wrap with nested callbacks, unless I'm missing something. If I did end up using coroutines they would not need to be exposed in the interface but they would have to be used for all the stuff wrapped (async_read, async_write, etc). Is it acceptable to add coroutine code to malloy currently?

Tectu commented 3 years ago

Last time I checked compiler support for coroutines were pretty... bad. However, it seems like both GCC 10 and MSVC 19.28 have support. Clang is listed as partial - I have no idea what that actually means in practise.

I definitely see the beauty of using coroutines in malloy. It would make many things... better. Probably even by a long shot. My requirement is still that the library can be built using MinGW-w64 (eg. mingw-w64-x86_64-gcc from MSYS2 which is currently GCC 10.3). If there's enough compiler support to have everything built with:

MinGW-w64 GCC 10.3
Some decent recent version of MSVC I'm pro moving to coroutines. In that case I might want to wait with the v0.1 release until we have everything changed over to coroutines.

0x00002a commented 3 years ago

I'm pro moving to coroutines. In that case I might want to wait with the v0.1 release until we have everything changed over to coroutines.

Moving everything to coroutines might be quite the job. Mostly thinking of the problems mentioned in #40. Also we might have to roll a few of our own coroutines as asio is rather lacking there (awaitable seems to be it, and that doesn't support co_return even).

I'm personally leaning more toward having coroutines (in the interface) in the release after 0.1 as a major API change since I think theres a few parts which could be improved with coroutines (e.g. client::controller::ws_connect could return a connection again, and server handlers could optionally be coroutines). This probably should be its own issue though :p

0x00002a commented 3 years ago

Update on this, its on my todo list but I'm very busy currently, so it'll probably be a few days before I can finish this off

0x00002a commented 3 years ago

So I'm almost done with this, but I've run into a pretty major issue. clang is not compatible with libstdc++'s coroutine header (llvm bug report). The linked issue does have a hack to that might allow it to work but its a pretty big "might", and I don't like the idea of putting it in a library header.

The other option is using libc++. Unfortunately as discussed in #46, libc++ does not support concepts until v13 (which isn't released yet). So we're kinda stuck.

Personally I quite like the idea of just dropping clang support until 13 is released and/or it catches up with gcc or msvc in terms of supported features. This isn't great as it means clang based tools, which is most of C++ tooling afaik, are unreliable/don't work at all.

@Tectu Do you currently need clang support? iirc you need it to replace mingw?

Tectu commented 3 years ago

lol, this is hideous.

I am currently not relying on malloy being able to compile with clang < 13.0. As of today, I will not be able to get rid of the need to have malloy work with MinGW-w64 / MSYS2. Anything that won't built in that configuration is something I can't currently pull into the project. Lets hope for GCC 11 to be released soon...

I do need to build malloy on FreeBSD but I can use llvm-devel or alternatively just gcc which works fine.

This hurts, it really does. It feels wrong - very much so. But I do think that dropping clang support until 13.0 is available is reasonable in this shituation.

0x00002a commented 3 years ago

Anything that won't built in that configuration is something I can't currently pull into the project

Just to be clear, are you talking about malloy here? iirc the CI was switched to clang because mingw kept running out of space for identifier names (or that was my best guess). As far as I'm aware, that won't be fixed by more modern versions of gcc, its a bug in the mingw layer. I feel I may have messed up here as I removed mingw from the CI in #51.

Tectu commented 3 years ago

Yes, I am talking about malloy. I understand that the MSYS2 CI was adjusted. However, building locally I am not running into these issues and as of today I am able to build the current main and app_fw branches without any problems on Windows with MinGW-w64 from MSYS2.

Moving everything (i.e. the applications consuming malloy - which are in three cases huge code bases) from MinGW-w64 to clang turned out to be unfeasible in the current situation. Therefore, I am stuck with having to build everything using MinGW-w64 on Windows.

Changes which break this behavior would be a problem for me.

I feel I may have messed up here as I removed mingw from the CI in #51.

Nah, you didn't mess up anything in my opinion. Your contributions to this project have great value and are highly appreciated :)

0x00002a commented 3 years ago

Nah, you didn't mess up anything in my opinion. Your contributions to this project have great value and are highly appreciated :)

Well thank you, I appreciate it. Turns out I didn't turn off the mingw CI after all, I just relabelled it as clang instead, presumably because I was going to switch it and then didn't :p.

In other news, I've run into another few issues. After a few hours of staring at link errors I found this buried in asio: https://github.com/boostorg/asio/commit/2e7b0be5f73f7de0b2560643d9ea95d65db89b7b, which stops it building on boost < 1.76 + msvc and also this problem in gcc 10.3, which I can't reproduce but which matches the link errors shown by the CI, and stops it building on msys/mingw.

I'm starting to wonder if it might be better to implement a hacky solution now, which doesn't use coroutines but allows easy upgrading to them, rather than trying to get them to work with the current implementations.

Tectu commented 3 years ago

I guess that is what we get for dealing with "bleeding edge" C++ 🙈 Also seems like we roll back to my initial statement from a few weeks ago that when I looked into using coroutines things didn't work out well becasue of compiler support (and the various connected issues that you encountered & listed one by one).

How hacky would the hacky solution be?

0x00002a commented 3 years ago

How hacky would the hacky solution be?

Off the top of my head, I was thinking the callbacks could be passed a function to call when done. Its not great because its an obscure runtime bug waiting to happen if a line of code is forgotten but its not terrible imo

Tectu commented 3 years ago

Well, this is a C++ library after all so preventing the user from shooting him or herself in the foot is not always possible. I'd say document it very, very well and then we do this until we find ourselves in a moment where coroutines support is good enough to migrate.

Tectu / malloy

[Bug] websocket::connection::read does not interleave with websocket::connection::send #70