Non-deterministic :bad_match and :function_clause errors (read_timeout)

dfrese commented 7 months ago

Hi there,

occasionally I get :function_clause and :bad_match errors (with the latter exposing data from the db!). It can usually be remedied by increasing the :read_timeout parameter, indicating the Oracle server is just sometimes a little faster, sometimes slower.

But isn't that a bug? It seems to me, that the Erlang code tries to decode a partial message after the socket recv times out, leading to failures in attempting to decode it. The second case here seems suspicious to me: Returning :ok on timeout:

recv(read_timeout, Socket, Length, {_Tout, ReadTout} = Touts, Acc, Data) ->
    case sock_recv(Socket, 0, ReadTout) of
        {ok, NetworkData} ->
            recv(Socket, Length, Touts, <<Acc/bits, NetworkData/bits>>, Data);
        {error, timeout} ->
            {ok, ?TNS_DATA, Data};
        {error, Reason} ->
            {error, socket, Reason}
    end.

The actual error location is usually somewhere in the 'decoder', but it is hidden by the try-catch in jamdb_oracle.ex/sql_query

Maybe there is a reason for that, but shouldn't it return an error? Just removing the second case maybe?
Why is there a separate read_timeout in addition to the overall timeout anyway? If multiple socket reads are needed (because of the fetch size, I guess), why not just calculate an absolute 'end time' from the overall query timeout at the start, and then give each socket recv a remaining relative time - giving up if that is <= 0. In that way, I can say for each Query how long I want to wait, regardless of the fetch size or the way the communication with the server is split. I might want to use different timeouts for different queries, so setting a global read_timeout that serves all is not very practical.

vstavskyi commented 7 months ago

Packet header unfortunately doesn't contain PacketSize for long packets > 8k PacketSize is always <= 8k (for ethernet)

recv() loop is executed until timeout, driver can't endlessly wait last packet's part

PacketSize is here

decode_packet(<<PacketSize:16, _PacketFlags:16, ?TNS_DATA, _Flags:8, 0:16, _DataFlags:16, Rest/bits>>

Add erlang:display and you'll see where break occurs

decode_token(<<Token, Data/binary>>, Acc) ->
    erlang:display({Token, Data,Acc}),
    case Token of

vstavskyi commented 7 months ago

Please test stage branch. Protocol version is different. The header now contains dataflags.

decode_packet(<<PacketSize:32, ?TNS_DATA, _Flags:8, 0:16, DataFlags:16, Rest/bits>>) ->
    BodySize = PacketSize-10,
    case Rest of
        <<PacketBody:BodySize/binary, Rest2/bits>> when DataFlags =:= 0 ->
            {more, ?TNS_DATA, PacketBody, Rest2};
        <<PacketBody:BodySize/binary, Rest2/bits>> when DataFlags =:= 8192 ->
            {ok, ?TNS_DATA, PacketBody, Rest2};

dfrese commented 7 months ago

Packet header unfortunately doesn't contain PacketSize for long packets > 8k PacketSize is always <= 8k (for ethernet)

I don't quite understand; gen_tcp:recv is always called with length=0 as far as I can see. So as long as it is only called when at least one byte is definitely still missing, I don't see the problem. You don't have to "detect" the end of a message via the timeout, do you?

recv() loop is executed until timeout, driver can't endlessly wait last packet's part

Sure. It should wait for the overall/total timeout chosen by the user minus the already elapsed time. (unless the user gives a timeout of :infinity)

vstavskyi commented 7 months ago

0 means that all available bytes are returned. All bytes of one packet. After each data packet of maximal size (or marker packet) recv(read_timeout) receives next packet or aborts by timeout.

dfrese commented 7 months ago

0 means that all available bytes are returned. All bytes of one packet. After each data packet of maximal size (or marker packet) recv(read_timeout) receives next packet or aborts by timeout.

Sure. But I still don't understand why it's necessary to try to decode a packet after a timeout occured. (assuming the timeout would correspond to the query timeout specified by the user; instead of being a fixed global value)

erlang:display({Token, Data,Acc}),

I've seen it with a TTI_RXH token, if that helps. (It only happens with a large 'FETCH' params and a small 'READ_TIMEOUT' btw.)

Please test stage branch. Protocol version is different.

Getting timeouts (DBConnection.ConnectionError) :timeout now for any query, and with small fetch size and large read-timeout. handle_login seems to succeed though. With Oracle 19c, that is.

vstavskyi commented 7 months ago

Getting timeouts

fix

It seems dataflags are not empty only with 23c :(

dfrese commented 7 months ago

That kind of helped. But now it actually waits for the full read_timeout in normal operations! I.e. it gets reaaaally slow with a read_timeout of 500ms; even the connect fails with a read_timeout of 3000. That wasn't the case before. Before, I only got timeouts with complex queries and a slow server.

So this made things much worse, I dare to say.

vstavskyi commented 7 months ago

stage23c is slow with 19c Don't use it.

Try add parameter socket_options with recbuf parameters: [... read_timeout: 500, socket_options: [{:recbuf, 8192}] ...] If sdu=8192 (default) then recbuf=8192 or you can use other values 4096,2048,...

Looks better commit

dfrese commented 7 months ago

Try add parameter socket_options with recbuf parameters: [... read_timeout: 500, socket_options: [{:recbuf, 8192}] ...] If sdu=8192 (default) then recbuf=8192 or you can use other values 4096,2048,...

Why?

I have removed the :error -> :ok clause in my fork (pre-stage23c) and use a read_timeout of 3 seconds. That works ok for me. It's not ideal, because I have basically no control over the maximum time a query can take. But smaller and larger queries work ok, if the server is fast enough. And in my usecase it's not a huge problem, as long as a query doesn't fail after only 500ms, because the server needs that quite often. (And because after 5 seconds, a Phoenix UI will usually fail anyway).

Overall I just wanted to make a suggestion for improvement. I don't see why the read_timeout parameter is necessary at all. It doesn't correspond to any external/high level behaviour of running queries (*), nor does it affect the server, as far as I understand it.

(* something similar might make sense for stream() queries; but not for ordinary queries. And Ecto.Repo.stream() has a separate max_rows parameter anyway, and probably a default semantic for what the query timeout means for stream() queries)

dfrese commented 6 months ago

Hi,

even with my changes (not trying to decode after a socket timeout), I get sporadic :bad_match errors in production. They come in two forms, and start like

{:badmatch, <<0, 0, 1, 10, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 61, 0, 1, 1, 0, 0, 0, 0, 25, 79, 82, 65 ... >>}
{:badmatch, <<1, 10, 1, 10, 0, 6, 34, 1, 7, 0, 1, 100, 0, 0, 0, 7, 2, 193, 50, 3, 194, 17, 33, 3, 194, 9, 17, ... >>}

Tried to enforce it by doing many queries, but without luck. Happens every few hours, so it's hard to collect any more information. Can you see from the bytes what the problem might be?

Thanks!

dfrese commented 5 months ago

I also see this sporadically

(MatchError) no match of right hand side value: "03: no data found\n"

(jamdb_oracle 0.5.9) src/jamdb_oracle_tns_decoder.erl:495: :jamdb_oracle_tns_decoder.decode_ub4/1
(jamdb_oracle 0.5.9) src/jamdb_oracle_tns_decoder.erl:86: :jamdb_oracle_tns_decoder.decode_token/2
(jamdb_oracle 0.5.9) src/jamdb_oracle_tns_decoder.erl:131: :jamdb_oracle_tns_decoder.decode_token/3

erlangbureau / jamdb_oracle

Non-deterministic :bad_match and :function_clause errors (read_timeout) #176