Thriftpy / thriftpy2

Pure python approach of Apache Thrift.
MIT License

Thriftpy2 hangs while decoding string #131

Closed by carloVentrella 4 years ago

carloVentrella commented 4 years ago

Hi, while using impyla to fetch results from Impala, we ran into a problem that seems to be related to the decoding phase in thriftpy.

We're using version 0.4.11.

With the default batch_size (we also tried smaller ones), thriftpy gets stuck while parsing the input buffer. Below is the relevant snippet from binary.py, with some print statements added for troubleshooting.

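elif ttype == TType.STRING:
    print("reading string")
    sz = unpack_i32(inbuf.read(4))  # 4-byte length prefix of the string
    byte_payload = inbuf.read(sz)   # the string payload itself
    # The value is passed as a second argument rather than formatted,
    # so the literal "%s" shows up in the output below.
    print("sz: %s", sz)
    print("byte_payload: %s", byte_payload)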

We did some troubleshooting. The Impala query is very simple:

SELECT col FROM default.mytable where date="20200120" order by col2 limit 300

We can see that thriftpy decodes many rows and then blocks in inbuf.read(sz). Interestingly, the last payload contains some bytes that are not present in the original column.

byte_payload: %s b'44444|A1|11111|20200120|980e417f-4773-4d30-7703-5821420ac5eb'
reading string
sz: %s 45
byte_payload: %s b'c6c6c|AXXX_YYY|11111111|20200120|689541623112'
reading string
sz: %s 60
byte_payload: %s b'55555|A1|00000|20200120|2d8638d7-f5f6-4b51-ba58e1a0b15cb\x00\x00\x00='
reading string
<--- stuck on inbuf.read(sz) of the next row --->

We see that \x00\x00\x00= has been appended to the original value.
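One possible reading of those trailing bytes, offered as a hypothesis rather than a confirmed diagnosis: the Thrift binary protocol writes a string as a 4-byte big-endian length followed by the payload, so \x00\x00\x00= may be the length prefix of the next string that was pulled into this read. The sketch below uses only the standard library and the payload copied from the log above; the variable names are illustrative and not part of thriftpy.

```python
import struct

# Payload printed just before the hang (copied from the log above).
observed = b'55555|A1|00000|20200120|2d8638d7-f5f6-4b51-ba58e1a0b15cb\x00\x00\x00='

# Split off the four unexpected trailing bytes.
value, trailer = observed[:-4], observed[-4:]

# Interpret the trailer as a big-endian signed 32-bit integer, the same
# layout a Thrift binary string length uses.
(next_len,) = struct.unpack("!i", trailer)

print(len(observed))  # 60, matching the sz reported in the log
print(len(value))     # 56, four bytes shorter than sz
print(next_len)       # 61
```

If this reading is right, four bytes went missing from the middle of the value, the read made up the difference from the next field's length prefix, and the following read would then treat string data as a length. That could explain a read that never returns, but it does not say where the bytes were lost.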

This is not a problem with the value itself, because the following query returns it correctly:

SELECT *  from default.mytable where col like '55555|A1|00000|20200120|2d8638d7-f5f6-4b51-ba58e1a0b15cb' limit 1

We also ran similar queries and hit similar issues. Here is the most significant one. Before getting stuck we get:

byte_payload: %s b'15155|A1|81500|20200120|2d8638d7-f5f6-4b51-ba58e1a0b15cb'

Here there are no strange characters, but the value is incorrect. In Impala the value is:

15155|A1|81500|20200120|2d8638d7-f5f6-4b51-baa5-858e1a0b15cb

It's not immediately obvious, but the value printed inside thriftpy is missing an a5-8 that is present in the database.
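Since the difference is hard to spot by eye, a small helper can locate it. This is a minimal sketch that assumes the two values differ by a single contiguous deletion; missing_span is a hypothetical name, not something from thriftpy or impyla.

```python
def missing_span(expected: str, observed: str) -> str:
    """Return the part of expected that is absent from observed,
    assuming the two differ by one contiguous deletion."""
    limit = min(len(expected), len(observed))
    # Length of the common prefix.
    p = 0
    while p < limit and expected[p] == observed[p]:
        p += 1
    # Length of the common suffix, capped so it cannot overlap the prefix.
    s = 0
    while s < limit - p and expected[-1 - s] == observed[-1 - s]:
        s += 1
    return expected[p:len(expected) - s]


# Values copied from the report above.
in_impala = '15155|A1|81500|20200120|2d8638d7-f5f6-4b51-baa5-858e1a0b15cb'
printed = '15155|A1|81500|20200120|2d8638d7-f5f6-4b51-ba58e1a0b15cb'

print(missing_span(in_impala, printed))   # a5-8
print(len(in_impala) - len(printed))      # 4
```

The printed value is exactly four bytes shorter than the stored one, the same amount by which the first example's payload overran its value.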

Do you have any idea what the problem could be?

Thank you!

ethe commented 4 years ago

Hi, it is not easy to reproduce your case. Could you please find an easier way to reproduce it? Then I can start testing and investigating it.

ethe commented 4 years ago

Maybe you can just construct the data that can reproduce the bug and decode it directly.
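As a starting point for such a reproduction, here is a minimal standard-library sketch of what "construct the data and decode it directly" could look like. It does not call thriftpy; write_string and read_string are hypothetical stand-ins that mimic the length-prefixed string layout of the binary protocol, the second row is made up, and dropping four bytes from the stream is an assumption about the cause, not a confirmed one.

```python
import io
import struct


def write_string(buf: io.BytesIO, value: bytes) -> None:
    """Write a string as a 4-byte big-endian length plus payload."""
    buf.write(struct.pack("!i", len(value)))
    buf.write(value)


def read_string(buf: io.BytesIO) -> bytes:
    """Read one length-prefixed string, like the STRING branch quoted above."""
    (sz,) = struct.unpack("!i", buf.read(4))
    return buf.read(sz)


rows = [
    # 60-byte value taken from the report.
    b'15155|A1|81500|20200120|2d8638d7-f5f6-4b51-baa5-858e1a0b15cb',
    # Hypothetical 61-byte next row, invented for this sketch.
    b'x' * 61,
]

out = io.BytesIO()
for row in rows:
    write_string(out, row)
wire = out.getvalue()

# Assumption: four bytes ("a5-8") are lost inside the first payload.
# Offset = 4-byte length prefix + 45 bytes of payload before the gap.
cut = 4 + 45
corrupted = wire[:cut] + wire[cut + 4:]

inbuf = io.BytesIO(corrupted)
first = read_string(inbuf)
print(first)    # value without "a5-8", followed by b'\x00\x00\x00='

# The next length is now read from string data instead of a real prefix.
(next_sz,) = struct.unpack("!i", inbuf.read(4))
print(next_sz)  # 2021161080; a socket read of that size would wait indefinitely
```

With these made-up inputs the sketch shows both symptoms from the report: a payload ending in \x00\x00\x00= and a follow-up read asking for far more bytes than will ever arrive. Feeding the corrupted bytes to the real decoder would be the next step, and capturing the raw bytes received from Impala would show whether the loss happens before thriftpy sees them.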

ethe commented 4 years ago

Closing this for now.