adwhit / itchy-rust

NASDAQ ITCH parser in Rust
MIT License
46 stars 14 forks source link

Partial reads #3

Open adwhit opened 7 years ago

adwhit commented 7 years ago

At the moment, the code assumes that when we read to the buffer from a reader, it either fills the buffer totally or fills it to EOF. However I believe it is possible to partially fill the buffer even when not at EOF, and I think this will cause breakage. Investigate and fix

petr-tik commented 5 years ago

Not sure if this partial reads is what I understand by stream-friendly parsing.

The way I see it now, the library is best equipped to parse itch messages from a binary blob on disk, which is most useful to replay applications but not parsing ITCH messages from network buffers.

According to the current implementation of next: https://github.com/adwhit/itchy-rust/blob/master/src/lib.rs#L145-L202

If parsing fails with Err::Incomplete, we copy left over bytes, fill the buffer to the end from our reader and, crucially, restart parsing from the start of the buffer.

This way we might successfully parse all but 1 last byte of a StockDirectory message with the last byte (bool for inverse_indicator) missing. Currently, we first copy all the bytes that we just successfully parsed over to the beginning of the buffer, then read more bytes until the end of the buffer and restart parsing thus doubly parsing the first 30+ bytes of a StockDirectory message.

We can increase the size of the buffer - thus lowering the number of chances for messages to be broken by the buffer boundary. This will solve the problem for reading itch files, but not streaming them.

I have been googling on the topic and it looks like the combine parser library and parsell support partial parsing.

I appreciate this would mean a re-write and would only do it, if there was a serious need for it, so this ticket is more of a literature/problem domain review rather than a call to action for now.

adwhit commented 5 years ago

I think what you have outlined is a separate problem from the issue. I think my concern was that in general, the read system call can read an arbitrary number of bytes into a buffer and may not fill it up even if we are not at EOF (though usually it will). In which case, the library will error.

However looking at the code I cannot see why I thought this was the case, and it's a difficult scenario to test.

Re: your point, TBH the cost of a re-parse is tiny and relatively rare. Unless you are doing serious HFT an cannot afford any jitter, I would just increase the size of the buffer a bit (or a lot).