Remove the requirement for `std::io::Seek` for PacketReader etc.

TonalidadeHidrica commented 3 years ago

Currently, readers like PacketReader requires std::io::Seek. But the parsing process of Ogg container is actually streamable. Therefore, it may be more useful to remove those constraints for most of functions, except for seek_bytes and seek_absgp, which truly demand Seek.

est31 commented 3 years ago

Seek is required so that an additional layer of buffering can be avoided. Basically, sometimes there can be crap between ogg pages, so the ogg reader reads chunks of data to read until the next header. Once it found the header, it knows its exact position so it can seek there and further code doesn't need to handle partially parse chunks.

That's why Seek is required.

TonalidadeHidrica commented 3 years ago

So, is seek required when returning to beginning of the header that was found while recovering from corrupt? Then, isn't it enough to store merely thirty-or-something page header buffer? Maybe I am just misunderstanding your explanation. Could you show me the code for the processed you described?

est31 commented 3 years ago

@TonalidadeHidrica

https://github.com/RustAudio/ogg/blob/1da041df375aecaaa935971a621c0f69e98fc269/src/reading.rs#L752-L773

TonalidadeHidrica commented 3 years ago

After reading the code for n hours, I finally understand why Seek is needed. When searching for header, if there wasn't a magic within the first 27 bytes, it reads the succeeding output in a (maximum of) 1024-byte chunk each. This is repeated until OggS magic is found. When it was found in the midst of the buffer read, the pointer of underlying reader should be moved back so that the subsequent process can read the content of the page succeeding right after the page header, and that's why the reader has to be Seeked, right?

But I think this can be done by reading byte-by-byte from the reader while matching against the magic pattern. Then it no longer require Seek. Additionally, I doubt the current algorithm enables us to avoid "an additional layer of buffering", because it simply discards the bytes it obtained from the reader rather than passing them to chunk reader. On the contrary, if some buffered reader is provided to the ogg reader, it even results in the data buffered twice.

What do you think? If my idea seems to be good, I'll refactor the UntilPageHeaderReader.

est31 commented 3 years ago

When it was found in the midst of the buffer read, the pointer of underlying reader should be moved back so that the subsequent process can read the content of the page succeeding right after the page header, and that's why the reader has to be Seeked, right?

Yup that's correct.

But I think this can be done by reading byte-by-byte from the reader while matching against the magic pattern. Then it no longer require Seek.

That would work from a functionality standpoint, but note that it would then call the read function many many times. If the implementation calls a syscall inside, it would be quite slow from a performance standpoint. This can be fixed by using a BufReader or requiring the BufRead trait. That's what I meant by additional level of buffering.

TonalidadeHidrica commented 3 years ago

I know that there is a concern of read function called over and over again, but I think it should be the user's responsibility to avoid it by wrapping the reader by a BufRead-ish reader. We don't have to reinvent the wheel, why not delegate it to an existing one?

By the way, when it comes to adopt my idea, which do you think is better: require BufRead instead of Read, or write an extra caution on the document? The former enables us to make sure that the user will never write an inefficient code, while the latter provides with more flexibility. I don't think wrapping with BufReader internally is not an good idea neither.

est31 commented 3 years ago

require BufRead instead of Read, or write an extra caution on the document?

That'd be the best option imo. I'm not that convinced that the Seek requirement is that bad tho.

TonalidadeHidrica commented 3 years ago

I'm wondering which of those two:

require BufRead instead of Read, or
write an extra caution on the document

is better, and I want your opinion. (Sorry for my confusing text.)

I think that it'd be better to remove Seek requirements because of RustAudio/rodio#333 . When decoding an streamed audio data via internet, as the author of the issue says, it's hard to implement Seek on the reader itself. Removing the bound makes the library more flexible.

florian1345 commented 2 years ago

I have recently run into this issue myself. Do I understand correctly that seeking can only occur in UntilPageHeaderReader::do_read up to ~1024 (or whatever the length of the buffer) bytes backwards? If so, I would suggest a solution that would preserve the behavior of all projects that currently use this crate.

Add a trait that requires a method to read into a newly allocated buffer of 1024 and return any AsRef<[u8]> (which I will call read_revertable) as well as a method to seek backward at most 1024 bytes (which I will call revert).
Blanket-implement that trait for all types implementing Seek as doing what UntilPageHeaderReader::do_read does now: allocate a new [u8; 1024], read into it, and return it. As for revert, just use Seek.
Implement the new trait for a wrapper around types implementing only Read that, on read_revertable, reads into its own buffer and returns only a reference to it. When it is asked to revert, it remembers to output bytes from the stored buffer first before querying the wrapped Read. Branch prediction should keep performance loss at a minimum in the common case when no seeking is required.
Replace the allocation of a buffer and the reading done in UntilPageHeaderReader::do_read by a call to read_revertable and the seeking by a call to revert.

Unless I am missing something, this should still yield the same behavior for types implementing Seek, but allow users to efficiently use the decoder with types that do not implement Seek with an explicit opt-in. It also only buffers whenever it is necessary, so it should be more light-weight than using a BufReader.

An alternative fix would be to expose the size of the internal buffer to allow users to implement a custom buffering Seek wrapper without having to cache the entire file or rely on implementation details of this crate that may be subject to change.

I might be missing something here, but if any of these ideas sound good, I can attempt an implementation myself.

est31 commented 2 years ago

@florian1345 that proposal sounds good! You are welcome to file a PR :).

RustAudio / ogg

Remove the requirement for `std::io::Seek` for PacketReader etc. #12