Skip to specific byte sequence

dmendel / bindata

BinData - Reading and Writing Binary Data in Ruby

BSD 2-Clause "Simplified" License


Skip to specific byte sequence #78

Closed stefan-kolb closed 8 years ago

stefan-kolb commented 8 years ago

Hi,

I have a byte stream containing different data structures, and I don't know the exact layout of some of them. However, I know certain byte constants that will occur inside the stream. Is there any way to search for a specific byte sequence inside a stream and skip to it?

dmendel commented 8 years ago

BinData doesn't provide read-ahead and pushback for streams, which is what you'd need for your use case.

It's a useful feature so I'll consider adding it.

If your stream is seekable, you could do two passes. The first pass would find the offsets of your specific byte sequences, and the second pass would perform the actual parsing.
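A minimal sketch of the two-pass idea, using only Ruby's standard library. `find_offsets`, the `MAGIC` marker, and the 14-byte record size are hypothetical names and values chosen for illustration; none of them come from BinData. The sketch also assumes the remaining stream fits in memory.

```ruby
require "stringio"

# Pass 1: collect the byte offset of every occurrence of `marker`.
# `find_offsets` is a hypothetical helper, not part of BinData.
def find_offsets(io, marker)
  marker = marker.b
  start = io.pos
  data = (io.read || "").b   # assumption: the rest of the stream fits in memory
  io.seek(start)             # rewind so the second pass starts where we began
  offsets = []
  idx = 0
  while (hit = data.index(marker, idx))
    offsets << start + hit
    idx = hit + 1
  end
  offsets
end

io = StringIO.new("junk\x01\x02MAGICpayload-1....MAGICpayload-2".b)
offsets = find_offsets(io, "MAGIC")

# Pass 2: seek to each offset and parse from there.
# Reading 14 raw bytes stands in for the real parsing step.
records = offsets.map do |off|
  io.seek(off)
  io.read(14)
end
```

In real use, the second pass would hand the positioned `io` to whatever parser handles the structure found at that offset.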

stefan-kolb commented 8 years ago

Yeah, that's what I'm looking for. Something like skip :to => byte_sequence, where we search from the current stream position with a ring buffer the size of byte_sequence and set the position to the next byte after the sequence. Pushback would also be nice :smile: Btw, thanks a lot for your work on this awesome library :+1:
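The ring-buffer search proposed here could look roughly like the following in plain Ruby. `skip_past` is a hypothetical helper written for this sketch; it is neither the proposed `skip :to` syntax nor anything BinData provides.

```ruby
require "stringio"

# Hypothetical helper (not a BinData API): advance `io` to the byte just after
# the next occurrence of `sequence`, remembering only the last
# `sequence.bytesize` bytes read.
def skip_past(io, sequence)
  target = sequence.bytes
  return true if target.empty?
  window = []                      # sliding window acting as the ring buffer
  while (byte = io.getbyte)
    window << byte
    window.shift if window.size > target.size
    return true if window == target
  end
  false                            # sequence not found; io is now at EOF
end

io = StringIO.new("abcdefgh".b)
found = skip_past(io, "cde")
rest = io.read                     # whatever follows the matched sequence
```

Because the window slides one byte at a time, a match is found regardless of where in the stream it starts, and no seeking or pushback is needed.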

dmendel commented 8 years ago

Added in dmendel@4232239.

You can now skip to any BinData expression, not just a byte sequence. Syntax is:

class A < BinData::Record
  skip do
    string :read_length => 4, :assert => "abcd"
  end

  # we are now aligned to 'abcd'
end

stefan-kolb commented 8 years ago

Wow, that was fast :smile: ! Thank you so much! Just two questions:

Do we really need a :read_length here? Shouldn't this be determined by the size of the assertion?
Does the code search only in :read_length-sized chunks? If so, we would not find a sequence like cde in abcdef if we set read_length to 3 and start from offset zero. Maybe an explicit test for this should be added.

dmendel commented 8 years ago

Do we really need a :read_length here? Shouldn't this be determined by the size of the assertion?

Yes, we do need :read_length, for clarity when using multibyte characters. This was discussed previously here: https://github.com/dmendel/bindata/issues/40
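As a small illustration of the multibyte point (a sketch that does not reproduce the discussion in the linked issue), a Ruby string's character count and byte count diverge as soon as it contains non-ASCII characters, so a length inferred from the string alone could be read either way:

```ruby
text = "h\u00e9llo"           # "héllo"; the é takes two bytes in UTF-8

char_count = text.length      # counts characters
byte_count = text.bytesize    # counts bytes, which is what a binary reader consumes
```

Here `char_count` is 5 while `byte_count` is 6.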

Does the code search only in :read_length-sized chunks?

There's a more detailed example in the wiki. https://github.com/dmendel/bindata/wiki/AdvancedIO#skipping-over-unused-data

If so, we would not find chunks like cde in abcdef if we set read_length to 3 and start from zero offset.

The search strategy is byte by byte, not chunk by chunk. If you find a case where it doesn't work, please file a bug.
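To make the difference between the two strategies concrete, here is a plain-Ruby model of each, applied to the cde in abcdef case raised above. Both helpers are hypothetical and only model the behaviour described; they are not BinData's implementation.

```ruby
# Illustrative helpers only; both assume a non-empty pattern.

# Try a match at every byte offset, advancing one byte after each failure.
def find_byte_by_byte(data, pattern)
  (0..data.bytesize - pattern.bytesize).each do |offset|
    return offset if data.byteslice(offset, pattern.bytesize) == pattern
  end
  nil
end

# Try a match only at offsets that are multiples of the pattern length.
def find_chunk_by_chunk(data, pattern)
  step = pattern.bytesize
  (0..data.bytesize - step).step(step) do |offset|
    return offset if data.byteslice(offset, step) == pattern
  end
  nil
end

byte_hit  = find_byte_by_byte("abcdef", "cde")    # => 2
chunk_hit = find_chunk_by_chunk("abcdef", "cde")  # => nil
```

The chunked search only compares abc and def, so it misses the match, while the byte-by-byte search finds it at offset 2.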

stefan-kolb commented 8 years ago

Great, thanks for the explanation! :+1: