bovee / entab

* -> TSV
MIT License
21 stars 5 forks source link

Rewrite to remove `refill`s from parsers #15

Closed bovee closed 2 years ago

bovee commented 2 years ago

Right now there's a pattern we use like the following (simplified from the FASTA parser):

    let (start, end) = loop {
        if let Some(p) = memchr(...) {
            break ...;
        } else if rb.eof() {
            return Err(...);
        }
        rb.refill()?;
    };

The advantage of this is that we don't have to reparse the record out if we need to refill (because we don't exit out and then reenter once the buffer is refilled), but it also means that the lifetime of a record returned from the parser is fixed to right after refill is called (every time we parse, we could potentially wipe the underlying buffer and invalidate all previous records). Theoretically, this means we save time whenever we refill, but at the cost of code clarity and some lifetime constraints. The amount of time saved may not be that great for larger buffers or for mmaped files though.

If we moved the refill logic out of the parser into the caller and returned a sentinel value instead, we could simplify to:

    if let Some(p) = memchr(...) {
        (start, end) = ...
    } else if rb.eof() {
        return Err(...);
    } else {
        return Ok(Incomplete)
    }

This is much closer to how parser libraries like nom handle these situations so it probably won't have a noticeable performance impact? It would require a large rewrite of the parser code and all of the parsers, but I don't think many people are using entab right now that would object to the breakage.

bovee commented 2 years ago

So I wasted a week or so futzing with this and I think it's basically impossible with Rust's lifetime model.

I rewrote some parsers to implement the following (note, buffer is immutable now and we're only modifying buffer_state which tracks things like how many bytes are consumed):

pub trait FromSlice<'b>: Sized + Default {
    type State;

    fn get(
        &mut self,
        buf: &'b [u8],
        buffer_state: &mut BufferState,
        parser_state: &mut Self::State,
    ) -> Result<Option<usize>, EtError>;
}

And then tried to rewrite the buffer next function like:

    fn next<'r, T>(&'r mut self, buffer_state: &mut BufferState, mut state: <T as FromSlice<'r>>::State) -> Result<Option<T>, EtError>
    where
        T: FromSlice<'r> + Default,
    {
        let mut record = T::default();
        loop {
            let buffer = &self.as_ref()[buffer_state.consumed..];
            match T::get(&mut record, &buffer, buffer_state, &mut state) {
                Ok(Some(consumed)) => {
                    buffer_state.consumed_record += consumed;
                    buffer_state.consumed = buffer_state.consumed_record;
                    buffer_state.record_pos += 1;
                    break;
                },
                Ok(None) => return Ok(None),
                Err(e) => {
                    // I also added an "incomplete" flag to EtError to allow the parsers to signal that they needed the buffer to be refilled.
                    if !e.incomplete || buffer_state.eof {
                        return Err(e);
                    }
                },
            }
            if !self.refill(buffer_state)? {
                return Ok(None);
            }
        }
        Ok(Some(record))
    }

This doesn't work at all though because Rust can't tell that the immutable slice borrow that T::get takes isn't returned by the time we get to the mutable borrow to update the buffer. Basically, the associated lifetime on FromSlice is capturing the lifetime from self so we can't do anything further, even though that lifetime should be invalid if we haven't parsed out a value.

I'm not really sure there's a good way around this either? An alternative could be for each parser to first try to parse out a record and return a true/false, if false then refill the buffer and try again, and then finally pull out a full record. This would require either some amount of double parsing or some kind of "parsing guide" that the try step can save for the final step. Either would probably create some kind of speed hit.