bovee / entab

* -> TSV
MIT License
21 stars 5 forks source link

Allow `read_into` parsers #20

Closed bovee closed 2 years ago

bovee commented 2 years ago

It should be possible to reuse the e.g. FastaRecord from iteration to iteration to avoid an allocation each time, but I'm having issues with lifetimes trying to write out a function that does this.

I tried the following in buffer.rs, but I kept having lifetime errors with both the state and ReadBuffer being still mutably borrowed in the next iteration of the loop.

pub fn next_into<'n, T>(
    &'n mut self,
    record: &mut T,
    mut state: <T as FromSlice<'n>>::State,
) -> Result<bool, EtError>
where
    T: FromSlice<'n> + 'n,
{
...
}
bovee commented 2 years ago

I did some more futzing inside chunk where some of the other lifetime issues aren't so bad and got to the following:

impl BufferChunk {
    ...
    pub fn next_into<'n, T>(
        &mut self,
        record: &mut T,
        buffer: &'n [u8],
        mut state: <T as FromSlice<'n>>::State,
    ) -> Result<bool, EtError>
    where
        T: FromSlice<'n>,
    {
        let consumed = self.consumed;
        match T::parse(
            &buffer[consumed..],
            self.eof,
            &mut self.consumed,
            &mut state,
        ) {
            Ok(true) => {}
            Ok(false) => return Ok(false),
            Err(e) => {
                if !e.incomplete || self.eof {
                    return Err(e.add_context(
                        buffer,
                        self.consumed,
                        self.record_pos,
                        self.reader_pos,
                    ));
                } else {
                    return Ok(false);
                }
            }
        }
        self.record_pos += 1;
        T::get(record, &buffer[consumed..self.consumed], &state)
            .map_err(|e| e.add_context(buffer, self.consumed, self.record_pos, self.reader_pos))?;
        Ok(true)
    }
}

And the following test:

    #[test]
    fn test_chunked_read_into() -> Result<(), EtError> {
        let f: &[u8] = include_bytes!("../tests/data/test.fastq");
        let (mut rb, mut state) = init_state::<FastqState, _, _>(f, None).unwrap();
        let mut seq_len = 0;
        while let Some((slice, mut chunk)) = rb.next_chunk()? {
            let mut record = FastqRecord::default();
            while !chunk.next_into(&mut record, slice, &mut state).map_err(|e| e.to_string())?
            {
                let FastqRecord { sequence, .. } = record;
                seq_len += sequence.len();
            }
        }
        assert_eq!(seq_len, 250000);
        Ok(())
    }

Throws the following:

error[E0499]: cannot borrow `state` as mutable more than once at a time
   --> entab/src/chunk.rs:179:56
    |
179 |             while !chunk.next_into(&mut record, slice, &mut state).map_err(|e| e.to_string())?
    |                                                        ^^^^^^^^^^ `state` was mutably borrowed here in the previous iteration of the loop
bovee commented 2 years ago

Okay, finally figured this out: the compiler can't tell that the references in the record we're taking to the buffer and the state aren't being clobbered the next time next_into is run (if a parser doesn't update all of the fields each time it runs, it could have a reference to an old state). I "fixed" this by some hacky transmuting and made the function unsafe (to warn potential users) and now everything works; the one gotcha is that the compiler can no longer tell if you try to keep a reference past the lifetime of an old state (in next_into only) so there may be undefined behavior if this happens.

Closed in c01cd4bdcdd5834a8c35a8a0a500ce25d16b5979.