apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Check valid byte_range in parquet Column Chunk reading? #6255

Open mapleFU opened 3 months ago

mapleFU commented 3 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

See: https://github.com/apache/parquet-testing/pull/58#issuecomment-2290985490

When reading a corrupt file, currently, arrow-rs would have:

    /// Returns the offset and length in bytes of the column chunk within the file
    pub fn byte_range(&self) -> (u64, u64) {
        let col_start = match self.dictionary_page_offset() {
            Some(dictionary_page_offset) => dictionary_page_offset,
            None => self.data_page_offset(),
        };
        let col_len = self.compressed_size();
        assert!(
            col_start >= 0 && col_len >= 0,
            "column start and length should not be negative"
        );
        (col_start as u64, col_len as u64)
    }

Would it be better to check the range here?

Describe the solution you'd like

Check the range when building the group reader, or in `byte_range()` itself.
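As a sketch of what such a check could look like: the offsets come out of the thrift metadata as `i64`, so a validation step could reject negative values, guard the addition against overflow, and compare the end of the range against the file length. This is not the actual parquet-rs API; the free function, its `String` error type, and the `file_size` parameter are all assumptions for illustration.

```rust
/// Sketch: validate a column chunk's byte range against the file length.
/// `col_start` and `col_len` come from thrift metadata as i64;
/// `file_size` is a hypothetical parameter (parquet-rs would obtain it
/// from the reader). The error type here is a plain String for brevity.
fn validate_byte_range(
    col_start: i64,
    col_len: i64,
    file_size: u64,
) -> Result<(u64, u64), String> {
    // Reject negative offsets/lengths instead of panicking via assert!.
    if col_start < 0 || col_len < 0 {
        return Err("column start and length should not be negative".to_string());
    }
    let (start, len) = (col_start as u64, col_len as u64);
    // Guard the addition so a corrupt file cannot overflow u64.
    let end = start
        .checked_add(len)
        .ok_or_else(|| "column byte range overflows u64".to_string())?;
    // A range extending past the end of the file indicates corruption.
    if end > file_size {
        return Err(format!(
            "column byte range {start}..{end} exceeds file size {file_size}"
        ));
    }
    Ok((start, len))
}
```

Doing this once when building the group reader would keep `byte_range()` infallible, at the cost of threading the file size into the metadata validation path.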

Describe alternatives you've considered

    /// Returns the offset and length in bytes of the column chunk within the file
    pub fn byte_range(&self) -> Result<(u64, u64)> {
        let col_start = match self.dictionary_page_offset() {
            Some(dictionary_page_offset) => dictionary_page_offset,
            None => self.data_page_offset(),
        };
        let col_len = self.compressed_size();
        if col_start < 0 || col_len < 0 {
            return Err(ParquetError::General(
                "column start and length should not be negative".to_string(),
            ));
        }
        Ok((col_start as u64, col_len as u64))
    }

Additional context

mapleFU commented 3 months ago

@alamb I'm willing to try this, but I'm not so familiar with parquet-rs. Do you think this would be better checked in ColumnChunkMetaData::byte_range or ColumnChunkMetaData::from_thrift? Or is this already checked somewhere, so we don't need it?

alamb commented 3 months ago

> @alamb I'm willing to try this, but I'm not so familiar with parquet-rs. Do you think this would be better checked in ColumnChunkMetaData::byte_range or ColumnChunkMetaData::from_thrift? Or is this already checked somewhere, so we don't need it?

I am not sure -- I think the first thing we should do is get a reproducer. Let me see if I can whip up some tests

alamb commented 3 months ago

See https://github.com/apache/arrow-rs/issues/6261 / https://github.com/apache/arrow-rs/pull/6262

alamb commented 3 months ago

> Describe alternatives you've considered

That looks good to me, FWIW