georust / gdal

Rust bindings for GDAL
https://crates.io/crates/gdal
MIT License
339 stars 92 forks source link

`Rasterband::write` with `Buffer<T>` forces unnecessary data copy. #453

Open metasim opened 9 months ago

metasim commented 9 months ago

When working with large rasters, minimizing the number of data copies is important. The Rasterband::write method as it currently stands, forces some users into an additional buffer copy.

To illustrate, consider the following use case (pseudo-ish code):

let mut data: Vec<u8> = vec![0u8; <some big number>];
for i in 0..100 {
   modify_data(&mut data);
   write_data_with_gdal(&data);
}

Seems reasonable, no? And let's assume for external reasons we only have control over the implementation of write_data_with_gdal. Problem arises when we try to implement it:

fn write_data_with_gdal(data: &Vec<u8>) {
  let driver = DriverManager::get_driver_by_name("GTiff").unwrap();
  let ds = driver.create(...).unwrap();
  let mut rb = ds.rasterband(1).unwrap();
  rb.write(/*... what to do here? ...*/).unwrap();
}

The Rasterband::write method looks like this:

pub fn write<T: GdalType + Copy>(
        &mut self,
        window: (isize, isize),
        window_size: (usize, usize),
        buffer: &Buffer<T>,
    ) -> Result<()> {

It wants the data as as a reference to a Buffer<T>, which looks like this:

pub struct Buffer<T: GdalType> {
    pub size: (usize, usize),
    pub data: Vec<T>,
}

The contained Vec is owned by Buffer::data. We only have a &Vec<u8>. So to call write, even though write only needs a Buffer<T> reference, the only way to create the referent is to clone our &Vec<u8> parameter:

fn write_data_with_gdal(data: &Vec<u8>) {
  let driver = DriverManager::get_driver_by_name("GTiff").unwrap();
  let ds = driver.create(...).unwrap();
  let mut rb = ds.rasterband(1).unwrap();
  let dims = (data.len(), 1);
  let buf = Buffer { size: dims, data: data.clone() }; // <--- no way to avoid `clone()`
  rb.write((0, 0), dims, &buf).unwrap();
  // `buf` is dropped...
}

I'd argue we need an alternative to this write method for users not wishing to incur the additional memory and time overhead. One might argue that the user should only work with Buffer<T> instances, but given that working interoperability with other Vec<T>-based APIs is very common in the raster space, we should support writing a Vec<T> without the clone() cost.

lnicola commented 7 months ago

For anyone running into this, you can use the following idiom as a workaround:

fn write_data_with_gdal(data: Vec<u8>) -> Vec<u8> { // pass by value and return back
  let driver = DriverManager::get_driver_by_name("GTiff").unwrap();
  let ds = driver.create(...).unwrap();
  let mut rb = ds.rasterband(1).unwrap();
  let dims = (data.len(), 1);
  let buf = Buffer { size: dims, data: data }; // no clone()
  rb.write((0, 0), dims, &buf).unwrap();
  let Buffer { data, .. } = buf;
  data
}
metasim commented 6 months ago

@lnicola What if you have a &[u8]... Is there are workaround for that?

lnicola commented 6 months ago

No, we'd need two buffer variants or a view over owned or borrowed data.

metasim commented 6 months ago

Would you be opposed to a RasterBand::write_slice method?

lnicola commented 6 months ago

We'd need another one for the block function(s) though.