OpenGeoscience / geonotebook

A Jupyter notebook extension for geospatial visualization and analysis
Apache License 2.0

Create a RasterData from in-memory 2D array #127

Open kilroy68 opened 7 years ago

kilroy68 commented 7 years ago

The RasterData class needs to support instantiation from an in-memory 2D array. During analysis it is common to have in-memory results that you don't want to write to a file before viewing. For instance, consider this code:

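from osgeo import gdal
import numpy as np

# Create a 100 x 200 (width x height), single-band dataset with gdal's in-memory driver
driver = gdal.GetDriverByName('MEM')
src_ds = driver.Create('', 100, 200, 1)
band = src_ds.GetRasterBand(1)

# Fill the band with random values; numpy arrays are (rows, cols), hence (200, 100)
ar = np.random.randint(0, 255, (200, 100))
band.WriteArray(ar)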

At this point, I should be able to instantiate a RasterData from the band I just created.

aashish24 commented 7 years ago

Thank you, @kilroy68, for posting this issue. We have talked about supporting in-memory raster data, but my understanding is that it is not a trivial task. I am going to ask @kotfic to provide more detail, but it is something of high interest to us as well.

kotfic commented 7 years ago

@kilroy68 The primary issue here is which process owns the memory allocated by gdal. Jupyter's architecture includes two separate process spaces: one for the Python execution environment (the kernel) and one for serving web assets (tornado). Cell execution takes place in the kernel process, while tile requests are handled by the tornado process. Producing tiles with tornado from in-memory kernel results will require some form of inter-process communication (ideally using shared memory between processes).
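
To make the shared-memory idea concrete, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module (Python 3.8+). It is not geonotebook code: `publish_raster` and `read_pixel` are hypothetical helper names, the raster is assumed to be row-major with one byte per pixel, and a real setup would still need some channel for telling the tile-serving process the block's name.

```python
from multiprocessing import shared_memory

ROWS, COLS = 200, 100  # illustrative raster shape (rows, cols)


def publish_raster(values, rows, cols):
    """Kernel side (hypothetical helper): copy a row-major raster of
    0..255 values into a newly created, named shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=rows * cols)
    # The OS may round the block size up, so write only the raster's bytes.
    shm.buf[:rows * cols] = bytes(values)
    return shm


def read_pixel(name, cols, row, col):
    """Server side (hypothetical helper): attach to the block by name and
    read a single pixel without copying the whole raster."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return shm.buf[row * cols + col]
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()


if __name__ == "__main__":
    values = [(r + c) % 256 for r in range(ROWS) for c in range(COLS)]
    block = publish_raster(values, ROWS, COLS)
    try:
        # In a real setup this call would run in the tile-serving process,
        # which only needs the block's name and the raster's dimensions.
        print(read_pixel(block.name, COLS, 10, 20))
    finally:
        block.close()
        block.unlink()
```

The reading side never receives the pixel data itself, only a name to attach to, which is what avoids copying the raster between processes. Lifetime management (who unlinks the block, and when) is left out of this sketch.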

The feature you're suggesting is a high priority for us, but we need to do more research to identify the best approach given the Jupyter architecture. If you (or others) have solved similar problems we'd love to hear about your approach!

kilroy68 commented 7 years ago

I don't have any experience in this space, but I would look at matplotlib's implementation. It has a similar problem in that the data I'm plotting is in the Python kernel, but it needs to produce interactive web visuals through tornado.

kotfic commented 7 years ago

matplotlib generates image data in the kernel process space, and Jupyter wraps a rendering backend to push those images to the client via a kernel <--- zeromq ---> tornado <--- websocket ---> client bridge. This is why you need to run %matplotlib inline (to wrap the renderer and let Jupyter know what to do with image-based return types). Creating a custom bridge using zeromq from the kernel to the tornado server for tile serving is a possibility, but there are two critical problems with this approach. Either:

  1. the data needs to be copied into the tornado process's addressable memory, preventing large amounts of data from being rendered effectively (one thing we've considered here is setting up an in-memory file system, which would resolve some other potentially show-stopping issues with mapnik), or
  2. the tiles need to be generated in the kernel, which will either prevent cell execution while tiles are being rendered or require running a separate threaded tile server inside the kernel (ugly, but maybe possible?). Even if that were feasible, we would still need to use mapnik for downsampling data to render tiles at different resolutions. While mapnik has a way of reading data from memory, it assumes you've allocated an empty memory container and are pushing data into that container. I don't believe it has mechanisms for wrapping data that is already in memory for downsampling and styling operations (but maybe this is something we could extend?).
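
As a rough illustration of the second option, the sketch below runs a small HTTP server on a background thread in the same process that holds the raster, so requests can be answered while other code keeps running. It uses only the Python standard library. Everything in it is an assumption made for illustration: `downsample` and `start_tile_server` are hypothetical names, the `/<factor>` URL scheme is invented, and plain block averaging stands in for the resampling and styling that mapnik would actually do.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def downsample(raster, factor):
    """Block-average a 2D list of 0..255 ints by an integer factor.
    Rows and columns that do not fill a whole block are dropped."""
    rows, cols = len(raster) // factor, len(raster[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [raster[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


def start_tile_server(raster):
    """Serve GET /<factor> as raw bytes of the downsampled raster from a
    daemon thread (hypothetical URL scheme, not a real tile protocol)."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                factor = int(self.path.strip("/"))
                if factor < 1:
                    raise ValueError(factor)
            except ValueError:
                self.send_error(400, "expected /<factor>")
                return
            small = downsample(raster, factor)
            body = bytes(v for row in small for v in row)
            self.send_response(200)
            self.send_header("Content-Type", "application/octet-stream")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep request logging out of the notebook output

    # Port 0 asks the OS for any free port on the loopback interface.
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    raster = [[(r + c) % 256 for c in range(100)] for r in range(200)]
    server = start_tile_server(raster)
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    conn.request("GET", "/4")
    print(len(conn.getresponse().read()))  # number of bytes in the reduced raster
    conn.close()
    server.shutdown()
    server.server_close()
```

In CPython the server thread still shares the interpreter lock with cell execution, so CPU-heavy rendering in pure Python would continue to compete with running cells. The sketch shows only that serving from inside the process is mechanically possible, not that it performs well.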

This is basically as far as we've gotten. There are still avenues to explore, and I'm hopeful we'll be able to come up with a solid solution so we can deliver this feature (you're not the only one who has asked!). But as I hope I've illustrated, it's a non-trivial effort to implement.