NASA-IMPACT / veda-jupyterhub

VEDA JupyterHub technical planning and documentation

EPIC: Improve frontend map visualization in Jupyter Notebooks #13


batpad commented 8 months ago

Currently, we have a few different ways to display (raster and vector) data and the results of analysis on a map in a notebook.

All the Leaflet-based maps use GeoJSON as the transfer format for vector data. While this has historically been "the way to do things", it imposes quite severe limitations when rendering large vector datasets: one hits the limits both of what the browser can JSON.parse and of what traditional mapping libraries like Leaflet can render.
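For concreteness, this is the GeoJSON transfer pattern in question, as a minimal ipyleaflet sketch (the file name and map center are hypothetical):

```python
# Minimal sketch of the GeoJSON-based pattern described above. The whole
# dataset is serialized to JSON, shipped to the browser, and JSON.parse'd
# there before Leaflet can draw a single feature.
import json
from ipyleaflet import Map, GeoJSON

with open("parcels.geojson") as f:  # hypothetical file
    data = json.load(f)

m = Map(center=(38.9, -77.0), zoom=10)
m.add_layer(GeoJSON(data=data))  # browser must parse the full payload
m
```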

Advances in browser standards and technologies, GPU hardware, and standards around "cloud-native" vector formats now allow us to render large vector datasets much more efficiently, by processing compressed binary data and rendering directly on the GPU, bypassing the JSON parse step in JavaScript and fully leveraging the power of modern GPU hardware. We have been working on lonboard, a Jupyter widget that leverages these technologies to scale to rendering millions of vector data points on an interactive map, as opposed to thousands with a GeoJSON approach.
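For comparison, the equivalent lonboard flow keeps the data in a compressed binary (Arrow) representation end to end. A minimal sketch, with a hypothetical file name; the exact layer API should be checked against the lonboard docs:

```python
import geopandas as gpd
from lonboard import Map, ScatterplotLayer

# GeoParquet keeps the data compressed and columnar end to end
gdf = gpd.read_parquet("observations.parquet")  # hypothetical file

# Layer data is uploaded to the GPU as binary buffers, with no GeoJSON
# serialization or JSON.parse step in the browser
layer = ScatterplotLayer.from_geopandas(gdf)
Map(layer)
```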

Proposal

lonboard provides us with a very scalable base to build on top of, and is built on the deck.gl frontend mapping library.

In an ideal world, we would have a preferred, "blessed" way to render maps in a notebook, and not leave a user with 6 different choices with 6 different APIs, leaving it up to them to figure out what the trade-offs of each are. It would also be helpful to then have a single map object on the page that is bidirectionally interactive between Python and JavaScript, that gives users the features they need and scales to large datasets. Unfortunately, this means that we would need to converge / standardize on a common base mapping library. My recommendation here would be to standardize on lonboard / deck.gl, as we know that it scales, and then it's "just" a matter of porting over convenience features from other libraries. It would be much harder, if not impossible, to port over the scalability of lonboard to existing libraries, because of the underlying technologies used.
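To illustrate the kind of bidirectional interactivity meant here, a sketch reusing `layer` from the snippet above (trait names are assumptions and may differ between lonboard versions):

```python
# Python -> JS: lonboard layer properties are traitlets, so assigning
# to them restyles the already-rendered map in place
layer.radius_min_pixels = 4
layer.get_fill_color = [200, 30, 30]

# JS -> Python: widget state syncs back into the Python model and can
# be observed like any other traitlet
layer.observe(lambda change: print(change["new"]), names="get_fill_color")
```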

This would take some work on the lonboard side to incorporate a lot of useful features that, e.g., a library like stac_ipyleaflet provides. We would be able to re-use a lot of code, but it would still be a fair bit of work, as the underlying mapping library is different and things would need to be ported over.

I'm going to gather feedback on this ticket on whether this approach makes sense, before writing out more detailed tickets about features we would want to incorporate into lonboard.

Would much appreciate any thoughts on whether this seems like a good idea and worth investing in, things to be aware of, and any other suggestions / ideas that can help inform this decision and frame a path forward to consolidate the map visualization experience in notebooks.

cc @kylebarron @emmalu @sandrahoang686 @geohacker @wildintellect @j08lue @aboydnw @abarciauskas-bgse @yuvipanda (please tag anyone else that should see this / provide feedback here).

aboydnw commented 8 months ago

Thanks @batpad, this seems like a great coming-together of technologies for us. I tried to pull out the user problems we are attempting to solve with this work. Can you confirm whether I am understanding them correctly? It might be helpful to start with these and talk about their severity if we need to make a pitch to IMPACT.

  1. Current notebook visualization methods struggle with large vector datasets (how large? and how often do people hit this limit?)
  2. There is a wide array of options for notebook visualization methods, without much direction for users deciding which method is best for them (is the assumption that there are too many options? Or just that users need help deciding?)

I also feel like there might be a user problem in this statement, but I'm not knowledgeable enough on their workflows to articulate it well enough:

It would also be helpful to be able to then have a single map object on the page that is bidirectionally interactive between python and javascript that gives users the features they need

kylebarron commented 8 months ago
  • Current notebook visualization methods struggle with large vector datasets (how large? and how often do people hit this limit?)

I think with Leaflet the max is probably around 50,000 features? I've never profiled Leaflet exactly for this. With lonboard it depends only on the amount of GPU memory on the user's machine and their internet speed for accessing the compressed data. My computer (a MacBook Pro M2) has successfully rendered 5-10 million coordinates, though it gets a bit laggy above 5 million. I know someone on Twitter (with a good GPU) who was able to render 20 million building footprints without any lag.
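As a rough sanity check on why GPU memory is the binding constraint, a back-of-the-envelope estimate, assuming deck.gl-style float32 x/y position buffers (actual usage varies with attributes and geometry type):

```python
# Hypothetical point count matching the 10M-coordinate figure above
n_coords = 10_000_000
bytes_per_coord = 2 * 4  # x and y, each a 4-byte float32
print(f"{n_coords * bytes_per_coord / 1e6:.0f} MB of positions")  # -> 80 MB
```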

how often do people hit this limit?

I'd argue that today not many people hit this limit... because they don't even try. They don't think it's possible and so it changes their behavior. But if people learn that they're able to visualize larger quantities of data, they're more likely to be excited about it.

j08lue commented 8 months ago

Possible use cases are the vector data exploration interfaces we already have on VEDA and the GHG Center.

They are currently not limited by the size of data we want to analyze, but it would be great to accompany the browser-based interfaces we already have with JupyterHub-based ones that allow users to analyze the data even further and "script" the UI.

wildintellect commented 8 months ago

There are many more options, some static and some dynamic: matplotlib, seaborn, cartopy, HoloViz (which contains Datashader and other options).

Static - Painful when you just want to see and explore, because you have to write so much code to make any plot beyond a simple single layer, and then you can't interact with the data. But it is what you want when you need to make publication maps, PDFs, etc. We should not be overly concerned with static maps at this time.

Dynamic - Quick and interactive, but very clunky at loading big data.

  • Case 1: loading a local raster/array over a dynamic map
  • Case 2: loading large amounts of vector data; I actually think the user experience gets bad way before the 50,000 features @kylebarron mentioned
  • Cases 1 & 2 together
  • Case 3: using web services for rasters and vectors <- this probably works the best as is (see the sketch below)

Other considerations (that make QGIS great): projection support (science needs more than Web Mercator), styling adjustments without 20 lines of code, and a matching legend. Being able to export styles etc. to be applied to static map generation.
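For reference, the Case 3 pattern, where a web service renders tiles and the browser only displays them, as a minimal ipyleaflet sketch (the tiler URL is hypothetical):

```python
from ipyleaflet import Map, TileLayer

m = Map(center=(38.9, -77.0), zoom=6)
# A dynamic tiler (e.g. TiTiler) does the heavy lifting server-side;
# the browser only receives pre-rendered PNG tiles
m.add_layer(TileLayer(url="https://example.com/cog/tiles/{z}/{x}/{y}.png"))
m
```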

emmalu commented 8 months ago

Great ideas here, @batpad et al! I agree that converging stac_ipyleaflet + lonboard would make for an ideal user and developer experience.

Some follow-up questions though:

kylebarron commented 8 months ago

I actually think the user experience gets bad way before the 50,000 features

@wildintellect have you used ipyleaflet much? Do you have any ballpark number in your mind for at what number of features the user experience suffers?

Even if most users don't think they need the 'power' of GPU-based rendering of large vector datasets (as @kylebarron mentioned above ⬆️), perhaps that's a self-imposed limitation that they can break out of in the future.

💯 It's sort of my premise that most people wouldn't even list this as a problem they have, because it's not even on their radar that it's possible.