SymbolixAU / mapdeck

R interface to Deck.gl and Mapbox
https://symbolixau.github.io/mapdeck/articles/mapdeck.html

Support GeoSpark #157

Open harryprince opened 5 years ago

harryprince commented 5 years ago

Hi mapdeck team: I saw that mapdeck is a great tool for visualizing large-scale data, and I found that GeoSparkViz does something similar: it can rasterize a large point dataset into an image file and pass it as a base64 string to Apache Zeppelin.

Indeed, I think RStudio is a better frontend than Apache Zeppelin for geospatial data scientists.

Would it be possible to integrate GeoSpark as a backend to render large-scale geospatial data in mapdeck?

I have already made a geospark R package, and I'd welcome further discussion.
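To make the idea concrete, here is a minimal sketch of the handoff I have in mind: aggregate the big data inside Spark via geospark, collect only the small result into R, and plot it with mapdeck. The table name "pickups", its columns, and the MAPBOX_TOKEN environment variable are illustrative assumptions, not a working integration:

library(sparklyr)
library(geospark)
library(dplyr)
library(mapdeck)

sc <- spark_connect(master = "local")
register_gis(sc)

# hypothetical point table already loaded into Spark
agg <- tbl(sc, "pickups") %>%
  group_by(lon = round(lon, 3), lat = round(lat, 3)) %>%  # snap points to a coarse grid
  summarise(n = n()) %>%
  collect()  # only the aggregated rows cross into the R session

mapdeck(token = Sys.getenv("MAPBOX_TOKEN")) %>%
  add_scatterplot(data = agg, lon = "lon", lat = "lat",
                  radius = 10, fill_colour = "n")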

References

https://datasystemslab.github.io/GeoSpark/tutorial/viz/

SymbolixAU commented 5 years ago

Hi @harryprince , This is an interesting idea and I'm keen to know more about how you think GeoSpark can be used as a backend to mapdeck. Do you have an example of the workflow you have in mind? How would the data get from GeoSpark to mapdeck?

harryprince commented 5 years ago

Here is a great example, borrowed from the official GeoSpark documentation. I think building the viz from code is better than using Apache Zeppelin, because it is more reproducible.

Here is another example showing how GeoSparkViz renders a very large raster image.

harryprince commented 5 years ago

@SymbolixAU Real-time geospatial monitoring is now supported by geospark and sparklyr, and I wish it could be visualized with shiny and mapdeck.

Here is an example code snippet.

library(future)
library(sparklyr)
library(geospark)
library(dplyr, warn.conflicts = FALSE)

sc <- spark_connect(master = "local", spark_version = "2.2.0")
register_gis(sc)

if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)

stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source") 

process_stream <- read_folder %>%
 mutate(a = x*0.02, b = x*0.02) %>%
 mutate(y = ST_AsText(st_point(a,b))) %>%
  mutate(x = as.double(x)) %>%
  ft_binarizer(
    input_col = "x",
    output_col = "over",
    threshold = 400
  )

write_output <- stream_write_csv(process_stream, "source-out")
invisible(future(stream_generate_test(interval = 0.2, iterations = 100)))
Inspecting one of the streamed output files:

$ cat source-out/part-00000-afb2798b-44e2-4a33-ba71-32681da14096-c000.csv
x,a,b,y,over
1.0,0.02,0.02,POINT (0.02 0.02),0.0
2.0,0.04,0.04,POINT (0.04 0.04),0.0
3.0,0.06,0.06,POINT (0.06 0.06),0.0
4.0,0.08,0.08,POINT (0.08 0.08),0.0
5.0,0.10,0.10,POINT (0.1 0.1),0.0
6.0,0.12,0.12,POINT (0.12 0.12),0.0
7.0,0.14,0.14,POINT (0.14 0.14),0.0
8.0,0.16,0.16,POINT (0.16 0.16),0.0
9.0,0.18,0.18,POINT (0.18 0.18),0.0
10.0,0.20,0.20,POINT (0.2 0.2),0.0
11.0,0.22,0.22,POINT (0.22 0.22),0.0
12.0,0.24,0.24,POINT (0.24 0.24),0.0
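From here the streamed output could already be handed to mapdeck on the R side. A minimal sketch, assuming the source-out folder produced above and a Mapbox token in the hypothetical MAPBOX_TOKEN environment variable:

library(mapdeck)

# read whatever the stream has written so far
files <- list.files("source-out", pattern = "\\.csv$", full.names = TRUE)
pts <- do.call(rbind, lapply(files, read.csv))

mapdeck(token = Sys.getenv("MAPBOX_TOKEN")) %>%
  add_scatterplot(
    data = pts, lon = "a", lat = "b",  # a and b hold the point coordinates
    radius = 50, fill_colour = "over"
  )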

Here is another Uber trip example:

https://vivekkatial.shinyapps.io/uber_shiny

SymbolixAU commented 5 years ago

Would the idea be to take the process_stream object directly and plot it on a map, rather than writing to disk?

harryprince commented 5 years ago

@SymbolixAU You can write the stream data into memory instead of to disk:

process_stream %>%
  stream_write_memory("urls_stream", mode = "complete")

And it supports sparklyr::reactiveSpark() for further visualization in shiny; see the RStudio blog release note.
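A minimal sketch of that wiring (urls_stream is the in-memory sink registered above; intervalMillis is reactiveSpark()'s polling interval):

# inside a Shiny server: poll the in-memory stream once per second
stream <- reactiveSpark(
  tbl(sc, "urls_stream"),
  intervalMillis = 1000
)
# stream() can then be used like any other Shiny reactive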

If you are not familiar with sparklyr, here is an awesome-sparklyr collection: https://github.com/harryprince/awesome-sparklyr

SymbolixAU commented 5 years ago

Could you give me a dput() output of example data, say only 5 lines, generated by

process_stream %>%
  stream_write_memory("urls_stream", mode = "complete") %>%
  dput()

Or whatever way is best to generate it?

harryprince commented 5 years ago

@SymbolixAU Here is an example code snippet:

# load the example WKT polygons and points shipped with geospark
polygons_wkt <- read.table(system.file("examples/polygons.txt", package = "geospark"), sep = "|")
points_wkt <- read.table(system.file("examples/points.txt", package = "geospark"), sep = "|")

# stream the points and parse the WKT column into a geometry
stream_generate_test(points_wkt, "source/")
point_stream <- stream_read_csv(sc, "source/", delimiter = ",") %>%
  mutate(geom = st_geomfromwkt(geom))

# the polygons are small, so copy them to Spark as a static table
# (this must copy polygons_wkt, not points_wkt; copy_to also registers the table name)
polygon_sdf <- copy_to(sc, polygons_wkt, "polygon_sdf")
point_stream %>% sdf_register("point_stream")

# spatial join: count streamed points that fall inside each polygon
stream <- tbl(sc, sql("
  SELECT area, state, count(*) AS cnt
  FROM polygon_sdf
  INNER JOIN point_stream
    ON ST_Contains(polygon_sdf.geom, point_stream.geom)
  GROUP BY area, state")) %>%
  reactiveSpark()

library(shiny)

ui <- fluidPage(DT::dataTableOutput("a"))

server <- function(input, output) {
  output$a <- DT::renderDataTable({  # DT::dataTableOutput needs DT's render function
    stream() %>%
      as.data.frame() %>%
      DT::datatable()
  })
}

shinyApp(ui = ui, server = server)
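And to bring this back to mapdeck, a sketch of the same app drawing the joined counts on a map instead of a table. It assumes polygons_local, a hypothetical local data.frame holding area, state, and a geom_wkt column with the polygon outlines, plus a Mapbox token in MAPBOX_TOKEN:

library(shiny)
library(mapdeck)
library(sf)

ui <- fluidPage(mapdeckOutput("map"))

server <- function(input, output) {
  output$map <- renderMapdeck({
    mapdeck(token = Sys.getenv("MAPBOX_TOKEN"))
  })
  observeEvent(stream(), {
    # join streaming counts back to the local polygon WKT and rebuild sf geometries
    counts <- as.data.frame(stream())
    shapes <- merge(counts, polygons_local, by = c("area", "state"))
    sf_poly <- st_as_sf(shapes, wkt = "geom_wkt")
    mapdeck_update(map_id = "map") %>%
      add_polygon(data = sf_poly, fill_colour = "cnt")
  })
}

shinyApp(ui = ui, server = server)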
SymbolixAU commented 5 years ago

which libraries are these functions from?

harryprince commented 5 years ago

> which libraries are these functions from? ...

@SymbolixAU The stream_* functions come from sparklyr and the st_* functions come from geospark.
