dcooley / sfheaders

Build sf objects from R and Rcpp
https://dcooley.github.io/sfheaders/

sf_ - to_list #70

Closed dcooley closed 4 years ago

dcooley commented 4 years ago

the reverse of #69 - keep all the rows and make it a list column, where each list element has the same length as its sfg_id group

list_columns can be R_NilValue (or NULL).

I don't want to pass another value through all the sf_() functions. So maybe I need to make an sfheaders::api:: which will switch on the type and return before the create_df() call, so I can then pass in the list_columns...

TODO

dcooley commented 4 years ago

@mdsumner I've added a new argument to the sf_ functions called list_columns, which you use to specify any columns of the df where you want to keep all the values

df <- data.frame(
  id = c(1,1,2,2)
  , x = 1:4
  , y = 1:4
  , val = letters[1:4]
  , stringsAsFactors = FALSE
)

sf_linestring(
  obj = df
  , x = "x"
  , y = "y"
  , linestring_id = "id"
  , keep = TRUE
  , list_columns = "val"
)

#   id  val   geometry
# 1  1 a, b 1, 2, 1, 2
# 2  2 c, d 3, 4, 3, 4
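Conceptually, the new list_columns output is just the column split by the geometry id. A base-R sketch of the grouping (not the sfheaders implementation):

```r
df <- data.frame(
  id = c(1,1,2,2)
  , val = letters[1:4]
  , stringsAsFactors = FALSE
)

## each list element holds all the values for one linestring id
split(df$val, df$id)
#> $`1`
#> [1] "a" "b"
#>
#> $`2`
#> [1] "c" "d"
```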

vs the current behaviour

sf_linestring(
  obj = df
  , x = "x"
  , y = "y"
  , linestring_id = "id"
  , keep = TRUE
)

## here in 'val', only the first row of the `df$val` is kept
#   id val   geometry
# 1  1   a 1, 2, 1, 2
# 2  2   c 3, 4, 3, 4
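For comparison, the keep = TRUE behaviour on non-coordinate columns amounts to taking the first row per id. A base-R sketch (not the sfheaders code path):

```r
df <- data.frame(
  id = c(1,1,2,2)
  , val = letters[1:4]
  , stringsAsFactors = FALSE
)

## only the first row of each id group survives
df[ !duplicated(df$id), ]
#>   id val
#> 1  1   a
#> 3  2   c
```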

Do you foresee any issues if I add this arg to your sf-helper functions?

mdsumner commented 4 years ago

That's interesting. I don't think it can be a problem, but I'd need to try it with a few things.

It's a strange one; effectively you are allowing vertex attributes beyond x, y, z, m - by storing them one-to-one "elsewhere" (which is what every spatial system does when butting up against the limitations of SF).

This is very complementary to PATH in silicate, which tries to keep these vertex attributes and in theory allows any set of them to be stored in the $vertex table - but it's only really functional for track data (with date-time, depth, temperature, whatever was measured) and not mature. The issue is that $vertex is normalized (deduplicated) in some geometry xy, or xyz, or xyt etc., and so other attributes (at a higher dimensionality along with x, y, with no inherent duplication) end up on the $path_link_vertex table. It's powerful, I think, but not developed.

(Heh, sorry, but this might help bridge that model).

I'll try to have a look soonish. Thanks for the heads up!

mdsumner commented 4 years ago

Oh, do you have a real-world example data set that's relevant? Pretty much any tracking data is, and (not a great example but relevant) the walrus818 in trip:

d <- tibble::as_tibble(trip::walrus818)

# A tibble: 10,558 x 6
   Deployment DataDT                Wet Forage X_AED170_70 Y_AED170_70
        <int> <dttm>              <int>  <int>       <dbl>       <dbl>
 1        353 2009-09-15 04:00:00     1      0      281017       22532
 2        353 2009-09-15 05:00:00     0      0      281399       22392
...

sf_linestring(d, x = "X_AED170_70", y = "Y_AED170_70", z = "DataDT", linestring_id = "Deployment")

Here list_columns would be c("Wet", "Forage").

Q: does keep remain? So we don't have to specify the columns, "just do it"?

mdsumner commented 4 years ago

crs for Walrus818 fwiw:

crsmeta::crs_proj(trip::walrus818)
[1] "+proj=aeqd +ellps=WGS84 +lon_0=-170 +lat_0=70"
mdsumner commented 4 years ago

Q: does keep remain? So we don't have to specify the columns, "just do it"?

Oh I see, list-columns are separate, and compatible with keep - very good. (at first I thought you only added one column with 1 or more vectors in each list)

dcooley commented 4 years ago

Oh, do you have a real-world example data set that's relevant? Pretty much any tracking data

No, nothing specific, but my use-case is the multi-coloured path and polygon vertices, and multi-coloured trips (e.g. to represent loads on public transport)

effectively you are allowing vertex attributes beyond x, y, z, m

yep, exactly.

mdsumner commented 4 years ago

All good! Had a bit of refresh on this concept and what I've tried before, this will be a sweet mechanism, with lots of implications across the tools I'm working on.

mdsumner commented 4 years ago

Oh, so one thing to consider - should the "extra" attributes be bundled into a single list column, where each element is a dataframe? Then it's only a do.call(rbind, df$listcol) away and no name handling to do: dplyr::bind_rows(df$listcol, .id = "feature_id") also does the same and makes id-ing the feature easier.

I'm not of a strong opinion, possibly I'd prefer data frames in each element (even if only one column) than a raw list.
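For what it's worth, unbundling such a nested column back to a flat table really is one base R call. A sketch with made-up values:

```r
## two features, each carrying a small data.frame of vertex attributes
listcol <- list(
  data.frame(temp = c(0.1, 0.2))
  , data.frame(temp = c(2.4, 2.8))
)

## flatten back to one table, no name handling needed
do.call(rbind, listcol)
#>   temp
#> 1  0.1
#> 2  0.2
#> 3  2.4
#> 4  2.8
```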

dcooley commented 4 years ago

should the "extra" attributes be bundled into a single list column, where each element is a dataframe?

Do you have an example of this, I can't quite visualise what you mean?

mdsumner commented 4 years ago

Took a bit more than I thought, but this is what I mean:

 library(sfheaders)
 d0 <- data.frame(x = 1:10, y = 1:10, linestring_id = 1)

 d <- rbind(d0, 2 * d0)
 d$temp <- sort(c(runif(nrow(d)/2),
                  runif(nrow(d)/2, 2, 5)))
 d$fac <- sample(5:15, nrow(d), replace = TRUE)
 d$time <- Sys.time() + seq_len(nrow(d)) * 2

 datcols <- c("temp", "fac", "time")
 ## current scheme for list_columns = c("temp", "fac", "time")
 x <- sf_linestring(d, x = "x", y = "y", linestring_id = "linestring_id", 
                    keep = TRUE,
                    list_columns = datcols)
 ## each are separate lists
 x$temp
#> [[1]]
#>  [1] 0.1012976 0.1808249 0.3301957 0.3396338 0.3452928 0.4666439 0.5471586
#>  [8] 0.5992981 0.7777070 0.7850789
#> 
#> [[2]]
#>  [1] 2.435679 2.778072 2.801986 2.979986 3.047628 3.705985 3.831921 4.083824
#>  [9] 4.682265 4.910941
 x$fac 
#> [[1]]
#>  [1]  6  8  9 10 15 15 15  9  9  9
#> 
#> [[2]]
#>  [1]  7  7 15  6 10  8 13  9 15  9
 x$time
#> [[1]]
#>  [1] "2020-03-29 09:49:23 AEDT" "2020-03-29 09:49:25 AEDT"
#>  [3] "2020-03-29 09:49:27 AEDT" "2020-03-29 09:49:29 AEDT"
#>  [5] "2020-03-29 09:49:31 AEDT" "2020-03-29 09:49:33 AEDT"
#>  [7] "2020-03-29 09:49:35 AEDT" "2020-03-29 09:49:37 AEDT"
#>  [9] "2020-03-29 09:49:39 AEDT" "2020-03-29 09:49:41 AEDT"
#> 
#> [[2]]
#>  [1] "2020-03-29 09:49:43 AEDT" "2020-03-29 09:49:45 AEDT"
#>  [3] "2020-03-29 09:49:47 AEDT" "2020-03-29 09:49:49 AEDT"
#>  [5] "2020-03-29 09:49:51 AEDT" "2020-03-29 09:49:53 AEDT"
#>  [7] "2020-03-29 09:49:55 AEDT" "2020-03-29 09:49:57 AEDT"
#>  [9] "2020-03-29 09:49:59 AEDT" "2020-03-29 09:50:01 AEDT"

 ## but 1) could be a nested data frame
 x2 <- x
 x2$data <- split(d[datcols], d$linestring_id)[unique(d$linestring_id)]

 ## (or 2) a list, closer to current scheme)
 #x2$data <- lapply(split(d[datcols], d$linestring_id)[unique(d$linestring_id)], as.list)
 x2[datcols] <- NULL

## each separate column is bundled in the one list-column

x2$data
#> $`1`
#>         temp fac                time
#> 1  0.1012976   6 2020-03-29 09:49:23
#> 2  0.1808249   8 2020-03-29 09:49:25
#> 3  0.3301957   9 2020-03-29 09:49:27
#> 4  0.3396338  10 2020-03-29 09:49:29
#> 5  0.3452928  15 2020-03-29 09:49:31
#> 6  0.4666439  15 2020-03-29 09:49:33
#> 7  0.5471586  15 2020-03-29 09:49:35
#> 8  0.5992981   9 2020-03-29 09:49:37
#> 9  0.7777070   9 2020-03-29 09:49:39
#> 10 0.7850789   9 2020-03-29 09:49:41
#> 
#> $`2`
#>        temp fac                time
#> 11 2.435679   7 2020-03-29 09:49:43
#> 12 2.778072   7 2020-03-29 09:49:45
#> 13 2.801986  15 2020-03-29 09:49:47
#> 14 2.979986   6 2020-03-29 09:49:49
#> 15 3.047628  10 2020-03-29 09:49:51
#> 16 3.705985   8 2020-03-29 09:49:53
#> 17 3.831921  13 2020-03-29 09:49:55
#> 18 4.083824   9 2020-03-29 09:49:57
#> 19 4.682265  15 2020-03-29 09:49:59
#> 20 4.910941   9 2020-03-29 09:50:01

Created on 2020-03-29 by the reprex package (v0.3.0)

mdsumner commented 4 years ago

fwiw, it's what nest() does

library(tidyverse)
 d %>% select(-x, -y) %>% group_by(linestring_id) %>% nest(data = datcols)
# A tibble: 2 x 2
# Groups:   linestring_id [2]
  linestring_id data             
          <dbl> <list>           
1             1 <tibble [10 × 3]>
2             2 <tibble [10 × 3]>

And it's what unjoin() does, but by separating and de-duplicating the coords if possible (a core of silicate):

unjoin::unjoin(d, x, y, linestring_id)
SymbolixAU commented 4 years ago

Ok I see what you're suggesting.

However, my argument for keeping them separate is so you can run colourvalues over each individual list.

colourvalues::colour_values(x$temp)
[[1]]
 [1] "#440154FF" "#45065AFF" "#460A5DFF" "#471265FF" "#48176AFF" "#482273FF" "#472D7BFF" "#46337FFF" "#463480FF"
[10] "#433D85FF"

[[2]]
 [1] "#26818EFF" "#25858EFF" "#20928CFF" "#1F988BFF" "#1E9D89FF" "#57C766FF" "#B2DD2DFF" "#B3DD2DFF" "#C8E020FF"
[10] "#FDE725FF"

Which means it can go straight into mapdeck ;)

mdsumner commented 4 years ago

Ah ok, no worries. Appreciate the discussion has been clarifying :)

SymbolixAU commented 4 years ago

And thinking about it more, I see things like 'nesting' as very much in the realm of other packages (dplyr, data.table), and I think it would be a fair bit of effort to get it "right", when they already do it.

library(data.table)

dt <- as.data.table( d )
dt[
  , .(
    geometry = sfheaders::sfc_linestring(obj = .SD, x = "x", y = "y")
    , data = list(.SD[, setdiff(names(.SD), c("x","y")), with = FALSE])
  )
  , by = .(linestring_id)
]
#    linestring_id                           geometry         data
# 1:             1 LINESTRING (1 1, 2 2, 3 3, ...,... <data.table>
# 2:             2 LINESTRING (2 2, 4 4, 6 6, ...,... <data.table>

Out of interest, what would be the dplyr equivalent of this data.table code?

I don't really speak dplyr, so I've got as far as

library(tidyverse)
d %>% 
  group_by(linestring_id) %>% 
  summarise(
    geometry = sfheaders::sfc_linestring(., "x", "y")
    )
SymbolixAU commented 4 years ago

But thinking some more... maybe it wouldn't be so hard...

We could have a list_data argument, as well as list_columns, which will combine the lists into your suggested data ...
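A base-R sketch of what a hypothetical list_data argument could do internally - zip the separate per-feature lists into one list column of data.frames (the argument name and shape here are assumptions, not implemented):

```r
## per-feature lists, as produced by list_columns
temp <- list(c(0.1, 0.2), c(2.4, 2.8))
fac  <- list(c(6L, 8L), c(7L, 15L))

## combine element-wise into one data.frame per feature
data <- Map(function(t, f) data.frame(temp = t, fac = f), temp, fac)

length(data)      # one element per feature
names(data[[1]])  # "temp" "fac"
```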

SymbolixAU commented 4 years ago

yeah ok, let's do it

mdsumner commented 4 years ago

haha can't keep up, this does it (but is a bit complicated atm with upcoming dplyr/tibble changes)

  library(sfheaders)
d0 <- data.frame(x = 1:10, y = 1:10, linestring_id = 1)

d <- rbind(d0, 2 * d0)
d$temp <- sort(c(runif(nrow(d)/2),
                 runif(nrow(d)/2, 2, 5)))
d$fac <- sample(5:15, nrow(d), replace = TRUE)
d$time <- Sys.time() + seq_len(nrow(d)) * 2

datcols <- c("temp", "fac", "time")
## current scheme for list_columns = c("temp", "fac", "time")
x <- sf_linestring(d, x = "x", y = "y", linestring_id = "linestring_id", 
                   keep = TRUE,
                   list_columns = datcols)

library(tidyverse)
d %>% 
  group_by(linestring_id) %>% 
  summarise(
    geometry = sfheaders::sfc_linestring(., "x", "y"), 
    data  = list(select(ungroup(.), datcols))
  )
#> # A tibble: 2 x 3
#>   linestring_id geometry data             
#>           <dbl> <list>   <list>           
#> 1             1 <XY>     <tibble [20 × 3]>
#> 2             2 <XY>     <tibble [20 × 3]>

Created on 2020-03-30 by the reprex package (v0.3.0)