@mdsumner I've added a new argument to the `sf_` functions called `list_columns`, which you use to specify any columns of the df where you want to keep all the values:
df <- data.frame(
id = c(1,1,2,2)
, x = 1:4
, y = 1:4
, val = letters[1:4]
, stringsAsFactors = FALSE
)
sf_linestring(
obj = df
, x = "x"
, y = "y"
, linestring_id = "id"
, keep = TRUE
, list_columns = "val"
)
# id val geometry
# 1 1 a, b 1, 2, 1, 2
# 2 2 c, d 3, 4, 3, 4
vs the current behaviour
sf_linestring(
obj = df
, x = "x"
, y = "y"
, linestring_id = "id"
, keep = TRUE
)
## here in 'val', only the first value of `df$val` per id is kept
# id val geometry
# 1 1 a 1, 2, 1, 2
# 2 2 c 3, 4, 3, 4
Do you foresee any issues if I add this arg to your sf-helper functions?
That's interesting. I don't think it can be a problem, but I'd need to try it with a few things.
It's a strange one; effectively you are allowing vertex attributes beyond x, y, z, m - by storing them one-to-one "elsewhere" (which is what every spatial system does when butting up against the limitations of SF).
This is very complementary to PATH in silicate, which tries to keep these vertex attributes and in theory allows any set of them to be stored in the $vertex table - but it's only really functional for track data (with date-time, depth, temperature, whatever was measured) and not mature. The issue is that $vertex is normalized (de-duplicated) on some geometry (xy, xyz, xyt, etc.), so the other attributes (which, together with x, y, sit at a higher dimensionality with no inherent duplication) end up on the $path_link_vertex table. It's powerful, I think, but not developed.
(Heh, sorry, but this might help bridge that model).
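For reference, a rough sketch of where those tables sit (assuming the silicate package is installed; the exact table names are from memory and may differ slightly by version), using the df from the first example above:
library(silicate)
x_sf <- sfheaders::sf_linestring(df, x = "x", y = "y", linestring_id = "id")
p <- PATH(x_sf)
## $vertex holds de-duplicated geometry only, so per-instance attributes
## (like `val` above) would have to hang off $path_link_vertex instead
names(p)
# expected: something like "object" "path" "path_link_vertex" "vertex"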
I'll try to have a look soonish. Thanks for the heads up!
Oh, do you have a real-world example data set that's relevant? Pretty much any tracking data is, and (not a great example but relevant) the walrus818 in trip:
d <- tibble::as_tibble(trip::walrus818)
d
# A tibble: 10,558 x 6
Deployment DataDT Wet Forage X_AED170_70 Y_AED170_70
<int> <dttm> <int> <int> <dbl> <dbl>
1 353 2009-09-15 04:00:00 1 0 281017 22532
2 353 2009-09-15 05:00:00 0 0 281399 22392
...
sf_linestring(d, x = "X_AED170_70", y = "Y_AED170_70", z = "DataDT", linestring_id = "Deployment")
Here `list_columns` would be c("Wet", "Forage").
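Spelled out (leaving z off for brevity, and assuming the new `list_columns` argument behaves as in the first example above):
sf_linestring(
  d
  , x = "X_AED170_70"
  , y = "Y_AED170_70"
  , linestring_id = "Deployment"
  , keep = TRUE
  , list_columns = c("Wet", "Forage")
)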
Q: does `keep` remain? So we don't have to specify the columns, "just do it"?
crs for walrus818, fwiw:
crsmeta::crs_proj(trip::walrus818)
[1] "+proj=aeqd +ellps=WGS84 +lon_0=-170 +lat_0=70"
Q: does `keep` remain? So we don't have to specify the columns, "just do it"?
Oh I see, list-columns are separate, and compatible with `keep` - very good. (At first I thought you only added one column, with 1 or more vectors in each list.)
Oh, do you have a real-world example data set that's relevant? Pretty much any tracking data
No, nothing specific, but my use-case is multi-coloured path and polygon vertices, and multi-coloured trips (e.g. to represent loads on public transport).
effectively you are allowing vertex attributes beyond x, y, z, m
yep, exactly.
All good! I've had a bit of a refresh on this concept and what I've tried before; this will be a sweet mechanism, with lots of implications across the tools I'm working on.
Oh, so one thing to consider - should the "extra" attributes be bundled into a single list column, where each element is a data frame? Then it's only a do.call(rbind, df$listcol) away and there's no name handling to do: dplyr::bind_rows(df$listcol, .id = "feature_id") also does the same and makes id-ing the feature easier.
I'm not of a strong opinion; possibly I'd prefer data frames in each element (even if only one column) rather than a raw list.
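A tiny self-contained illustration of that flattening step (the objects here are made up just for the example):
listcol <- list(
  data.frame(temp = c(0.1, 0.2)),
  data.frame(temp = c(2.4, 2.8))
)
do.call(rbind, listcol)                        # stacks the per-feature data frames
dplyr::bind_rows(listcol, .id = "feature_id")  # same, plus a feature_id column ("1", "2")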
should the "extra" attributes be bundled into a single list column, where each element is a dataframe?
Do you have an example of this? I can't quite visualise what you mean.
Took a bit more than I thought, but this is what I mean:
library(sfheaders)
d0 <- data.frame(x = 1:10, y = 1:10, linestring_id = 1)
d <- rbind(d0, 2 * d0)
d$temp <- sort(c(runif(nrow(d)/2),
runif(nrow(d)/2, 2, 5)))
d$fac <- sample(5:15, nrow(d), replace = TRUE)
d$time <- Sys.time() + seq_len(nrow(d)) * 2
datcols <- c("temp", "fac", "time")
## current scheme for list_columns = c("temp", "fac", "time")
x <- sf_linestring(d, x = "x", y = "y", linestring_id = "linestring_id",
keep = TRUE,
list_columns = datcols)
## each are separate lists
x$temp
#> [[1]]
#> [1] 0.1012976 0.1808249 0.3301957 0.3396338 0.3452928 0.4666439 0.5471586
#> [8] 0.5992981 0.7777070 0.7850789
#>
#> [[2]]
#> [1] 2.435679 2.778072 2.801986 2.979986 3.047628 3.705985 3.831921 4.083824
#> [9] 4.682265 4.910941
x$fac
#> [[1]]
#> [1] 6 8 9 10 15 15 15 9 9 9
#>
#> [[2]]
#> [1] 7 7 15 6 10 8 13 9 15 9
x$time
#> [[1]]
#> [1] "2020-03-29 09:49:23 AEDT" "2020-03-29 09:49:25 AEDT"
#> [3] "2020-03-29 09:49:27 AEDT" "2020-03-29 09:49:29 AEDT"
#> [5] "2020-03-29 09:49:31 AEDT" "2020-03-29 09:49:33 AEDT"
#> [7] "2020-03-29 09:49:35 AEDT" "2020-03-29 09:49:37 AEDT"
#> [9] "2020-03-29 09:49:39 AEDT" "2020-03-29 09:49:41 AEDT"
#>
#> [[2]]
#> [1] "2020-03-29 09:49:43 AEDT" "2020-03-29 09:49:45 AEDT"
#> [3] "2020-03-29 09:49:47 AEDT" "2020-03-29 09:49:49 AEDT"
#> [5] "2020-03-29 09:49:51 AEDT" "2020-03-29 09:49:53 AEDT"
#> [7] "2020-03-29 09:49:55 AEDT" "2020-03-29 09:49:57 AEDT"
#> [9] "2020-03-29 09:49:59 AEDT" "2020-03-29 09:50:01 AEDT"
## but 1) could be a nested data frame
x2 <- x
x2$data <- split(d[datcols], d$linestring_id)[unique(d$linestring_id)]
## (or 2) a list, closer to current scheme)
#x2$data <- lapply(split(d[datcols], d$linestring_id)[unique(d$linestring_id)], as.list)
x2[datcols] <- NULL
## each separate column is bundled in the one list-column
x2$data
#> $`1`
#> temp fac time
#> 1 0.1012976 6 2020-03-29 09:49:23
#> 2 0.1808249 8 2020-03-29 09:49:25
#> 3 0.3301957 9 2020-03-29 09:49:27
#> 4 0.3396338 10 2020-03-29 09:49:29
#> 5 0.3452928 15 2020-03-29 09:49:31
#> 6 0.4666439 15 2020-03-29 09:49:33
#> 7 0.5471586 15 2020-03-29 09:49:35
#> 8 0.5992981 9 2020-03-29 09:49:37
#> 9 0.7777070 9 2020-03-29 09:49:39
#> 10 0.7850789 9 2020-03-29 09:49:41
#>
#> $`2`
#> temp fac time
#> 11 2.435679 7 2020-03-29 09:49:43
#> 12 2.778072 7 2020-03-29 09:49:45
#> 13 2.801986 15 2020-03-29 09:49:47
#> 14 2.979986 6 2020-03-29 09:49:49
#> 15 3.047628 10 2020-03-29 09:49:51
#> 16 3.705985 8 2020-03-29 09:49:53
#> 17 3.831921 13 2020-03-29 09:49:55
#> 18 4.083824 9 2020-03-29 09:49:57
#> 19 4.682265 15 2020-03-29 09:49:59
#> 20 4.910941 9 2020-03-29 09:50:01
Created on 2020-03-29 by the reprex package (v0.3.0)
fwiw, it's what nest() does
library(tidyverse)
d %>% select(-x, -y) %>% group_by(linestring_id) %>% nest(data = datcols)
# A tibble: 2 x 2
# Groups: linestring_id [2]
linestring_id data
<dbl> <list>
1 1 <tibble [10 × 3]>
2 2 <tibble [10 × 3]>
And it's what unjoin() does, but by separating and de-duplicating the coords if possible (a core of silicate):
unjoin::unjoin(d, x, y, linestring_id)
Ok I see what you're suggesting.
However, my argument for keeping them separate is so you can run colourvalues over each individual list.
colourvalues::colour_values(x$temp)
[[1]]
[1] "#440154FF" "#45065AFF" "#460A5DFF" "#471265FF" "#48176AFF" "#482273FF" "#472D7BFF" "#46337FFF" "#463480FF"
[10] "#433D85FF"
[[2]]
[1] "#26818EFF" "#25858EFF" "#20928CFF" "#1F988BFF" "#1E9D89FF" "#57C766FF" "#B2DD2DFF" "#B3DD2DFF" "#C8E020FF"
[10] "#FDE725FF"
Which means it can go straight into mapdeck ;)
Ah ok, no worries. I appreciate the discussion; it has been clarifying :)
And thinking about it more, I see things like 'nesting' as very much in the realm of other packages (dplyr, data.table), and I think it would be a fair bit of effort to get it "right", when they already do it.
library(data.table)
dt <- as.data.table(d)
dt[
  , .(
    ## .SD is the current group's rows: build the geometry from its x/y
    ## and bundle the remaining columns into a single list column
    geometry = sfheaders::sfc_linestring(obj = .SD, x = "x", y = "y")
    , data = list(.SD[, setdiff(names(.SD), c("x", "y")), with = FALSE])
  )
  , by = .(linestring_id)
]
# linestring_id geometry data
# 1: 1 LINESTRING (1 1, 2 2, 3 3, ... <data.table>
# 2: 2 LINESTRING (2 2, 4 4, 6 6, ... <data.table>
Out of interest, what would be the dplyr equivalent of this data.table code? I don't really speak dplyr, so I've got as far as:
library(tidyverse)
d %>%
group_by(linestring_id) %>%
summarise(
geometry = sfheaders::sfc_linestring(., "x", "y")
)
But thinking some more... maybe it wouldn't be so hard... We could have a `list_data` argument, as well as `list_columns`, which will combine the lists into your suggested `data`...
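For concreteness, the call might look something like this (`list_data` is only a proposed name here, nothing is implemented yet):
sf_linestring(
  d
  , x = "x"
  , y = "y"
  , linestring_id = "linestring_id"
  , keep = TRUE
  , list_data = datcols
)
## hypothetical result: a single `data` list column holding one
## data.frame of temp/fac/time per linestring, rather than three
## separate list columns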
yeah ok, let's do it
haha can't keep up, this does it (but is a bit complicated atm with upcoming dplyr/tibble changes)
library(sfheaders)
d0 <- data.frame(x = 1:10, y = 1:10, linestring_id = 1)
d <- rbind(d0, 2 * d0)
d$temp <- sort(c(runif(nrow(d)/2),
runif(nrow(d)/2, 2, 5)))
d$fac <- sample(5:15, nrow(d), replace = TRUE)
d$time <- Sys.time() + seq_len(nrow(d)) * 2
datcols <- c("temp", "fac", "time")
## current scheme for list_columns = c("temp", "fac", "time")
x <- sf_linestring(d, x = "x", y = "y", linestring_id = "linestring_id",
keep = TRUE,
list_columns = datcols)
library(tidyverse)
d %>%
  group_by(linestring_id) %>%
  summarise(
    ## inside summarise() the bare column names refer to the current
    ## group's values, so both the geometry and the nested data are
    ## built per linestring
    geometry = sfheaders::sfc_linestring(tibble(x = x, y = y), "x", "y"),
    data = list(tibble(temp = temp, fac = fac, time = time))
  )
#> # A tibble: 2 x 3
#>   linestring_id geometry data
#>           <dbl> <list>   <list>
#> 1             1 <XY>     <tibble [10 × 3]>
#> 2             2 <XY>     <tibble [10 × 3]>
Created on 2020-03-30 by the reprex package (v0.3.0)
the reverse of #69 - keep all the rows and make it a list, where each list element is the length of the `sfg_id`

- `list_columns` can be `R_NilValue` (or NULL). I don't want to pass another value through all the `sf_()` functions. So maybe I need to make an `sfheaders::api::` which will switch on the type, and will return to it before the `create_df()` call, so I can then pass in the list_columns...

TODO

- sf_point
- sf_pt
- `sfheaders:::rcpp_sf_` needs to change to `sfheaders:::rcpp_to_sf` (including tests)
- calling `unlist` on a non-list column works (or doesn't error) - this has to error? If it's not a list, how do we know how to handle it?
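On that last point, a quick base R illustration of why it's ambiguous:
unlist(list(1:3, 4:6))  # a real list column flattens as intended
#> [1] 1 2 3 4 5 6
unlist(1:6)             # a non-list column passes straight through, so nothing errors
#> [1] 1 2 3 4 5 6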