Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.57k stars 974 forks source link

compatibility with sf library #2273

Closed rafapereirabr closed 2 years ago

rafapereirabr commented 7 years ago

I wonder if it would be possible to make data.table compatible with the new sf library. The library sf is promising to be a game changer for spatial analysis in R so it sounds like good idea to bring together the power of both libraries.

Currently, the class of an sf object is "sf" "data.frame" and it brings a column named geometry of class "sfc_MULTIPOLYGON" "sfc".

Right now, I think the main incompatibility between dt and sf is that when we convert an sf data.frame into a data.table using setDT(), the geometry column is ruined and the object is not recognised anymore as an sf class.

I'm sure there are other points to take into account when making these two great libraries together so I just wanted to put the ball rolling if someone has not done this before.

MichaelChirico commented 7 years ago

My biggest problem in my R pipeline continues to be the trouble spatial libraries have communicating with data.table. I'm not sure how much can be done on the data.table side though...

mattdowle commented 7 years ago

Reproducible example please.

MichaelChirico commented 7 years ago

1310 is related

rafapereirabr commented 7 years ago

Reproducible Example

library(data.table)
library(sf)

# load sf data
  nc <- st_read( system.file("shape/nc.shp", package="sf") )

  class(nc)
  > [1] "sf"         "data.frame"

head(nc)
  > Simple feature collection with 6 features and 14 fields
  > geometry type:  MULTIPOLYGON
  > dimension:      XY
  > bbox:           xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
  > epsg (SRID):    4267
  > proj4string:    +proj=longlat +datum=NAD27 +no_defs
  >    AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79                        geometry
  > 1 0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1      10  1364     0      19 1 MULTIPOLYGON(((-81.47275543...
  > 2 0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0      10   542     3      12 2 MULTIPOLYGON(((-81.23989105...
  > 3 0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5     208  3616     6     260 3 MULTIPOLYGON(((-80.45634460...
  > 4 0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1     123   830     2     145 4 MULTIPOLYGON(((-76.00897216...
  > 5 0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9    1066  1606     3    1197 5 MULTIPOLYGON(((-77.21766662...
  > 6 0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7     954  1838     5    1237 6 MULTIPOLYGON(((-76.74506378...

Now when we try setting the sf object into data.table, note that the object looses both its "sf" class and the information in the geometry column

# set sf into data.table
  setDT(nc)

  class(nc)
  > [1] "data.table" "data.frame"

  head(nc)
  >     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79 geometry
  > 1: 0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1      10  1364     0      19     <XY>
  > 2: 0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0      10   542     3      12     <XY>
  > 3: 0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5     208  3616     6     260     <XY>
  > 4: 0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1     123   830     2     145     <XY>
  > 5: 0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9    1066  1606     3    1197     <XY>
  > 6: 0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7     954  1838     5    1237     <XY>
rafapereirabr commented 7 years ago

For the record, I've also created an issue in the sf github page #428 because the compatibility between the two libraries might request a bit of collaboration from both sides.

etiennebr commented 7 years ago

If you lose the sf class you can still rely on the sfc class, but you have to take care of more stuff. Also, returning a geometry in a data.table seems tricky, but my DT skills are rusty, so there might be a better way. There is no reason it shouldn't work in theory, but in practice some things might have to be ironed. Looking at the tidy verbs implementation should give a good idea of what needs to be reconciled.

library(sf)
#> Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.2.1 r14555
library(data.table)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
#> Reading layer `nc' from data source `/usr/local/lib/R/site-library/sf/shape/nc.shp' (...)
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
nc <- setDT(nc)

nc[AREA > 0.1, st_area(geometry)]
#> Units: m^2
#>  [1] 1137388604 1423489919 1520740530 1179347051 1232769242 1136287383
#>  [...]
#> [61] 1264050755 2288682896 2181174167 2450241815 2165843695
nc[AREA > 0.1, sum(st_area(st_union(geometry))), by = SID74]
#>     SID74              V1
#>  1:     1  3598847644 m^2
#>  2:     5 11299175600 m^2
#> [...]
#> 20:    15  3850824500 m^2
#> 21:    29  1978619669 m^2
#> 22:    31  2439553215 m^2
#>     SID74              V1

But returning geometries is a problem

nc[, st_union(geometry), by = SID74]
#>     SID74
#>  1:     1
#> [...]
#> 83:    31
#>     SID74
#>                                                                                                                           V1
#>  1:                                                                                                                   <list>
#>  2:                                                                                                                   <list>
[...]
#> 83: -78.86451,-78.91947,-78.95074,-78.97536,-79.00224,-79.00642, 34.47720, 34.45364, 34.44938, 34.39917, 34.38804, 34.36627,
#>                                                                                                                           V1

Another path is to nc <- st_as_sf(setDT(nc)), but there are still problems.

rafapereirabr commented 7 years ago

A quick note that @SymbolixAU has been working on a project creating a spatial.data.table More info HERE. I'm not sure creating a new class is the way forward, but I'm sure there is a good overlap between all these projects.

SymbolixAU commented 7 years ago

FYI, the reason I'm extending the data.table class is because i want to make use of Google's encoded polylines to store shape information. But these polylines can be hundreds of characters, so I wanted a custom print method to handle them.

more info/background on what I'm doing SO: https://stackoverflow.com/questions/44819369/extend-data-table-class-with-custom-print-method blog: https://www.symbolix.com.au/blog-main/x7gdctwdhre678bsplakplkclg8hg4

vlulla commented 7 years ago

The problem @etiennebr is describing is because aggregation of geometry can lead to two different kinds of geometries (POLYGON and MULTIPOLYGON in this case). Try the following:

nc <- st_read(system.file("shape/nc.shp",package="sf"))
nc_DT <- as.data.table(nc)
nc %>% group_by(SID74) %>% summarise(geom = st_union(geometry))
nc_DT[,st_union(geometry),by=SID74][,table(SID74)] # indices where frequency > 1 are multipolygon geometries!

I still don't understand why data.table shows 83 rows for nc_DT[, st_union(geometry),by=SID74]. Regardless of the geometry aggregation issue shouldn't data.table still return only 23 rows here?

vlulla commented 5 years ago

I have realized my error and I know that I have to do nc_DT[, .(st_union(geometry)), by=.(SID74)] but that still doesn't give me a valid object that I can plot!

nc %>% group_by(SID74) %>% summarise(geom=st_union(geometry)) %>% plot  # works!
plot(nc_DT[, .(st_union(geometry)), by=.(SID74)]) # does not work...some weird error!
SymbolixAU commented 5 years ago

@vlulla try

setDT(nc)[, .(st_union(geometry)), by=.(SID74)] %>% sf::st_as_sf() %>% plot
vlulla commented 5 years ago

Thank you @SymbolixAU !! I don't know why I didn't think of this... :-(

mattdowle commented 5 years ago

Am I right in thinking it would still be nice if setDT() retained the sf in the class? Or can this issue be closed now on the data.table side. The solutions above are workarounds? I reran the reproducible example above with 1.12.0 but it's still the same. Note that geometry column is not really lost though. That's just a printing/display difference. It's still all there :

> nc$geometry
Geometry set for 100 features 
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
epsg (SRID):    4267
proj4string:    +proj=longlat +datum=NAD27 +no_defs
First 5 geometries:
MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3...
MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3...
MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36...
MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3...
>
SymbolixAU commented 5 years ago

I just tried

setDT(nc)
attr(nc, "class") <- c(attr(nc, "class"), "sf")

and it seems to cause more issues, so not sure if there's any immediate benefit in retaining the sf class.

At the moment I'm happy with workarounds. There's lots which can be done inside the j argument using anonymous functions and the like. And as has been mentioned, you don't lose the sfc class on the geometry column, which is the main component.

mattdowle commented 5 years ago

Instead of putting sf last in the class I tried putting it first : setattr(nc, "class", c("sf", "data.table", "data.frame")) but this still resulted in problems. Perhaps you could try that and see if those problems can be overcome. The root issue is perhaps that sf has its own [.sf method which either masks or is masked by [.data.table depending on whether sf is before or after data.table in class(). The class attribute being c("sf", "data.table", "data.frame") makes sense to me on first glance. data.table could change setDT() to retain sf like that. But then sf would likely need to become data.table-aware (import data.table) and call NextMethod() inside its [.sf if inherts(x,"data.table"). Something like that.

vlulla commented 5 years ago

Where in https://github.com/r-spatial/sf/blob/master/R/sf.R#L299-L335 should this NextMethod() be called? In the else around Line 318?

At one time (when print.sf was not tbl_df aware) I had the following in my .Rprofile but it caused various problems that I did not know how to resolve so it's commented now:

#### print.sf <- function(x, ..., n = ifelse(options("max.print")[[1]] == 99999, 20, options("max.print")[[1]])) {
####   geoms = which(vapply(x,function(col) inherits(col,"sfc"), TRUE))
####   nf = length(x) - length(geoms)
####   app = paste("and", nf, ifelse(nf == 1, "field", "fields"))
####   if (any(!is.na(st_agr(x))))
####     app = paste0(app,"\n","Attribute-geometry relationship: ", sf:::summarize_agr(x))
####   if (length(geoms) > 1)
####     app = paste0(app,"\n","Active geometry column: ", attr(x, "sf_column"))
####   print(st_geometry(x),n=0,what="Simple feature collection with",append=app)
####   if(is.data.table(x)) {
####     ## data.table:::print.data.table(x, ...)
####     NextMethod()
####   } else {
####     print.data.frame(x, ...)
####   }
####   ## print.data.frame(x, ...)
####   invisible(x)
#### }
mattdowle commented 5 years ago

The lines around 318 in [.sf are like this :

class(x) = setdiff(class(x), "sf") # one step down
x = if (missing(j)) {
    if (nargs == 2) # `[`(x,i)
        x[i] # do sth else for tbl?
    else
        x[i, , drop = drop]
} else
    x[i, j, drop = drop]

You don't need to remove sf from class(x) like that. That's what NextMethod() does for you. I'm not sure exactly what you'd like to achieve in sf in terms of print or [ methods so I'm not sure what exactly to be done. If you can provide some inputs and expected outputs then I could maybe help out.

jangorecki commented 3 years ago

Since last Matt's reply almost 2 years ago there was no feedback. Also previously mentioned issues were addressed in comments by workarounds or suggestions on how to improve integration on sf side. This is quite popular issue so I am not closing it yet but encourage anyone interested into it, to read it, and provide comment what compatibility improvement could be added on data.table side, providing an example of course.

jplecavalier commented 3 years ago

I personally am a massive user of data.table and sf and the best way to integrate them together is to use sfc column inside a data.table instead of using an sf object. Everything integrates very well, the only known issue for me is #4217 and I suggested a temporary workaround in the issue.

CWen001 commented 3 years ago

Maybe this thread helps https://github.com/paleolimbot/wk/issues/12. It seems that (1) data.table fully supports vctrs, or that (2) data.table supports geovctrs (wk) that are the bridge to sf can be potential solutions.

grantmcdermott commented 3 years ago

As Matt noted above:

That's just a printing/display difference. It's still all there.

With @MichaelChirico's recent PR to enable custom printing for list column types (https://github.com/Rdatatable/data.table/pull/3414), I'm wondering whether we could turn some special cases like sfg on by default. For instance:

library(data.table)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.3.1, PROJ 8.0.1

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
setDT(nc)

format_list_item.sfg = function(x, ...) format(x)
nc[1:5, .(NAME, geometry)]
#>           NAME                       geometry
#> 1:        Ashe MULTIPOLYGON (((-81.47276 3...
#> 2:   Alleghany MULTIPOLYGON (((-81.23989 3...
#> 3:       Surry MULTIPOLYGON (((-80.45634 3...
#> 4:   Currituck MULTIPOLYGON (((-76.00897 3...
#> 5: Northampton MULTIPOLYGON (((-77.21767 3...

Created on 2021-08-30 by the reprex package (v2.0.1)

JoshOBrien commented 2 years ago

@grantmcdermott Based on your nice suggestion, I've prepared a PR for the sf package that will produce pretty-printed sfg columns whenever both sf and data.table are loaded. Because format_list_item() is so far only available in the development version of data.table, though, I'll wait to submit the PR until data.table v1.14.4 is available on CRAN.

grantmcdermott commented 2 years ago

Great, thanks @JoshOBrien!

(Aside: shouldn't the PR be on the data.table side, though?)

JoshOBrien commented 2 years ago

@grantmcdermott That's a great question, and I honestly don't know the answer. FTR, here is the commit I would make into a PR if I do this via sf. Following that commit, an sfg column in a data.table gets printed like this:

library(sf)
library(data.table)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
DT <- data.table(nc)
DT[1:3, .(NAME, FIPS, geometry)]
##         NAME  FIPS                       geometry
## 1:      Ashe 37009 MULTIPOLYGON (((-81.47276 3...
## 2: Alleghany 37005 MULTIPOLYGON (((-81.23989 3...
## 3:     Surry 37171 MULTIPOLYGON (((-80.45634 3...

@jangorecki Do you have any opinion about whether I should submit a PR adding a format_list_item() method format_list_item.sfg() on the sf or the data.table side?

jangorecki commented 2 years ago

I think it may fit better to sf package, then it may be reused in other packages as well.

JoshOBrien commented 2 years ago

@jangorecki Sounds good. I'll drop another note here once data.table v1.14.4 is out if my PR then gets pulled into the sf package.

MichaelChirico commented 2 years ago

Certainly my intention when writing the PR was that downstream packages would define methods for their own classes. sf is in a slightly unique case where it doesn't Depend/Import/Suggest data.table, so it's a bit more of a toss-up... still, I think hosting it in sf would be more appropriate.

If they are unwilling to merge we could do so here, since IINM we don't need to add any dependency, just to create an .sfg method.

I had a look at your PR, it looks quite elaborate for just adding a method! I am assuming because of the invisible-line dependency issue.

But the meat of your PR is very simple:

format_list_item.sfg = function(x, ...) {
    format.sfg(x)
}

Maybe format_list_item.default could check if there's a format method registered for the class of the object, and run that as a default, and only lacking that back up to the current <{class}[dim]> approach.

I'm not great with S3 registration stuff, but if this sounds easy to you, a PR would be great.

grantmcdermott commented 2 years ago

Hmmm. Looks I'm in the minority here, thinking that an internal data.table method is the simplest and cleanest thing to do. As Michael (and my earlier post) suggest, it would basically require a single (unexported) line of code.

format_list_item.sfg = function(x, ...) format(x)

Similarly, I don't think this method would change the behaviour for any other data frame(ish) object on the sf side. Both tibbles and vanilla data frames would keep their existing print behaviour. So I'm not sure there's much incentive on their end?

Final argument for keeping this on the data.table side: Converting an sf object to a data.table must not only be done explicitly, but also necessitates a slight change in workflow. (As documented in this thread, we need to refer to the geometry column explicitly when performing spatial operations, e.g. st_union(nc) vs nc_dt[, st_union(geometry)].) That's a perfectly acceptable trade-off from my perspective, for those of us that want to combine the power of the two packages. But the point is we're selecting into this modified workflow. IMHO the print method is just a small (but much appreciated!) convenience layer from the data.table side further enabling this integration.

JoshOBrien commented 2 years ago

Now that I've thought about this a bit more, I think I'm with @grantmcdermott on this one.

It'd be really nice to be able to sidestep the elaborate hoops I go through in that sf-side PR just to be able to register the method without adding data.table to sf as an Imports or Depends. As both @grantmcdermott and @MichaelChirico note, adding the method on the data.table side would require just three lines of code and no additional dependencies.

FTR, I looked into @MichaelChirico suggestion that we could possibly modify the code of data.table:::format_list_item.default(). That solution (as implemented by the code below) works just fine for a simple feature geometry. Unfortunately, it's not a great general approach, since it'll cause all sorts of problems when list items have format methods that don't produce such nicely succinct output For example, it fails (throwing an error) when printing a data.table such as this: x <- data.table(list(mtcars, mtcars))

## Alternative implementation, that uses each list element's class to dispatch a format method
format_list_item.default <- function (x, ...)
{
    exists_format_method <- function(x) {
        !(is.null(getS3method("format", class = x, optional = TRUE)))
    }
    if (is.null(x))
        ""
    else if (is.atomic(x) || inherits(x, "formula"))
        paste(c(format(head(x, 6L), ...), if (length(x) > 6L) "..."),
               collapse = ",")
    else if (any(sapply(class(x), exists_format_method)))
        format(x)
    else
        paste0("<", class(x)[1L], paste_dims(x), ">")
}
MichaelChirico commented 2 years ago

Thanks for investigating Josh. point taken... I am only trying to avoid a maintenance headache if dozens of formatters are eventually requested in data.table.

the most scalable thing seems to me to be keeping a list of classes where the format method plays well.

but I think that can be kicked down the road a bit. I would be fine with a simple PR to data.table as described (with NEWS and mention it in the .Rd)

JoshOBrien commented 2 years ago

@MichaelChirico That makes total sense, and I like your idea of, in the longer term, using a list to record classes with format methods that produce reasonable print.data.table-ready output.

In the PR I just pushed, I added this as a new item in NEWS, but could also see folding it into NEWS item 17, which announced the format_list_item() generic.

Also, I didn't add an example to the *.Rd file, only because I didn't want to induce any dependency on sf or trigger any CRAN-check warnings. Let me know if you think an example would be good, though, and I'll see how best to add one that'll satisfy CRAN's requirements.