dimfalk / kostra2010R

R interface for KOSTRA-DWD-2010R dataset
GNU General Public License v3.0

`get_centroid()`: tile id determination using postal codes #24

Closed: dimfalk closed this issue 1 year ago

dimfalk commented 2 years ago

The OSM-based dataset is 55.2 MB in size and therefore cannot be embedded in a package:

https://opendata-esri-de.opendata.arcgis.com/datasets/esri-de-content::postleitzahlengebiete-osm/about

https://www.suche-postleitzahl.org/downloads

Save as RData or dismiss the idea?

dimfalk commented 2 years ago

`OSM_PLZ.shp`: 55.2 MB (on disk)

`plz <- sf::read_sf("inst/exdata/PLZ/OSM_PLZ.shp")`: 60.8 MB (in memory)

`save(plz, file = "plz.RData")`: 36.5 MB (on disk)

`saveRDS(plz, file = "plz.rds")`: 36.5 MB (on disk)
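The measurements above can be reproduced in principle with a few base R calls. The following is only a sketch on a toy data frame (hypothetical data, not the OSM dataset), so the absolute numbers will differ. Both `save()` and `saveRDS()` use gzip compression by default, which would explain why the two files end up the same size.

```r
# Toy stand-in for the postal code table (hypothetical data, not the OSM dataset).
toy <- data.frame(
  plz = sprintf("%05d", 1:5000),
  value = seq_len(5000) / 7
)

rdata_file <- tempfile(fileext = ".RData")
rds_file <- tempfile(fileext = ".rds")

save(toy, file = rdata_file)   # stores the object together with its name
saveRDS(toy, file = rds_file)  # stores a single, unnamed object

# Sizes in bytes: in-memory footprint vs. both on-disk formats.
sizes <- c(
  in_memory = as.numeric(utils::object.size(toy)),
  rdata = file.size(rdata_file),
  rds = file.size(rds_file)
)
print(sizes)

# Round trip to confirm that nothing is lost when reading the file back.
restored <- readRDS(rds_file)
```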

The size overhead seems too large and the actual benefit marginal, but the insights can be used for #26.

dimfalk commented 2 years ago

However, since only centroid coordinates are relevant for point extraction, actual geometries can be dropped.

`plz_centroids <- sf::st_centroid(plz)`: 6.0 MB (in memory) | 510 KB (on disk)

Moreover, irrelevant columns can be removed from the attribute table.

`plz_minimal <- plz_centroids["plz"]`: 4.1 MB (in memory) | 161 KB (on disk) ✔️
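A minimal sketch of the two reduction steps on toy data, assuming the `sf` package is installed. The two unit squares, the `note` column and the helper `make_square()` are hypothetical stand-ins for the real postal code polygons and their attribute table.

```r
library(sf)

# Hypothetical helper: a closed unit square with lower-left corner (x0, y0).
make_square <- function(x0, y0) {
  sf::st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Toy layer with two "postal code areas" and one irrelevant attribute column.
plz <- sf::st_sf(
  plz = c("00001", "00002"),
  note = c("a", "b"),
  geometry = sf::st_sfc(make_square(0, 0), make_square(2, 2))
)

# Step 1: replace polygons by their centroids.
# st_centroid() warns that attributes are assumed constant over geometries.
plz_centroids <- suppressWarnings(sf::st_centroid(plz))

# Step 2: keep only the postal code column; the geometry column is sticky.
plz_minimal <- plz_centroids["plz"]

print(plz_minimal)
print(utils::object.size(plz), units = "Kb")
print(utils::object.size(plz_minimal), units = "Kb")
```

On real data, the saving comes mainly from step 1, since a point needs one coordinate pair where a polygon may need thousands.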

The sf object consists of 8,725 observations, but there are only 8,169 unique entries in the dataset. Caution: there are supposed to be 8,181 unique entries, so 12 objects are missing.

Is this an overlap between postal code areas and municipalities? It seems more like a multi-polygon approach (for whatever reason), because the attributes of duplicate objects seem to be identical except for OBJECTID, Shape_Length, Shape_Area and geometry.

This would require some cleaning beforehand, perhaps with `dplyr::group_by()`.
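The duplicate check and one possible cleaning step can be sketched as follows. This uses base R `duplicated()` rather than dplyr, and a toy attribute table with hypothetical values; the column names `OBJECTID` and `plz` are taken from the thread. With the figures quoted above, 8,725 - 8,169 = 556 rows would be surplus.

```r
# Toy attribute table in which two postal codes occur twice (hypothetical values).
plz_tbl <- data.frame(
  OBJECTID = 1:5,
  plz = c("00001", "00001", "00002", "00003", "00003"),
  Shape_Area = c(10.5, 0.3, 7.2, 4.1, 1.9)
)

n_obs <- nrow(plz_tbl)                   # number of observations
n_unique <- length(unique(plz_tbl$plz))  # number of distinct postal codes
n_surplus <- n_obs - n_unique            # rows beyond one per postal code

# Keep only the first row per postal code.
plz_dedup <- plz_tbl[!duplicated(plz_tbl$plz), ]

print(c(observations = n_obs, unique = n_unique, surplus = n_surplus))
print(plz_dedup)
```

Keeping the first row is the simplest choice but not necessarily the best one: if the duplicates are parts of one postal code area, the centroid of the first part may not represent the whole area, and merging the parts before computing the centroid could be more appropriate.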

In addition, it would make sense to use the primary source from OSM.

dimfalk commented 1 year ago

get_centroid("33699") |> get_idx()
#' "42024"