KyleHaynes / gnaf.r

An R package for downloading / importing / manipulating G-NAF

gnaf.r

An R package to assist with downloading, importing, and manipulating the (Australian) Geocoded National Address File (G-NAF).

Addresses are a cultural artefact, created from language rather than rules and legislation *

What is G-NAF?

G-NAF is Australia's most trusted authoritative (g)eocoded - (n)ational (a)ddress (f)ile.

More from: https://psma.com.au/product/gnaf/

PSMA's G-NAF dataset contains all physical addresses in Australia. It's the most trusted source of geocoded addresses for Australian businesses and governments.

Before using G-NAF, read the G-NAF End User Licence Agreement.

Where to get G-NAF?

G-NAF is released quarterly and is available from data.gov.au: https://data.gov.au/dataset/ds-dga-19432f89-dc3a-4ef3-b943-5326ef1dbecc/details?q=G-NAF

Dependencies required for this package

Installation of the package

Please note, the package is not on CRAN.

Installing from GitHub:

# Install `remotes` if it isn't already installed.
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")

# Install the `gnaf.r` package.
remotes::install_github("KyleHaynes/gnaf.r")

Basic usage

Prerequisite steps

The following three steps can be completed manually or with a single call to get_gnaf() (see the example below).

  1. Download G-NAF from data.gov.au: https://data.gov.au/dataset/ds-dga-19432f89-dc3a-4ef3-b943-5326ef1dbecc/details?q=G-NAF
    • NOTE: File size is ~1.5GB compressed / ~7.7GB uncompressed.
  2. Extract the content of the compressed download to a desired location.
  3. Note down the location of the extracted directory, including the month/year subfolder inside it. E.g. "C:/temp/G-NAF/G-NAF FEBRUARY 2020".
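
Step 3 can also be done programmatically. The following is a minimal sketch in base R, not part of gnaf.r: the helper name `find_gnaf_root()` is hypothetical, and the folder-name pattern is an assumption based on the example path above ("G-NAF FEBRUARY 2020").

```r
# Hypothetical helper (not part of gnaf.r): find the "G-NAF <MONTH> <YEAR>"
# folder somewhere beneath the directory the download was extracted to.
find_gnaf_root <- function(extract_dir) {
  stopifnot(dir.exists(extract_dir))

  # List every sub-directory, at any depth, with its full path.
  dirs <- list.dirs(extract_dir, full.names = TRUE, recursive = TRUE)

  # Keep folders named like "G-NAF FEBRUARY 2020" (the pattern is an assumption).
  hits <- dirs[grepl("^G-NAF [A-Z]+ [0-9]{4}$", basename(dirs))]

  if (length(hits) == 0) {
    stop("No 'G-NAF <MONTH> <YEAR>' folder found under: ", extract_dir)
  }

  # Return forward-slash paths so they can be pasted straight into R code.
  normalizePath(hits, winslash = "/")
}
```

The returned path is the kind of value the `dir` argument of `setup()` expects in the examples below. If more than one release has been extracted to the same location, the helper returns all matching folders and the caller picks one.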

From R

# Load the package.
library("gnaf.r")

# Steps 1-3 in the `Prerequisite steps` section above can be completed from within R.
    # Note: If G-NAF is already downloaded, you can skip this function call.
# Download and unpack G-NAF to the "c:/temp/" folder.
get_gnaf(dest_folder = "c:/temp")
# Verbose output example:
    # ------------------
    # The download is approximately 1.5Gb, depending on your internet speed, the 
    # following may take a while.
    # The G-NAF zip file is currently being downloaded to: C:\temp\feb20_gnaf_pipeseparatedvalue.zip
    # ------------------
    # G-NAF has been download and is now uncompressing.
    # ------------------
    # You can now call the `setup()` to begin the initial setup of G-NAF. Be sure to toggle the
    # `states` argument to only import relevant jurisdictions.

    # Example setup call: setup(dir = "C:\\temp\\G-NAF\\G-NAF NOVEMBER 2020", states = "qld")

# Set up the session before importing G-NAF. This step has two purposes:
    # 1. Define the location of the G-NAF (month year) root path (./G-NAF <MONTH> <YEAR>).
    # 2. Define which jurisdictions to import (case-insensitive regex on State abbreviations).
# Replace "NOVEMBER 2020" with the month and year of the release you downloaded.
setup(dir = "C:/temp/G-NAF/G-NAF NOVEMBER 2020", states = "qld")
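
# A rough, base-R-only illustration of how a case-insensitive regex picks out
# State abbreviations. This is a sketch, not the package's internal code: the
# `state_abbr` vector below is an assumption made for the example.
state_abbr <- c("ACT", "NSW", "NT", "OT", "QLD", "SA", "TAS", "VIC", "WA")
# One jurisdiction, matched regardless of case.
qld_only <- grep("qld", state_abbr, ignore.case = TRUE, value = TRUE)
# Several jurisdictions, separated with `|`.
qld_nsw <- grep("qld|nsw", state_abbr, ignore.case = TRUE, value = TRUE)
# An unanchored pattern can match more than intended: "t" hits ACT, NT, OT and TAS.
loose <- grep("t", state_abbr, ignore.case = TRUE, value = TRUE)
# Anchoring with ^ and $ restricts the match to whole abbreviations.
strict <- grep("^(nt|ot)$", state_abbr, ignore.case = TRUE, value = TRUE)
# An empty pattern matches every element.
everything <- grep("", state_abbr, ignore.case = TRUE, value = TRUE)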

# Import G-NAF for Queensland.
gnaf <- build_gnaf()

# Import again with `simple = TRUE` to remove variables that are potentially not
# address related (i.e., reduce the output to just address information).
gnaf_simple <- build_gnaf(simple = TRUE)

# Inspect the structure of each object.
str(gnaf)
    # Classes ‘data.table’ and 'data.frame':  590395 obs. of  48 variables:
    #  $ ADDRESS_DETAIL_PID                   : chr  "GAACT714845933" "GAACT714845934" "GAACT714845935" "GAACT714845936" ...
    #  $ BUILDING_NAME                        : chr  "" "" "" "" ...
    #  $ LOT_NUMBER                           : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ FLAT_NUMBER_PREFIX                   : chr  "" "" "" "" ...
    #  $ FLAT_TYPE                            : chr  NA NA NA NA ...
    #  $ FLAT_NUMBER                          : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ FLAT_NUMBER_SUFFIX                   : chr  "" "" "" "" ...
    #  $ LEVEL_TYPE                           : chr  NA NA NA NA ...
    #  $ LEVEL_NUMBER_PREFIX                  : chr  "" "" "" "" ...
    #  $ LEVEL_NUMBER                         : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ NUMBER_FIRST_PREFIX                  : chr  NA NA NA NA ...
    #  $ NUMBER_FIRST                         : int  6 3 26 17 5 24 7 5 22 9 ...
    #  $ NUMBER_FIRST_SUFFIX                  : chr  "" "" "" "" ...
    #  $ NUMBER_LAST                          : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ NUMBER_LAST_SUFFIX                   : chr  NA NA NA NA ...
    #  $ STREET_NAME                          : chr  "PACKHAM" "BUNKER" "JAUNCEY" "GEEVES" ...
    #  $ STREET_TYPE                          : chr  "PLACE" "PLACE" "COURT" "COURT" ...
    #  $ STREET_SUFFIX                        : chr  NA NA NA NA ...
    #  $ LOCALITY_NAME                        : chr  "CHARNWOOD" "CHARNWOOD" "CHARNWOOD" "CHARNWOOD" ...
    #  $ STATE_NAME                           : chr  "AUSTRALIAN CAPITAL TERRITORY" "AUSTRALIAN CAPITAL TERRITORY" "AUSTRALIAN CAPITAL TERRITORY" "AUSTRALIAN CAPITAL TERRITORY" ...
    #  $ POSTCODE                             : int  2615 2615 2615 2615 2902 2615 2902 2615 2615 2902 ...
    #  $ LONGITUDE                            : num  149 149 149 149 149 ...
    #  $ LATITUDE                             : num  -35.2 -35.2 -35.2 -35.2 -35.4 ...
    #  $ MB_2011_CODE                         : chr  "80006300000" "80006310000" "80006380000" "80006280000" ...
    #  $ MB_2016_CODE                         : chr  "80006300000" "80006310000" "80006380000" "80006280000" ...
    #  $ STREET_LOCALITY_PID                  : chr  "ACT3857" "ACT3807" "ACT3833" "ACT3826" ...
    #  $ LOCALITY_PID                         : chr  "ACT570" "ACT570" "ACT570" "ACT570" ...
    #  $ ALIAS_PRINCIPAL                      : chr  "P" "P" "P" "P" ...
    #  $ LEGAL_PARCEL_ID                      : chr  "BELC/CHAR/15/16/" "BELC/CHAR/17/2/" "BELC/CHAR/83/3/" "BELC/CHAR/29/9/" ...
    #  $ CONFIDENCE                           : int  2 2 2 2 2 2 2 2 2 2 ...
    #  $ ADDRESS_SITE_PID                     : int  710446419 710446420 710446421 710446422 710446424 710446425 710446427 710446428 710446429 710446430 ...
    #  $ LEVEL_GEOCODED_CODE                  : int  7 7 7 7 7 7 7 7 7 7 ...
    #  $ GNAF_PROPERTY_PID                    : chr  "1026280" "1026283" "351430" "343650" ...
    #  $ PRIMARY_SECONDARY                    : chr  "" "" "" "" ...
    #  $ PRIMARY_POSTCODE                     : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ GNAF_LOCALITY_PID                    : int  500219587 500219587 500219587 500219587 500219628 500219587 500219628 500219587 500219587 500219628 ...
    #  $ GNAF_RELIABILITY_CODE                : int  5 5 5 5 5 5 5 5 5 5 ...
    #  $ GNAF_STREET_PID                      : int  502493439 502490407 502492206 502491587 502492926 502492206 502492926 502490407 502492206 502492926 ...
    #  $ GNAF_STREET_CONFIDENCE               : int  2 2 2 -1 2 2 2 2 2 2 ...
    #  $ GNAF_RELIABILITY_CODE_street_locality: int  4 4 4 4 4 4 4 4 4 4 ...
    #  $ ADDRESS_DEFAULT_GEOCODE_PID          :integer64 3006501997 3006502410 3006610521 3006506877 3006499300 3006448778 3006616267 3006485909 ... 
    #  $ GEOCODE_TYPE_CODE                    : chr  "FCS" "FCS" "FCS" "FCS" ...
    #  $ ADDRESS_MESH_BLOCK_2011_PID          : chr  "ACT43994755" "ACT43994756" "ACT43994757" "ACT43994758" ...
    #  $ MB_MATCH_CODE                        : int  1 1 1 1 1 1 1 1 1 1 ...
    #  $ ADDRESS_MESH_BLOCK_2016_PID          : chr  "ACT1547490736" "ACT1547490737" "ACT1547490738" "ACT1547490739" ...
    #  $ MB_MATCH_CODE_locality               : int  1 1 1 1 1 1 1 1 1 1 ...
    #  $ LOCALITY_CLASS                       : chr  "GAZETTED LOCALITY" "GAZETTED LOCALITY" "GAZETTED LOCALITY" "GAZETTED LOCALITY" ...
    #  $ STREET_CLASS                         : chr  "CONFIRMED" "CONFIRMED" "CONFIRMED" "CONFIRMED" ...
    #  - attr(*, ".internal.selfref")=<externalptr> 
    #  - attr(*, "sorted")= chr "ADDRESS_DETAIL_PID"

str(gnaf_simple)
    # Classes ‘data.table’ and 'data.frame':  590395 obs. of  25 variables:
    #  $ ADDRESS_DETAIL_PID : chr  "GAACT714845933" "GAACT714845934" "GAACT714845935" "GAACT714845936" ...
    #  $ BUILDING_NAME      : chr  "" "" "" "" ...
    #  $ LOT_NUMBER         : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ FLAT_NUMBER_PREFIX : chr  "" "" "" "" ...
    #  $ FLAT_TYPE          : chr  NA NA NA NA ...
    #  $ FLAT_NUMBER        : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ FLAT_NUMBER_SUFFIX : chr  "" "" "" "" ...
    #  $ LEVEL_TYPE         : chr  NA NA NA NA ...
    #  $ LEVEL_NUMBER_PREFIX: chr  "" "" "" "" ...
    #  $ LEVEL_NUMBER       : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ NUMBER_FIRST_PREFIX: chr  NA NA NA NA ...
    #  $ NUMBER_FIRST       : int  6 3 26 17 5 24 7 5 22 9 ...
    #  $ NUMBER_FIRST_SUFFIX: chr  "" "" "" "" ...
    #  $ NUMBER_LAST        : int  NA NA NA NA NA NA NA NA NA NA ...
    #  $ NUMBER_LAST_SUFFIX : chr  NA NA NA NA ...
    #  $ STREET_NAME        : chr  "PACKHAM" "BUNKER" "JAUNCEY" "GEEVES" ...
    #  $ STREET_TYPE        : chr  "PLACE" "PLACE" "COURT" "COURT" ...
    #  $ STREET_SUFFIX      : chr  NA NA NA NA ...
    #  $ LOCALITY_NAME      : chr  "CHARNWOOD" "CHARNWOOD" "CHARNWOOD" "CHARNWOOD" ...
    #  $ STATE_NAME         : chr  "AUSTRALIAN CAPITAL TERRITORY" "AUSTRALIAN CAPITAL TERRITORY" "AUSTRALIAN CAPITAL TERRITORY" "AUSTRALIAN CAPITAL TERRITORY" ...
    #  $ POSTCODE           : int  2615 2615 2615 2615 2902 2615 2902 2615 2615 2902 ...
    #  $ LONGITUDE          : num  149 149 149 149 149 ...
    #  $ LATITUDE           : num  -35.2 -35.2 -35.2 -35.2 -35.4 ...
    #  $ MB_2011_CODE       : chr  "80006300000" "80006310000" "80006380000" "80006280000" ...
    #  $ MB_2016_CODE       : chr  "80006300000" "80006310000" "80006380000" "80006280000" ...
    #  - attr(*, ".internal.selfref")=<externalptr> 
    #  - attr(*, "sorted")= chr "ADDRESS_DETAIL_PID"

# Size of each object (gigabytes).
format(object.size(gnaf), units = "Gb")
# [1] "0.3 Gb"
format(object.size(gnaf_simple), units = "Gb")
# [1] "0.1 Gb"

# Attempt to build the entire country (including Other Territories: "OT").
# An empty `states` pattern matches every jurisdiction.
setup(dir = "C:/temp/G-NAF/G-NAF FEBRUARY 2020", states = "")

# Import all jurisdictions.
gnaf <- build_gnaf()

# Dimensions of output.
dim(gnaf)
# [1] 15271641       52

# Object size.
format(object.size(gnaf), units = "Gb")
# [1] "8.9 Gb"

# Frequency table by State.
gnaf[, .N, by = STATE_NAME]
    #                      STATE_NAME       N
    # 1: AUSTRALIAN CAPITAL TERRITORY  242999
    # 2:              NEW SOUTH WALES 4749707
    # 3:           NORTHERN TERRITORY  113221
    # 4:            OTHER TERRITORIES    4362
    # 5:                   QUEENSLAND 3219900
    # 6:              SOUTH AUSTRALIA 1163320
    # 7:                     TASMANIA  347396
    # 8:                     VICTORIA 3886769
    # 9:            WESTERN AUSTRALIA 1543967
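
As a quick sanity check on the example output above, the per-State counts should add up to the row count reported by `dim(gnaf)`. The sketch below uses base R only and copies the counts by hand from the frequency table; the variable names are illustrative, and the numbers will differ for other G-NAF releases.

```r
# Counts copied from the example frequency table above.
state_n <- c(
  "AUSTRALIAN CAPITAL TERRITORY" = 242999,
  "NEW SOUTH WALES"              = 4749707,
  "NORTHERN TERRITORY"           = 113221,
  "OTHER TERRITORIES"            = 4362,
  "QUEENSLAND"                   = 3219900,
  "SOUTH AUSTRALIA"              = 1163320,
  "TASMANIA"                     = 347396,
  "VICTORIA"                     = 3886769,
  "WESTERN AUSTRALIA"            = 1543967
)

# The total should equal the number of rows reported by dim(gnaf).
total <- sum(state_n)
total == 15271641
# [1] TRUE

# Percentage of all addresses per State, largest first.
share <- sort(round(100 * state_n / total, 1), decreasing = TRUE)
share
```

With the imported table in memory, the same check would be `sum(gnaf[, .N, by = STATE_NAME]$N) == nrow(gnaf)`.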

Other

Issues / Bugs / Suggestions: https://github.com/KyleHaynes/gnaf.r/issues

Data Licenses / Attribution

G-NAF ©PSMA Australia Limited licensed by the Commonwealth of Australia under the Open Geo-coded National Address File (G-NAF) End User Licence Agreement.

Incorporates or developed using G-NAF ©PSMA Australia Limited licensed by the Commonwealth of Australia under the Open Geo-coded National Address File (G-NAF) End User Licence Agreement.

Special thanks to the Turnbull Government for the innovative and invaluable step in making this data open to all Australians.