wniroshan closed this issue 6 years ago
Could we also update the help message (current text below) to include a better description of what each option actually does, especially with respect to the household and person distributions? Also, --output should be --outputDir or similar. Check the others too to make sure their intent is clear.
#!shell
Usage: ./sa2preprocess.R [options]
Options:
-hi CHARACTER, --householdinput=CHARACTER
Household data from ABS[default= ../data/latch/raw/SA2, NPRD and HCFMD.csv]
-ii CHARACTER, --individualinput=CHARACTER
Individual data from ABS[default= ../data/latch/raw/SA2, RLHP Relationship in Household, SEXP and AGE5P.csv]
-sa1tosa2 CHARACTER, --sa1bysa2home=CHARACTER
Household distribution in SA1 by SA2s [default= ../data/latch/raw/Hh-SA1-in-each-SA2/]
-o CHARACTER, --output=CHARACTER
output file location [default= ../data/latch/absprocessed/SA2/]
-sa2 CHARACTER, --sa2list=CHARACTER
list of SA2s to process [default= Alphington - Fairfield,Northcote,Thornbury,Bundoora - East,Greensborough,Heidelberg - Rosanna,Heidelberg West,Ivanhoe,Ivanhoe East - Eaglemont,Montmorency - Briar Hill,Viewbank - Yallambie,Watsonia,Kingsbury,Preston,Reservoir - East,Reservoir - West]
-h, --help
Show this help message and exit
We should also have a set of tests (automated on Travis) for the various functions that this pre-processing script performs.
Add a description of the script after Usage: and before Options:.
@dhixsingh Please let me know what you think of this format, with a description in front of each option
(I can still do what you suggested above if descriptive tags are preferred).
Usage: ./sa2preprocess.R [options]
This script pre-processes the files downloaded from ABS TableBuilder in preparation to be used in population synthesis.
1. Removes any impossible entries in household and person SA2 level population distributions based on population heuristics.
2. Compares SA2 level household and person distribution, and cleans the data based on population heuristics. This part assumes household distribution as the accurate one of the two and person types distribution is updated to match household types distribution.
3. Calculates SA1 level household distribution based on the corresponding SA2 household distribution. The calculation ensures valid household types distributions at SA1 level, but not person type distributions.
Options:
--hi=HOUSEHOLDS INPUT
Household data file from ABS. The file can be either a zip or a csv. [default= ../../data/melbourne/raw/Households_2016_Greater_Melbourne_SA2.zip]
--pi=PERSONS INPUT
Person data file from ABS. The file can be either a zip or a csv. [default= ../../data/melbourne/raw/Persons_2016_Greater_Melbourne_SA2.zip]
-o OUTPUT DIRECTORY, --o=OUTPUT DIRECTORY
The path of the output directory. [default= ../../data/melbourne/processed/SA2/]
--sa2s=SA2 LIST
The list of SA2s to process. The parameter can be "*" (all SA2s in the household and person input files), a comma-separated list of SA2 names, or a plain text file with one SA2 per line. [default= *]
-a, --a
Set this flag to calculate SA1 level household distribution. [default= FALSE]
--sa1files=SA1 FILES
A comma-separated list of files downloaded from ABS giving the SA2s, their SA1s and the number of households in each SA1 by household type. [default= ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_Inner.zip,
../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_Inner_East.zip,
../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_Inner_South.zip,
../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_North_East.zip,
../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_North_West.zip,
../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_Outer_East.zip,
../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_South_East.zip,
../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_West.zip]
-h, --help
Show this help message and exit
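The three accepted forms of the --sa2s value described in the help text above could be resolved along the following lines. This is a minimal sketch in Python rather than the script's own R; `resolve_sa2s`, its arguments, and the choice to raise an error on unknown names are all assumptions made for illustration, not behaviour taken from sa2preprocess.R.

```python
import os


def resolve_sa2s(value, available):
    """Resolve a --sa2s style value into a list of SA2 names (hypothetical helper).

    value:     "*", a comma-separated list of names, or a path to a text file
    available: SA2 names found in the household and person input files
    """
    if value.strip() == "*":
        # Wildcard: process every SA2 present in the inputs.
        return list(available)
    if os.path.isfile(value):
        # Plain text file with one SA2 name per line; blank lines are skipped.
        with open(value, encoding="utf-8") as fh:
            names = [line.strip() for line in fh if line.strip()]
    else:
        # Comma-separated list; names may contain spaces and hyphens.
        names = [part.strip() for part in value.split(",") if part.strip()]
    # Assumption: unknown names are treated as an error rather than ignored.
    unknown = [name for name in names if name not in available]
    if unknown:
        raise ValueError("SA2s not found in input files: " + ", ".join(unknown))
    return names


print(resolve_sa2s("Northcote, Preston", ["Northcote", "Preston", "Thornbury"]))
# → ['Northcote', 'Preston']
```

Checking for an existing file before splitting on commas keeps the single option unambiguous, at the cost that a file name which does not exist is read as an SA2 name and reported as unknown.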
@wniroshan, some suggested edits:
--households=FILE
--persons=FILE
-o DIR, --output=DIR
--sa2s=LIST_IDS
--sa1s=LIST_FILES
The rest including descriptions is good as is. Ta.
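To show how the suggested names read together, here is a rough sketch of the renamed interface. It uses Python's argparse purely for illustration (the actual script is written in R); `build_parser` and the `sa1_level` destination name are hypothetical, the defaults are copied from the help text earlier in this thread, and the long --sa1s default is left empty for brevity.

```python
import argparse


def build_parser():
    """Declare the options using the names suggested in this thread (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="sa2preprocess",
        description="Pre-process ABS TableBuilder files for population synthesis.",
    )
    parser.add_argument(
        "--households", metavar="FILE",
        default="../../data/melbourne/raw/Households_2016_Greater_Melbourne_SA2.zip",
        help="Household data file from ABS (zip or csv).",
    )
    parser.add_argument(
        "--persons", metavar="FILE",
        default="../../data/melbourne/raw/Persons_2016_Greater_Melbourne_SA2.zip",
        help="Person data file from ABS (zip or csv).",
    )
    parser.add_argument(
        "-o", "--output", metavar="DIR",
        default="../../data/melbourne/processed/SA2/",
        help="The path of the output directory.",
    )
    parser.add_argument(
        "--sa2s", metavar="LIST_IDS", default="*",
        help="SA2s to process: '*', a comma-separated list, or a text file.",
    )
    parser.add_argument(
        "-a", action="store_true", dest="sa1_level",
        help="Calculate the SA1 level household distribution.",
    )
    parser.add_argument(
        "--sa1s", metavar="LIST_FILES", default="",
        help="Comma-separated list of ABS files with households per SA1.",
    )
    return parser


args = build_parser().parse_args(["--sa2s", "Northcote,Preston", "-o", "out/"])
print(args.sa2s, args.output, args.sa1_level)
```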
Some suggestions with respect to commit b39bb1cb14b107690c5c9159a0411c73dc9eb9af:
sa2preprocess.R runs but finishes with a NULL output on the console. Could you fix that, and maybe make it end with Done or something similar so that it is obvious that the whole thing ran correctly.
The output files have an empty header ("") on column 1. For instance, see the example output created in data/melbourne/processed/SA2/Yarraville/persons.csv.gz. Could you put a meaningful label on this column?
household_types.csv.gz and person_types.csv.gz could go in ./data/melbourne/generated/SA2/Yarraville/preprocessed/*_types.csv.gz, and then later on the final population could go in ./data/melbourne/generated/SA2/Yarraville/population/*.csv.gz.
README.md: include column headers and their meanings. Also update the paths in the README.md given the changes above.
The empty column header in the persons.csv.gz and households.csv.gz files represents the row names of the R matrix data structure, so we cannot add a header to it. Instead, I no longer write row names to the CSV at all. The row names are simply line numbers, which have no significance.
Other changes are available from f0e8ed05d1297b7c62c448633c2ca43986431cdc
Wrote unit tests to verify ABS file reading, data cleaning and SA1-level household distribution estimation. The unit tests run without any issues on Travis as of commit 9f75a43991519bdb5ad6d8026a940bae0481a864
The R scripts used for preprocessing and cleaning ABS data perform extra tasks that are not relevant to the algorithm used here. The relevant functionalities need to be extracted and put into independent scripts so they can be easily maintained in the future.