Refactor ABS data preprocessing R scripts

wniroshan commented 6 years ago

The R scripts used for preprocessing and cleaning ABS data perform extra tasks that are not relevant to the algorithm used here. The relevant functionalities need to be extracted and put into independent scripts so they can be easily maintained in the future.

dhixsingh commented 6 years ago

Also the related README.md needs updating.

dhixsingh commented 6 years ago

It would also help to update the help message (current text below) so that it better describes what each option actually does, especially with respect to the household and person distributions. Also, --output should be --outputDir or similar. Check the other options too, to make sure their intent is clear.

#!shell
Usage: ./sa2preprocess.R [options]

Options:
    -hi CHARACTER, --householdinput=CHARACTER
        Household data from ABS[default= ../data/latch/raw/SA2, NPRD and HCFMD.csv]

    -ii CHARACTER, --individualinput=CHARACTER
        Individual data from ABS[default= ../data/latch/raw/SA2, RLHP Relationship in Household, SEXP and AGE5P.csv]

    -sa1tosa2 CHARACTER, --sa1bysa2home=CHARACTER
        Household distribution in SA1 by SA2s [default= ../data/latch/raw/Hh-SA1-in-each-SA2/]

    -o CHARACTER, --output=CHARACTER
        output file location [default= ../data/latch/absprocessed/SA2/]

    -sa2 CHARACTER, --sa2list=CHARACTER
        list of SA2s to process [default= Alphington - Fairfield,Northcote,Thornbury,Bundoora - East,Greensborough,Heidelberg - Rosanna,Heidelberg West,Ivanhoe,Ivanhoe East - Eaglemont,Montmorency - Briar Hill,Viewbank - Yallambie,Watsonia,Kingsbury,Preston,Reservoir - East,Reservoir - West]

    -h, --help
        Show this help message and exit

dhixsingh commented 6 years ago

We should also have a set of tests (automated on Travis) for the various functions that this pre-processing script performs.

dhixsingh commented 6 years ago

Add description for script after Usage: and before Options:.

wniroshan commented 6 years ago

@dhixsingh Please let me know what you think of this format, with a description in front of each option (I can still do what you suggested above if descriptive tags are preferred).

Usage: ./sa2preprocess.R [options]
This script pre-processes the files downloaded from ABS TableBuilder in preparation for use in population synthesis.
1. Removes any impossible entries in the SA2 level household and person population distributions based on population heuristics.
2. Compares the SA2 level household and person distributions, and cleans the data based on population heuristics. This step assumes the household distribution is the more accurate of the two, and updates the person type distribution to match the household type distribution.
3. Calculates the SA1 level household distribution based on the corresponding SA2 household distribution. The calculation ensures valid household type distributions at SA1 level, but not person type distributions.

Options:
    --hi=HOUSEHOLDS INPUT
        Household data file from ABS. The file can be either a zip or a csv. [default= ../../data/melbourne/raw/Households_2016_Greater_Melbourne_SA2.zip]

    --pi=PERSONS INPUT
        Person data file from ABS. The file can be either a zip or a csv. [default= ../../data/melbourne/raw/Persons_2016_Greater_Melbourne_SA2.zip]

    -o OUTPUT DIRECTORY, --o=OUTPUT DIRECTORY
        The path of the output directory. [default= ../../data/melbourne/processed/SA2/]

    --sa2s=SA2 LIST
        The list of SA2s to process. The parameter can be "*" (all SA2s in the household and person input files), a comma separated list of SA2 names, or a plain text file with one SA2 per line. [default= *]

    -a, --a
        Set this flag to calculate SA1 level household distribution. [default= FALSE]

    --sa1files=SA1 FILES
        A comma separated list of files downloaded from ABS giving the SA2s, their SA1s and the number of households in each SA1 by household type. [default= ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_Inner.zip,
    ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_Inner_East.zip,
    ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_Inner_South.zip,
    ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_North_East.zip,
    ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_North_West.zip,
    ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_Outer_East.zip,
    ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_South_East.zip,
    ../../data/melbourne/raw/Household_2016_by_SA2_Melbourne_West.zip]

    -h, --help
        Show this help message and exit
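
To make the three accepted forms of --sa2s concrete, here is a minimal sketch in base R of how the argument could be resolved. The helper name resolve_sa2s and the `available` argument are hypothetical and not taken from sa2preprocess.R, and checking for a file before falling back to a comma separated list is an assumption about precedence.

```r
# Hypothetical helper: turn the --sa2s argument into a character vector of SA2 names.
# `available` stands in for the SA2 names found in the household and person input files.
resolve_sa2s <- function(arg, available) {
  if (identical(arg, "*")) {
    return(available)                              # "*" selects every SA2 in the inputs
  }
  if (file.exists(arg) && !dir.exists(arg)) {
    sa2s <- readLines(arg, warn = FALSE)           # plain text file, one SA2 per line
  } else {
    sa2s <- strsplit(arg, ",", fixed = TRUE)[[1]]  # comma separated list of names
  }
  sa2s <- trimws(sa2s)                             # tolerate spaces around the commas
  sa2s[nzchar(sa2s)]                               # drop blank entries
}

# Example: names containing " - " survive because only commas split the list.
resolve_sa2s("Alphington - Fairfield, Northcote", available = character(0))
```

Splitting only on commas matters here because many SA2 names contain spaces and hyphens.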

dhixsingh commented 6 years ago

@wniroshan, some suggested edits:

For households, maybe --households=FILE
For persons, --persons=FILE
For output directory, -o DIR, --output=DIR
For SA2s, --sa2s=LIST_IDS
For SA1s, --sa1s=LIST_FILES

The rest including descriptions is good as is. Ta.
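
For reference, a rough sketch of how the option names suggested above might be declared, assuming the script builds its help text with the optparse package (the existing help output looks like optparse's format, but that is an inference). The descriptions and defaults are copied from the proposal earlier in this thread. Only three of the options are shown.

```r
library(optparse)  # assumption: sa2preprocess.R uses optparse for its command line

option_list <- list(
  make_option("--households", type = "character", metavar = "FILE",
              default = "../../data/melbourne/raw/Households_2016_Greater_Melbourne_SA2.zip",
              help = "Household data file from ABS. The file can be either a zip or a csv. [default= %default]"),
  make_option("--persons", type = "character", metavar = "FILE",
              default = "../../data/melbourne/raw/Persons_2016_Greater_Melbourne_SA2.zip",
              help = "Person data file from ABS. The file can be either a zip or a csv. [default= %default]"),
  make_option(c("-o", "--output"), type = "character", metavar = "DIR",
              default = "../../data/melbourne/processed/SA2/",
              help = "The path of the output directory. [default= %default]")
)

# `description` is printed after the usage line and before the options.
parser <- OptionParser(
  usage = "%prog [options]",
  description = "This script pre-processes the files downloaded from ABS TableBuilder.",
  option_list = option_list
)

# Parse an explicit argument vector so the sketch can run outside Rscript.
opts <- parse_args(parser, args = c("--households=hh.csv", "-o", "out/"))
```

With this layout, `opts$households`, `opts$persons` and `opts$output` hold the parsed values, and running with --help prints the description between the usage line and the option list.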

dhixsingh commented 6 years ago

Some suggestions with respect to commit b39bb1cb14b107690c5c9159a0411c73dc9eb9af:

sa2preprocess.R runs but finishes with a NULL output on the console. Could you fix that, and maybe make it end with Done or something similar so that it is obvious the whole thing ran correctly?
The output persons CSV header has an empty label "" on column 1. For instance, see the example output created in data/melbourne/processed/SA2/Yarraville/persons.csv.gz. Please give this column a meaningful label.
Output CSVs should be called household_types.csv.gz and person_types.csv.gz.
The generated directory structure could be ./data/melbourne/generated/SA2/Yarraville/preprocessed/*_types.csv.gz, and later on the final population could go in ./data/melbourne/generated/SA2/Yarraville/population/*.csv.gz.
Add something about the format of the generated CSV files to the README.md. Include the column headers and their meanings. Also update the paths in the README.md given the changes above.

wniroshan commented 6 years ago

The empty column header in persons.csv.gz and households.csv.gz files represents row names of the R matrix data structure. So, we cannot add a header to that. Instead, I'm not writing row names to csv at all. The row names are simply line numbers which have no significance.
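
As an illustration of the row names change described above, a minimal base R sketch follows. The data frame and its column names are made up for the example and are not the script's actual output columns.

```r
# Hypothetical data standing in for the cleaned person type distribution.
persons <- data.frame(
  Relationship = c("Married", "Lone parent"),
  Sex = c("Male", "Female"),
  Persons = c(120L, 35L)
)

out_file <- file.path(tempdir(), "person_types.csv.gz")

# Write through a gzip connection. row.names = FALSE omits the unnamed first
# column that write.csv() would otherwise emit for the row names.
con <- gzfile(out_file, "w")
write.csv(persons, con, row.names = FALSE)
close(con)

# Reading the file back shows only the three named columns.
back <- read.csv(out_file)
```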

Other changes are available from f0e8ed05d1297b7c62c448633c2ca43986431cdc

wniroshan commented 6 years ago

Wrote unit tests to verify ABS file reading, data cleaning and SA1 level household distribution estimation. The unit tests run without any issues on Travis as of commit 9f75a43991519bdb5ad6d8026a940bae0481a864.

agentsoz / synthetic-population

Refactor ABS data preprocessing R scripts #8