ProjectMOSAIC / mosaicData

R package with Project MOSAIC datasets

issues with Saratoga #26

Closed nicholasjhorton closed 4 years ago

nicholasjhorton commented 6 years ago

Date: February 5, 2018 at 18:26:16 EST
To: Randall Pruim rpruim@calvin.edu
Subject: Clarification of data for SaratogaHouses in mosaicData

Hello Randall,

I’ve started working on a data analysis assignment for uni, based on the data for Saratoga Houses in the mosaicData package for R, and am having trouble understanding some of it: the definitions and actual values seem to be contradictory in some cases, and unusual in others.

Some of the queries I have at the moment:

“salePrice” The data dictionary in the mosaicData package states: "price (1000s of US dollars)", and the data actually contains values from 5000 .. 775000, which would make the sale price of the properties in the range $5,000,000 to $775,000,000 - that's seriously expensive rural real estate, if the multiplier were true!

“lotSize” Actual values range from 2,000 to 775,000

As a comparative example, the smallest lot size on a new estate in the outer suburbs of Melbourne (in 2018), is approximately 350m^2 (~3,770ft^2), with a smallish single-storey two-bedroom house of 8m x 8m = 64m^2 (~690ft^2). Also, a smallish 2-bedroom apartment in a high-rise development on the outskirts of Melbourne's CBD, tends to be around 75m^2 (~810ft^2).

The data dictionary in the mosaicData package states: "size of lot (square feet)", and the data actually contains values from 0 .. 12.2 - how can anyone live on a block of land of 0, 0.01, 0.1 or 12 square feet? I suspect that a multiplier of x100 or x1000 needs to be applied.

“landValue” Actual values range from 616 to 5,300

… perhaps a multiplier of 10,000 converts the lotSize to a more reasonable value when compared with our examples above from Melbourne.

From an extract of obs, something doesn't quite add up:

The only explanation I can come up with for this, is that the land for obs #4 is so degraded or so far out of town that it doesn't attract a high valuation, but the inverse for obs #27, it's right in the centre of a city or large town.

“livingArea” The data dictionary in the mosaicData package states: "value of land (1000s of US dollars)", and the data actually contains values from 200 .. 425000, which makes the landValue of the properties in the range $200,000 to $425,000,000 - that's a big range of values! How does the landValue relate to the salePrice and livingArea?

From an extract of obs:

“pctCollege” How was “neighbourhood” defined, same street, suburb, distance to CBD, local government/ electoral boundary?

It would be useful to know which suburb the property was in, how far it was from the nearest CBD, and how "neighbourhood" was defined. Unfortunately, the data dictionary doesn't provide us with any more details on this variable. The additional (categorical) data might provide better insights on price disparities, and may account for some of the modes we can see in the distribution for this variable.

Attached is a cut-down version of an RMarkdown I’ve used to do the initial exploratory analysis.

I’d really appreciate it if you could provide me with a link to further details on the dataset, or clarifications on how to use the above variables.

Thank you, David

nicholasjhorton commented 6 years ago

Dick's response:

OK. I don’t know how to respond directly to GitHub (I know, I know…)

But…

Price is in $ (not $1000). Min $5000, Max $775000. For clarification, these are 2002$.

lotSize is in Acres (!).

He says: “lotSize” Actual values range from 2,000 to 775,000

What?

summary(SaratogaHouses)
     price           lotSize             age           landValue        livingArea
 Min.   :  5000   Min.   : 0.0000   Min.   :  0.00   Min.   :   200   Min.   : 616
 1st Qu.:145000   1st Qu.: 0.1700   1st Qu.: 13.00   1st Qu.: 15100   1st Qu.:1300
 Median :189900   Median : 0.3700   Median : 19.00   Median : 25000   Median :1634
 Mean   :211967   Mean   : 0.5002   Mean   : 27.92   Mean   : 34557   Mean   :1755
 3rd Qu.:259000   3rd Qu.: 0.5400   3rd Qu.: 34.00   3rd Qu.: 40200   3rd Qu.:2138
 Max.   :775000   Max.   :12.2000   Max.   :225.00   Max.   :412600   Max.   :5228

Ok, later he says 0 to 12.2. Right. Acres. He could multiply by 4046.86 to get sq meters (or similarly 0.404686 to get hectares)
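The conversion can be sketched in R as below. It uses the two factors quoted above and the lotSize quartiles from the summary() output; the square-feet factor (43,560 per acre) is the standard definition and is an addition, not something stated in the reply. The names `lot_acres` and `lot_units` are illustrative only.

```r
# Conversion factors: the first two are quoted in the reply above;
# 43560 sq ft per acre is the standard definition (an addition here).
SQ_M_PER_ACRE  <- 4046.86
HA_PER_ACRE    <- 0.404686
SQ_FT_PER_ACRE <- 43560

# lotSize summary values (acres) copied from the summary() output.
lot_acres <- c(min = 0, q1 = 0.17, median = 0.37, q3 = 0.54, max = 12.2)

# One row per summary statistic, one column per unit.
lot_units <- data.frame(
  acres    = lot_acres,
  sq_feet  = lot_acres * SQ_FT_PER_ACRE,
  sq_m     = lot_acres * SQ_M_PER_ACRE,
  hectares = lot_acres * HA_PER_ACRE
)
print(round(lot_units, 2))
```

On these numbers the median lot is roughly 1,500 square meters, which is easier to set against the Melbourne figures in the original question than the raw acre values.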

“landValue” Actual values range from 616 to 5,300

Again — Huh?

No, they are the same $ and the range is $200 to $412,600 as listed. Note: There are three houses for which price<landValue. The “prices” of these houses are $10,300, $10,300 and $5000. Sometimes property is transferred (usually to a relative) for a nominal price that has nothing to do with the value. It also looks like rows 851 and 891 are copies of each other. Oh well. Real data!!
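Both checks can be reproduced with a short sketch. To stay self-contained it retypes six columns of the seven low-price rows printed further down in this reply rather than loading the package; the data frame name `low` and the `row` column are illustrative. With mosaicData installed, the same expressions should work on SaratogaHouses directly.

```r
# Six columns of the seven rows with price < 50000, retyped from the
# output shown later in this reply. `row` holds the original row labels.
low <- data.frame(
  row        = c(109, 122, 459, 851, 891, 986, 1011),
  price      = c(25000, 45000, 20000, 10300, 10300, 49387, 5000),
  lotSize    = c(0.21, 0.52, 0.52, 0.16, 0.16, 0.48, 0.29),
  age        = c(75, 75, 59, 20, 20, 56, 4),
  landValue  = c(900, 3900, 8000, 15700, 15700, 31800, 35800),
  livingArea = c(920, 912, 936, 912, 912, 900, 1700)
)

# Sales recorded at less than the land value alone.
below_land <- low[low$price < low$landValue, ]
print(below_land$row)

# Rows that match another row on every column except the row label.
vars <- setdiff(names(low), "row")
is_dup <- duplicated(low[vars]) | duplicated(low[vars], fromLast = TRUE)
dups <- low[is_dup, ]
print(dups$row)
```

One caveat: in the printed output, rows 851 and 891 agree on every column shown except rooms (4 versus 6), so a strict `duplicated()` over all columns of the full dataset may not flag them.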

“livingArea” The data dictionary in the mosaicData package states: "value of land (1000s of US dollars)", and the data actually contains values from 200 .. 425000, which makes the landValue of the properties in the range $200,000 to $425,000,000 - that's a big range of values! How does the landValue relate to the salePrice and livingArea?

Huh? livingArea is in sq ft. To get sq meters, divide by 10 (or more precisely, multiply by 0.092903).

“pctCollege” How was “neighbourhood” defined, same street, suburb, distance to CBD, local government/ electoral boundary?

Answer: School District

It would be useful to know which suburb the property was in, how far it was from the nearest CBD, and how "neighbourhood" was defined. Unfortunately, the data dictionary doesn't provide us with any more details on this variable. The additional (categorical) data might provide better insights on price disparities, and may account for some of the modes we can see in the distribution for this variable.

All are in Saratoga County. No more specific address is given.

Back to price. Checking, there are 7 houses listed with prices < $50,000:

SaratogaHouses[SaratogaHouses$price<50000,]
     price lotSize age landValue livingArea pctCollege bedrooms fireplaces bathrooms rooms
109  25000    0.21  75       900        920         44        2          0       1.0     6
122  45000    0.52  75      3900        912         44        2          0       1.0     4
459  20000    0.52  59      8000        936         20        2          0       1.0     4
851  10300    0.16  20     15700        912         54        2          1       1.5     4
891  10300    0.16  20     15700        912         54        2          1       1.5     6
986  49387    0.48  56     31800        900         77        2          1       1.0     5
1011  5000    0.29   4     35800       1700         63        3          1       2.5     6

It looks to me like quite possibly the only “real” price is the $49387. But I have no more information. It would be interesting to compare models with and without these houses.
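One way such a comparison might be set up is sketched below. The helper `compare_fits`, its `cutoff` and `model` arguments, and the choice of `price ~ livingArea` are all assumptions made for illustration; the thread does not name a model. The `toy` data frame is synthetic and only stands in for the real data so the example runs without the package.

```r
# Hypothetical helper: fit the same linear model on all rows and on the
# rows at or above a price cutoff, returning both coefficient vectors.
compare_fits <- function(houses, cutoff = 50000,
                         model = price ~ livingArea) {
  full    <- lm(model, data = houses)
  trimmed <- lm(model, data = houses[houses$price >= cutoff, ])
  cbind(all_rows = coef(full), without_low = coef(trimmed))
}

# Synthetic stand-in data, NOT the real SaratogaHouses values.
set.seed(1)
toy <- data.frame(livingArea = runif(200, 600, 5000))
toy$price <- 20000 + 110 * toy$livingArea + rnorm(200, sd = 30000)
# Overwrite a few rows with nominal-looking prices.
toy$price[1:5] <- c(5000, 10300, 10300, 20000, 25000)

res <- compare_fits(toy)
print(res)

# With mosaicData installed, the same call would be:
# compare_fits(mosaicData::SaratogaHouses)
```

Because rows are dropped on the response variable, this is best read as a sensitivity check on how much the low-price sales move the coefficients, not as a way to pick the "right" model.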

Questions?

nicholasjhorton commented 6 years ago

Fixed via https://github.com/ProjectMOSAIC/mosaicData/commit/c1c7161f1c222b18cb5ee322821a390e3fdd4de5 thanks to @mchien20

rpruim commented 4 years ago

Double check units on price and lot size in code book (as per email inquiry).