Analysis #23 clean up DC black knight data

maggiesu0725 commented 5 years ago

@lhendey the code is running and ready for a review. I'm having some issue knitting the document, I'll let you know when the html document is ready. cc @sstrochak

lhendey commented 5 years ago

@maggiesu0725 - not sure if i'm supposed to be able to just run this - but i got an error even with the first section of code (the program header).

comments on preclean-dc

line 3 - update subtitle.

line 107 - please update this comment (you're not using the parcel shapefile) need to select active parcels - use in in_last_ownerpt=1

line 113- lot in parcel base is just the last for characters (numbers) of the parcel ID - (SSL - square, suffix, lot) - not sure what the purpose of bringing it in separately?

line 130 - what kind of match on the merge /duplicates did we get?

lhendey commented 5 years ago

@maggiesu0725

comments on clean-dc.rmd

line 246 "UNIMPROVED-USE NOT SPECIFIED", - this should be vacant.

line 276 - coops are not condos and i think residential conversations are apartments. why are these all grouped as condos?

line 355 - take single family garage out of otherres - move to othernonres line 373 move "UNIMPROVED-USE NOT SPECIFIED", to vacant

line 415 - res cooperative horizontal - why is it not in horizontal with the other horizontal?

i don't think i can review the rest until i can see some of the output.

maggiesu0725 commented 5 years ago

@sstrochak @awunderground Hi Sarah and Aaron, I kept having issue with this block of code:``` {r nest} in the program "clean-DC.Rmd" that stopped the document from knitting, I could run the code by block through the entire program without a problem and produce the final cleaned dataset, however. The error is "returning -Infno non-missing arguments to max" I checked the mutated variables that used the max function but I don't think they should cause this error. Do you have any idea for solving this?

lhendey commented 5 years ago

@maggiesu0725 - can you push your updates to the branch and i can review the html file there.

lhendey commented 5 years ago

@maggiesu0725 is there an html file for the preclean?

lhendey commented 5 years ago

@maggiesu0725

Comments on preclean - i got an error on the merge (line 156) - i think because it appears lines 134-148 are commented out? Error in left_join(jurid, parcelbaseDC, by = c(parcelnumber = "parcelbase_SSL")) : object 'jurid' not found

lhendey commented 5 years ago

@maggiesu0725 - i committed one small change.

maggiesu0725 commented 5 years ago

@lhendey resolved most of the comments, had one question left: Why are we keeping only observations that are in parcelbase instead of blackknight data?

lhendey commented 5 years ago

@maggiesu0725
the merges don't look like they are happening right- the parcel_geo + parcel_base merge had more observations than parcel_base? also getting this error:

jur %>% filter(is.na(parcelbase_x)) %>% nrow() Error in filter_impl(.data, quo) : Evaluation error: object 'parcelbase_x' not found.

lhendey commented 5 years ago

@maggiesu0725 the merge in the preclean file is not working correctly because you are dropping important parts of the SSL. There are three components of the ssl (square, suffix and lot) the way you are merging to the BlackKnight data you are dropping the suffix. For example you might have a SSL that looks like "0004N 0027" and you are losing that N. the second left_join to create jur is resulting in a file that is 26 million obs. I need you to address these issues before I can move further in the review.

I do see that there are SSLs in the BLackKnight data that have the suffix as "0004 N 0027" it may be easier to break into Square suffix lot and then merge on those three vars. in our sas data suffix may also be 4 chars. ("N ") let me know if you have questions on this.

lhendey commented 5 years ago

@maggiesu0725 - there are also some parcels in BlackKnight that have weird prefixs on the assessorsparcelnumberapnpin - for example (WF00 0771 2007) with address 301 Tingey St SE - you can look it up in the master address repository and see that the SSL is just "0771 2007" http://dcatlas.dcgis.dc.gov/mar/

lhendey commented 5 years ago

@maggiesu0725 - i tried to knit preclean-dc and i'm getting an error.

Quitting from lines 137-139 (preclean-DC.Rmd) Error in eval(lhs, parent, parent) : object 'parcel' not found Calls: ... hook_eval -> withVisible -> eval -> eval -> %>% -> eval -> eval Execution halted

lhendey commented 5 years ago

@maggiesu0725 - @rpitingolo is going to find you after his 10 o'clock meeting to help sort out the parcel id issues for the join.

rpitingolo commented 5 years ago

@lhendey we are working on this now. @maggiesu0725

maggiesu0725 commented 5 years ago

@rpitingolo was on the wrong pull request... I finished the program and added comment . Please let me know if you see anything strange.

rpitingolo commented 5 years ago

@maggiesu0725 thanks. I will review. @lhendey

rpitingolo commented 5 years ago

@maggiesu0725 I was able to run this without errors. A few follow-ups:

[line 234] should this be a right_join? It seems like the n from jur should = the n from finaljur.

[general comment] I understand how you created newparcelid from the S,S and L components. Ideally this should be done without the hyphens but I'm not adept in R enough to know exactly how to do it.

@lhendey I spot checked the non matches are a lot are because of condos and other quirks. For example in nonmatchingblackknight has 0673 0844 (55 M ST NE) which isn't in parcel_base because parcel_base has every unit at the building as its own SSL with the same owner. I assume this was a building that was switched from condo to apartment at some point in development.

Do you want to try to rectify cases like these or leave them as-is?

maggiesu0725 commented 5 years ago

@rpitingolo answer: 1, no we want the left join because we want to keep all records from parcel base file instead of black knight per Leah's previous comment.

By without hypens do you mean replacing them wiith blank? That should be easy we can just replace - with " " in the commands. Or do you mean keeping the length of the id floating?

rpitingolo commented 5 years ago

@maggiesu0725 thanks for clarifying #1. What I mean on #2 is that the format should be the same as SSL in parcel_base, but maybe it doesn't matter since it was used for a very spceific purpose (@lhendey do you have a take on this?)

Also, the parcels that start with WF00 should be re-coded so that the square is 0771, no suffix, and lot is the same. I'm not sure where the "WF00" came from but I spot checked to confirm they would match with parcel_base once re-coded. Can you please make this edit and then commit so I can review? Thanks.

lhendey commented 5 years ago

@maggiesu0725 - re: 2 above = use spaces as rob suggests instead of hyphens.

In addition to the WF ones - there are ones that begin with "BD", "BDPI", "PI" that should be treated as Rob describes.

maggiesu0725 commented 5 years ago

@rpitingolo updated the code. weirdly, the non matched observations from both dataset don't actually change. Could you take a look?

rpitingolo commented 5 years ago

@maggiesu0725 I'll take a look.

rpitingolo commented 5 years ago

Hi @maggiesu0725 please double check your new code. newparcelid is not working correctly for all the WF, BD IDs. You can see in finaljur that it works sometimes but in many cases does not.

maggiesu0725 commented 5 years ago

@rpitingolo thanks, could you provide an example? I'm having a hard time checking where it's not working

rpitingolo commented 5 years ago

@maggiesu0725 Please remove the code#There arer parcel %>% group_by(parcel_id) %>% filter(n()>1) %>% nrow()parcels that show up more than once on the parcel file. We combine these by taking the average area and width. that is in between chunks. It is causing issues during knitting.

maggiesu0725 commented 5 years ago

@rpitingolo please review.

rpitingolo commented 5 years ago

@maggiesu0725 thanks. Looking now.

rpitingolo commented 5 years ago

@maggiesu0725 there were two typos in your new code that I fixed and committed.

@lhendey knitting works, I think we can call this complete.

lhendey commented 5 years ago

@maggiesu0725 - @rpitingolo referred to preclean-dc. is clean-dc ready for review?

maggiesu0725 commented 5 years ago

@lhendey working on it, aiming to have it done today.

maggiesu0725 commented 5 years ago

@rpitingolo could you review clean-DC?

rpitingolo commented 5 years ago

@maggiesu0725 Leah is going to review clean-DC. @lhendey

lhendey commented 5 years ago

@maggiesu0725 - finished review of clean-dc.rmd - i'll email handwritten comments on the markdown. most relate to better use of vars from parcel base.

lhendey commented 5 years ago

@maggiesu0725 - was there another change you were going to make or is this ready to look at?

lhendey commented 5 years ago

@maggiesu0725 - i'm getting an error on line 1126 (where it's writing the final file) when i tried to knit. can you take a look? I'll move to looking at the other changes now.

maggiesu0725 commented 5 years ago

Thanks, I think it's the classify address macro I used, it checked for another variable (housenumber from black knight) in addition to address for missing category and we shouldn't use that in this case. I've updated the code. @lhendey

lhendey commented 5 years ago

@maggiesu0725 let me know when you want me to look again.

maggiesu0725 commented 5 years ago

I updated the clean parcelbase address and filtered out garages for collapsing. Please review. @lhendey

lhendey commented 5 years ago

@maggiesu0725 This warning comes up twice in the new section on cleaning address. does it need resolved or is it expected? Expected 8 pieces. Missing pieces filled with NA in 9556 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

lhendey commented 5 years ago

@maggiesu0725 - there are still letters on the end of addresses in cleanparceladdress that would affect the collapse?

lhendey commented 5 years ago

@maggiesu0725

in addition to the two comments above -updates comments beginning on line 795 -i see that the number of properties using first is way down. - but some of those using first only have nprops=1 - i thought that the properties in nested were all multiples? -update comments on line 838

maggiesu0725 commented 5 years ago

-This warning comes up twice in the new section on cleaning address. does it need resolved or is it expected?

Expected 8 pieces. Missing pieces filled with NA in 9556 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

This should be fine since I'm allowing each address to separate into 8 parts max, if the address separate less than 8, NA will be generating for the other slots.

-Comments updated.

-i see that the number of properties using first is way down. - but some of those using first only have nprops=1 - i thought that the properties in nested were all multiples?

nprops was created by calculating the length of list of the list variable created in the nested step. I've tried several alternative variables but I'm not sure why it's not calculating the list length correctly. Plus in some cases the length is one because there's one unique value. I tried to just use number of rows for each group as nprop but because it is a nested list, I can't make the code work for now. I will ask for help for that.

there are still letters on the end of addresses in cleanparceladdress that would affect the collapse? I assume we will remove address that end with P or G and remove the ending letter and keep the address for others.

@lhendey

lhendey commented 5 years ago

@maggiesu0725 - let's merge!

NeighborhoodInfoDC / RegHsg

Analysis #23 clean up DC black knight data #38