ccao-data / model-res-avm

Automated valuation model for all class 200 residential properties in Cook County (except vacant land and condos)
GNU Affero General Public License v3.0

Questions Tall and Small #142

Closed JohnSparks2024 closed 6 months ago

JohnSparks2024 commented 9 months ago

Dear CCAO Staff, I want to begin by thanking you for making the assessment value modeling process as transparent and clear as you have. The additions to your web page that explain the process and major issues are very clear and informative. Allowing a member of the public to run the same process on a home computer is the ultimate in transparency and I applaud you for it.

I am particularly interested because I teach statistics and statistical software (including R) at UIC and I am planning a major addition and remodel of my house that will surely affect my assessed value and tax bill. I have looked at the tax bills of a couple of houses in my neighborhood that did something similar so I have a good rough estimate of what my tax bill will be after the addition but, because of the first reason that I listed, I wanted to dig deeper into the process.

After having looked over the documentation and installed the git and dependency packages, I have a few questions and comments, from the tall to the small.

First, the most logistical question. I downloaded all the files and R packages and attempted to run 01-train.R from the pipeline folder. R threw the following error:

    Error in `all_of()`:
    ! Can't subset columns that don't exist.
    ✖ Column `loc_tax_municipality_name` doesn't exist.
    Run `rlang::last_trace()` to see where the error occurred.

I ran the trace, but it returned a giant tree diagram that I don’t understand. Could the source of the error be that I only loaded the 2023 data to the input folder? It didn’t seem sensible to load all of the 2023, 2022 and 2021 data to the same input folder because some of the data files had duplicate names.

Next, a very high-level question. You indicate that the model estimates the sale price of all properties sold in a particular year using the characteristics of the property, neighborhood, etc. Do you then use that model to score all the properties (sold and not sold) for a tax year? If so, at what stage in the pipeline does this take place?

I also noticed that longitude and latitude are independent variables in the model. This seems quite strange. Could you help me understand what role these variables play in the model?

Relatedly, I tried to find a meaningful feature importance report on your website so that I could look at the importance of the longitude and latitude variables, but I was not able to find one. If it exists, can you please direct me to it?

Last, as long as I have you, I believe I noticed a small mistake in your variable documentation (‘Features Used’). The first variable has the name ‘Percent Population Age, Under 19 Years Old’. In looking at the description, I believe it should be ‘…Under 18 Years Old’.

Thank you for your time and attention. I look forward to hearing from you when it is reasonably convenient.

--John Sparks

dfsnow commented 9 months ago

Happy to help! Let me address your comments in turn:

> First the most logistical question is as follows. I downloaded all the files and R packages and attempted to run 01-train.R from the pipeline folder. R threw me the error message Error in all_of(): ! Can't subset columns that don't exist. ✖ Column loc_tax_municipality_name doesn't exist. Run rlang::last_trace() to see where the error occurred.

I suspect this is because you're using the 2023 input data with the current master branch of the model. The master branch has lots of ongoing changes related to the upcoming 2024 reassessment, so it is incompatible with the 2023 input data.

Instead, you should use the 2023 input data and then git checkout the 2023-assessment-year git tag. That tag is a snapshot of the model as it was used for last year's assessments.
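
For readers new to git, the checkout step can be sketched from within R as follows. This is a minimal sketch, assuming git is installed and on the PATH and that the R working directory is a local clone of ccao-data/model-res-avm. Only the tag name comes from the reply above; everything else is illustrative.

```r
# Assumption: the working directory is your local clone of the repository.
# The tag name is the one mentioned in the reply above.
tag <- "2023-assessment-year"

# Download tag references from the remote (a fresh clone usually has them already)
system2("git", c("fetch", "--tags"))

# List matching tags to confirm that the tag exists locally
system2("git", c("tag", "--list", tag))

# Switch the working tree to the tagged snapshot (this leaves you in a
# "detached HEAD" state, which is fine for running the pipeline as-is)
system2("git", c("checkout", tag))
```

The same commands can be typed directly in a terminal as `git fetch --tags` followed by `git checkout 2023-assessment-year`.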

> Next a very high-level question. You indicate that the model estimates the sale price of all properties sold in a particular year using the characteristics of the property, neighborhood, etc. Do you then use that model to score all the properties (sold and not sold) for a tax year? If so, at what stage in the pipeline does this take place?

Yes, the trained model is used to predict the value of both sold and unsold property. This happens in the assess stage of the pipeline.
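
The train-then-assess idea can be illustrated with a toy example. This is a sketch on simulated data, not the pipeline's code: the column names are hypothetical, and a plain linear model (lm) stands in for the LightGBM model that the pipeline fits.

```r
set.seed(42)
n <- 200

# Hypothetical universe of properties; column names are illustrative only
properties <- data.frame(
  sqft = round(runif(n, 800, 3500)),
  age  = round(runif(n, 0, 100))
)

# Only some properties sold recently; the sale price is unknown for the rest
properties$sold <- runif(n) < 0.3
properties$sale_price <- ifelse(
  properties$sold,
  50000 + 150 * properties$sqft - 500 * properties$age + rnorm(n, sd = 20000),
  NA
)

# "Train": fit the model on sold properties only
sales <- properties[properties$sold, ]
fit <- lm(sale_price ~ sqft + age, data = sales)

# "Assess": predict a value for every property, sold or not
properties$pred_value <- predict(fit, newdata = properties)

head(properties)
```

Every row ends up with a predicted value, even though only the sold rows contributed to the fit.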

> I also noticed that longitude and latitude are independent variables in the model. This seems quite strange. Could you help me to understand what role these variables play in the model.

The model uses mostly small geographies (neighborhood, township, etc.) to capture local spatial fixed effects. However, some of these geographies are quite large, and our thinking is that by including lat/lon, we can capture some intra-geography spatial variation in price.
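
A small simulation can make this concrete. The sketch below uses made-up neighborhoods, coordinates, and prices, and uses a linear model purely as a stand-in (a tree-based model such as the one in the pipeline can also pick up nonlinear spatial patterns). It compares a fit that knows only the neighborhood with one that also sees lat/lon.

```r
set.seed(1)
n <- 500

# Hypothetical neighborhoods, each covering a band of latitude
nbhd <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
lat_base <- c(A = 41.80, B = 41.85, C = 41.90)
lat <- unname(lat_base[as.character(nbhd)]) + runif(n, 0, 0.05)
lon <- -87.70 + runif(n, 0, 0.05)

# Simulated price: a neighborhood-level effect plus a smooth north-south
# gradient that also varies *within* each neighborhood
nbhd_effect <- c(A = 0, B = 40000, C = 90000)
price <- 200000 + unname(nbhd_effect[as.character(nbhd)]) +
  1e6 * (lat - 41.80) + rnorm(n, sd = 10000)

sim <- data.frame(price, nbhd, lat, lon)

# Model 1: neighborhood fixed effects only
fit_geo <- lm(price ~ nbhd, data = sim)

# Model 2: neighborhood fixed effects plus coordinates
fit_geo_latlon <- lm(price ~ nbhd + lat + lon, data = sim)

r2_geo <- summary(fit_geo)$r.squared
r2_geo_latlon <- summary(fit_geo_latlon)$r.squared
print(c(geography_only = r2_geo, with_lat_lon = r2_geo_latlon))
```

In this simulated setup, the second model explains more of the price variation because the coordinates carry the within-neighborhood gradient that the neighborhood indicator alone cannot represent.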

> Related, I tried to find a meaningful feature importance report on your website so that I could look at the importance of the longitude and latitude variable, but was not able to find that report. If it exists, can you please direct me to it?

We haven't published such a report since it tends to change from model to model. However, generally the most important features are exactly what you'd expect: location, square footage, and age of the property.

> Last, as long as I have you, I believe I noticed a small mistake in your variable documentation (‘Features Used’). The first variable has the name ‘Percent Population Age, Under 19 Years Old’. In looking at the description, I believe it should be ‘…Under 18 Years Old’.

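This variable does indeed cover children under 19; you can see the definition here: https://github.com/ccao-data/data-architecture/blob/414b91deb3debb838df1d7d7113f795d4b621b0d/dbt/models/census/columns.md?plain=1#L6. However, our note attached to the variable is indeed wrong!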

JohnSparks2024 commented 9 months ago

Hi Dan,

Thanks for your note.

I am not able to find the 2023-assessment-year tag.

Everything I know about git environments I learned as part of downloading your project.

Please provide a bit more specifics about how I can find and download this functionality.

Thanks. --John Sparks


dfsnow commented 9 months ago

@JohnSparks2024 I'm more than happy to help with the specifics of the model, but teaching git is a little bit outside of our purview as a team. I recommend checking out GitHub's git resources page (https://docs.github.com/en/get-started/quickstart/git-and-github-learning-resources) to get started!

JohnSparks2024 commented 9 months ago

I understand.

If you could just send me a screenshot of where the branch is located I can take it from there.

I will also look into the resources page you sent me.

Thanks for all your help. --JJS


dfsnow commented 9 months ago

On the GitHub UI, you can find it in the branches/tags dropdown:

[Screenshot 2024-01-04 at 4.39.25 PM: https://github.com/ccao-data/model-res-avm/assets/31494343/668b1956-e879-427b-9d5e-e87af92c813f]

JohnSparks2024 commented 9 months ago

Got it! Thanks.

--JJS


JohnSparks2024 commented 9 months ago

Dan,

Good news! I got much further.

But it appears that the object lgbm_wflow_final_fit was not created, even though the pieces needed to make it appear to exist.

Any clarification you can provide would be much appreciated.

--JJS

    # Fit the final model using the training data and our final hyperparameters
    # This model is used to measure performance on the test set
    > message("Fitting final model on training data")
    Fitting final model on training data
    > lgbm_wflow_final_fit <- lgbm_wflow %>%
    +   update_model(lgbm_model_final) %>%
    +   finalize_workflow(lgbm_final_params) %>%
    +   fit(data = train)
    Error in vectbl_assign(x[[j]], i, recycled_value[[j]]) :
      DLL requires the use of native symbols
    > ls()
     [1] "cv_enable" "df" "extract_num_iterations" "GCtorture"
     [5] "lgbm_final_params" "lgbm_missing_params" "lgbm_model" "lgbm_model_final"
     [9] "lgbm_wflow" "model_delete_run" "model_fetch_run" "model_file_dict"
    [13] "model_main_recipe" "num_threads" "params" "paths"
    [17] "rolling_origin_pct_split" "select_max_iterations" "split_data" "test"
    [21] "train" "train_recipe" "training_data_full" "var_encode"
    > head(lgbm_wflow_final_fit)
    Error: object 'lgbm_wflow_final_fit' not found

dfsnow commented 9 months ago

> Error in vectbl_assign(x[[j]], i, recycled_value[[j]]) : DLL requires the use of native symbols

It looks like you have an issue with your installed version of dplyr. See the GitHub issue here for some suggestions.
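
As a first diagnostic, it can help to see which package versions are actually installed. The sketch below is an assumption about a reasonable starting point, not a summary of what the linked issue recommends. It reports the R version and the installed versions of dplyr and two closely related packages (tibble and vctrs, chosen here for illustration).

```r
# Packages to inspect; this list is an illustrative assumption
pkgs <- c("dplyr", "tibble", "vctrs")

# Look up the installed version of each package, or NA if it is missing
installed <- rownames(installed.packages())
versions <- vapply(
  pkgs,
  function(pkg) {
    if (pkg %in% installed) as.character(packageVersion(pkg)) else NA_character_
  },
  character(1)
)

print(R.version.string)
print(versions)

# If the versions look stale or mismatched, reinstalling is one thing to try:
# install.packages(pkgs)
```

Restarting the R session after any reinstall ensures that the freshly installed packages are the ones loaded.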

dfsnow commented 6 months ago

@JohnSparks2024 did you have any other questions re: this issue or our model in general? I'm doing some modeling season wrap-up and trying to close out old issues. Feel free to re-open this if your questions weren't answered.