Modeling and Analysis

JestonBlu commented 7 years ago

Okay, so here is an issue to post anything you want regarding modeling methods or analysis. I will update you all on this thread as I progress, so feel free to give feedback here at any point. @nitroys @rmglazner @NancyDrew484

JestonBlu commented 7 years ago

So far I'm pulling the project requirements from this section of the Background issue.

The agency would like to know if there is a statistical and significant difference 
between the four geographical sub-neighborhoods (designated 1 through 4 on the map) 
for the following aspects of community life:

a)  Satisfaction Level
b)  Participation in the Community
c)  Willingness to Become Involved
d)  Opinion on Police Response
e)  Safety
f)  Community Improvement Perceptions
g)  Homeownership
h)  Age, racial, and gender composition

(2) The second part to this project is what this STAT653 group (that's the four of us) will do. 
Since we have been learning about the various analysis methods, I'm hoping that we can 
provide the "statistically significant" part of the NWRoc request.

I think there is a bit of a difference between what we have learned in class so far and what this project is going to require. All of the response variables in class have been numeric, but what we have is categorical, so it really requires a different type of modeling. I took categorical data analysis last spring so I'm familiar with what we need to do, but I wonder if maybe we should check with Dr. Akleman again and make sure we are allowed to use other methods.

In terms of satisfying the interests of NW, Chi-squared GOF tests will help determine if there are differences in responses between the neighborhoods. We can also add a third variable (for instance, testing satisfaction by neighborhood across the levels of police rating) and do a CMH (Cochran-Mantel-Haenszel) test. One issue I have had so far is that there are very few responses using the 1 or 2 rating for satisfaction or recommendation, so in order to get a meaningful test I needed to combine the responses of 1 and 2 into 3. I have created override columns for the variables in question and left the original data alone for now. I will also have to combine responses if there are not enough observations in each neighborhood for the other variables of interest.

I did a simple Chi-Squared GOF test of the satisfaction and recommendation responses across the neighborhoods. The low p-value indicates, at the 5% significance level (95% confidence), that the responses differ between the 4 neighborhoods.
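
As a rough illustration of the kind of test described above, here is a minimal Python sketch of a Pearson chi-squared test on a neighborhood-by-rating contingency table, with ratings 1 and 2 folded into 3 first. The counts are invented for illustration and are not the survey data, the function names are hypothetical, and this is not necessarily the exact procedure that was run in JMP.

```python
# Hypothetical counts of satisfaction ratings (1-5) per neighborhood.
# These numbers are made up for illustration only.
counts = {
    1: {1: 0, 2: 2, 3: 10, 4: 20, 5: 30},
    2: {1: 1, 2: 0, 3: 24, 4: 20, 5: 15},
    3: {1: 0, 2: 0, 3: 15, 4: 25, 5: 20},
    4: {1: 0, 2: 3, 3: 27, 4: 18, 5: 10},
}


def collapse(rating):
    """Fold the sparse ratings 1 and 2 into 3."""
    return max(rating, 3)


def collapsed_table(counts):
    """Build a neighborhood x collapsed-rating table of counts."""
    levels = sorted({collapse(r) for row in counts.values() for r in row})
    table = []
    for hood in sorted(counts):
        row = [0] * len(levels)
        for rating, n in counts[hood].items():
            row[levels.index(collapse(rating))] += n
        table.append(row)
    return table


def chi_squared(table):
    """Return the Pearson chi-squared statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


table = collapsed_table(counts)
stat, df = chi_squared(table)

# Upper 5% critical value of the chi-squared distribution with 6 df.
CRITICAL_5PCT_DF6 = 12.592
reject = stat > CRITICAL_5PCT_DF6
print(table, round(stat, 2), df, reject)
```

Collapsing before building the table keeps every expected cell count reasonably large, which is the usual motivation for merging sparse categories.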

If everyone thinks this is the right direction so far, then I will continue these tests for all of the variables in the list.

@nitroys @rmglazner @NancyDrew484

profgeraci commented 7 years ago

I think we should confirm that we can use a Likert variable as a response. She really didn't specify that it had to be numeric, but if it's not going to work I would rather bail now.

I had some other data related to Gainful Employment from various institutions of Higher Education, but now that I look at it, I'm not sure there's a good numerical response variable there either.

Here's the NYT article that piqued my interest: https://www.nytimes.com/2014/02/26/business/economy/the-bane-and-the-boon-of-for-profit-colleges.html?_r=0

And the files: GE-DMYR-2015-Final-Rates.xlsx GE_SSA_Earnings_2014.xlsx

Does anyone else have any @nitroys @rmglazner @JestonBlu

rmglazner commented 7 years ago

Could satisfaction level also be continuous? My guess is probably not, but I thought it might be worth consideration. Since the values of 1, 2, and 3 had to be combined, I don't know if this type of model would be the most useful reflection of the response. That said, I think what has been done so far would definitely work if that is the only response data we have to work with. Higher ed completion rates sound like they would be a potentially good replacement for Satisfaction, and those values would definitely be continuous, right?

nitroys commented 7 years ago

I agree that we should check to make sure our model falls within Dr. Akleman's expectations before we continue, but I also agree that @JestonBlu is on the right track with the chi-squared tests. I can send Dr. A an email, if we all agree.

rmglazner commented 7 years ago

Okay, sending an email sounds like a good idea!

profgeraci commented 7 years ago

I think Dr. Akleman strongly prefers that we post something in the discussion board rather than e-mail. She didn't seem to be watching the "Project Data" discussion too carefully - I suggest we post something in the "General Discussion" asking for clarification. Can you please do that, Shannon (@nitroys)? My computer is acting up (just had to reboot and restart the browser twice to get this posted).

@nitroys @rmglazner @JestonBlu

profgeraci commented 7 years ago

Dr. Akleman's answer is kinda frustrating, but it looks like we are going to need to find some new data to analyze. Frustrating in that we were told to just "find some data to analyze" (without restriction) and now that we see the assignment, the instructions should have been "find some data you can analyze with these specific types of experimental designs".

OK, so now what are we going to do? We need to find some new data. Let's start a new "issue" for each type of data so the threads don't get too long.

Joseph: Do I need to do this: @nitroys @rmglazner @JestonBlu on every issue, or do we all get notified when new replies are posted?

nitroys commented 7 years ago

So I don't necessarily think we need to find a totally new dataset and start from scratch. We could do an ANCOVA-style analysis for this first part, sort of a "naive" first pass at estimation, and estimate the ages or number of years people have lived in the different regions. This is a more numeric response, and while it may not get at the program's goals, it wouldn't be an invalid analysis for us to do for the first part. What do you guys think?

profgeraci commented 7 years ago

Ohhhhh, I like it! Good idea! Let's go with Age, since that variable seems to be a little more accurate - most of the "Years" (how long have you lived here) answers seem to be estimates (10 years, 24 years).

I'll start working on a report format today.

JestonBlu commented 7 years ago

I think you may need to tag people initially so they are aware of the new issue thread, but once they create a post or reply then I think you should get notifications automatically.. you can see who is on the thread at the beginning of this issue on the right-hand side under participants... so everyone should be getting notified of this post...

I think the ANCOVA idea is good, using the years as the response variable. I agree that it's the most numeric of all the columns and there isn't any missing data. I think we can make that work. I would also like to produce the results that NW originally requested, but I'll do that as side work. We don't have to plan on incorporating that into our presentation or write-up.

rmglazner commented 7 years ago

I agree with everyone that years would serve as a good response variable! It is continuous and indirectly reflects satisfaction.

JestonBlu commented 7 years ago

Okay, I'll start working on that then... everyone, if you wouldn't mind, please list some variables that you are most interested in seeing in the model. I won't be able to include all of them for sure, but if you have one that you want to see in, be sure to let me know.

profgeraci commented 7 years ago

Here is my "short list" of variables:

Combined Participation Score (which you've recoded into 012 for Low; 345 for Med; 678 for High)
PoliceRating, FireRating, EMSRating, TrashRating, SnowRemovalRating - I found a strong correlation between some of these and Satisfaction. They are probably not all significant.
One of the FeelSafe variables (they are obviously correlated to each other, so let's just pick one - perhaps FeelSafeDay)
The obvious demographic vars of OwnRent, Gender, Race
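
For reference, the Low/Med/High recoding of the combined participation score mentioned in the list above could be sketched in Python as follows. The function name is hypothetical; only the cut points (0-2, 3-5, 6-8) come from the list.

```python
def participation_level(score):
    """Map a combined participation score (0-8) to Low / Med / High."""
    if not 0 <= score <= 8:
        raise ValueError(f"score out of range 0-8: {score}")
    if score <= 2:
        return "Low"
    if score <= 5:
        return "Med"
    return "High"


# Recode every possible score to show where the cut points fall.
levels = [participation_level(s) for s in range(9)]
print(levels)
```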

Justification: If we are trying to explain AGE of the resident, I would hypothesize that older residents might be more involved (higher Participation Score) in the neighborhood because they have more free time and motivation to preserve the neighborhood.

What other hypotheses might we have (without looking at the data itself) that we can test in this data?

Anne

JestonBlu commented 7 years ago

My preference would be to use years rather than age, but it will be simple to run both, so I will do that and post the results so we can decide on the direction. I think all of those variables are good... I thought it might be interesting to see whether the safety perception is different between the night and day ratings.

nitroys commented 7 years ago

I agree with Anne on the variables to include. Better to keep it simple than throw everything into the model at once.

I think it will be interesting to see both Age and Years represented by the same model (since I feel like they have some kind of relationship themselves-- older people are more likely to have lived there longer). Could you maybe also include a correlation plot between the two, just for reference?
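
A quick numeric companion to the requested correlation plot is the Pearson correlation coefficient between the two variables. The sketch below uses made-up ages and years of residence (not the survey data) and a hypothetical helper name; the plot itself is left out.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical respondents, invented for illustration only.
age = [25, 34, 41, 52, 60, 68, 73]
years = [2, 5, 10, 8, 24, 30, 35]

r = pearson_r(age, years)
print(round(r, 3))
```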

JestonBlu commented 7 years ago

Will do.

rmglazner commented 7 years ago

I also agree with Anne on this. I also agree that it would be interesting to compare age and years, and see if the two factors have any correlation.

JestonBlu commented 7 years ago

Okay, I have posted some preliminary results. I created 4 models in total: 2 using Years as a response, and 2 using Age as a response. I tried one of each with a log transformation. They all have similar performance with a low R2. The log(Years) model has the best R2 at 0.43. I think it's also the most interesting of the four models. It has the highest number of significant variables as well.
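
One thing to keep in mind when comparing R2 across these models: an R2 computed for log(Years) measures fit on the log scale, so it is not directly comparable to an R2 for untransformed Years. A minimal Python sketch of one way to put both on the same footing, by back-transforming the log-scale predictions, is below. The data are invented, the helper names are hypothetical, and a single predictor (Age) stands in for the full model.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(observed, predicted):
    """1 - SS_residual / SS_total for the given observed values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical respondents, invented for illustration only.
age = [25, 34, 41, 52, 60, 68, 73]
years = [2, 5, 10, 8, 24, 30, 35]

# Fit on the log scale, as in the log(Years) model.
log_years = [math.log(y) for y in years]
intercept, slope = fit_line(age, log_years)
pred_log = [intercept + slope * a for a in age]

# R2 on the log scale versus R2 after back-transforming to years.
r2_log_scale = r_squared(log_years, pred_log)
r2_original_scale = r_squared(years, [math.exp(p) for p in pred_log])
print(round(r2_log_scale, 3), round(r2_original_scale, 3))
```

The back-transformed value uses exp of the log-scale prediction without any bias correction, which is a simplification.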

The models are all stored in the ReducedData.jmp file. You should be able to open it up and see them in the saved scripts window. I also ran a univariate report on Age and Years and I ran a simple model of Years ~ Age. Age is by far the most significant variable for explaining Years, and the same is true when you do Age ~ Years. Intuitively this makes sense to me.

I initially tried using the data as is, but the fits were really bad, I think due to a major lack of variation for some of the variables. I ended up classifying most of the variables used so far as low, medium, high. I also copied the data to a new tab of the spreadsheet so it's a little easier to look at.

I created a summary.md file in the analysis directory that shows all of the models I created plus the equation form, which I think we will have to show in the presentation and write-up. I haven't done any real interpretation yet, but I wanted to post what I have so you all can have time to look it over and give me some feedback.

I suggest we pick only one of the models to talk about and put in our report/presentation. Take a look at all of them and let's discuss which is most interesting to everyone.

nitroys commented 7 years ago

Joseph, is it easy for you to upload your model output as a pdf, maybe? I've been using SAS for the class because I have SAS at work and I don't have a great computer to download JMP on. If it's too much trouble for you I can figure that out, but if you don't mind, it would be very helpful. Thanks!

JestonBlu commented 7 years ago

Yeah, I'll see if I can do that. I believe I can also export the model scripts as a SAS script too, so I'll do that as well. Probably won't be able to get to it till the morning though, but I'll do it first thing.

nitroys commented 7 years ago

Awesome, no rush. Thanks!!

JestonBlu commented 7 years ago

Hi everyone, I have uploaded all of those reports in a few different formats (pdf, html, ppt). Hopefully that will make it easier to build the report and presentation. I also converted the data to a SAS data set, but I wasn't able to generate the code for the equivalent SAS reports. I imagine we will probably be migrating to SAS only soon though.

Please take a look and let me know if anything doesn't look right.

JestonBlu commented 7 years ago

Hey everyone. I have now added a SAS script as well. This one is just the model with Years ~ Age + PoliceRating + SafeDay + OwnRent. I also tested for interactions with Age, and Age*OwnRent was significant, so I added that as well. I also uploaded a SAS pdf of the results with some basic regression plots. Let me know what you all think.
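
To make the interaction term concrete, here is a small Python sketch of an ANCOVA-style least-squares fit of Years on Age, an OwnRent dummy, and their product. It is not the SAS script: PoliceRating and SafeDay are left out for brevity, the 0/1 coding of OwnRent is an assumption, the helper names are hypothetical, and the data are generated from known coefficients purely so the fit can be checked.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def design_row(age, owns):
    """Intercept, Age, OwnRent dummy (1 = own), and Age*OwnRent interaction."""
    return [1.0, float(age), float(owns), float(age * owns)]


# Hypothetical residents as (Age, owns home?), invented for illustration.
people = [(25, 0), (35, 0), (50, 0), (65, 0), (30, 1), (45, 1), (60, 1), (75, 1)]

# Years generated from known coefficients (no noise):
# renters: Years = 1 + 0.1 * Age; owners: Years = -4 + 0.4 * Age.
true_beta = [1.0, 0.1, -5.0, 0.3]
rows = [design_row(a, o) for a, o in people]
years = [sum(b * x for b, x in zip(true_beta, row)) for row in rows]

beta = ols(rows, years)
print([round(b, 4) for b in beta])
```

The last coefficient is the difference in the Age slope between owners and renters, which is what a significant Age*OwnRent term is picking up.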

nitroys commented 7 years ago

I'm looking over the model right now, and I agree with you, Joseph, that the log years model is the most interesting and significant. However, I think that the neighborhood id variable should stay in the model, despite it not being significant, because the differences in the neighborhoods (whether their slopes are different) are what we're really interested in. Do you mind rerunning just the log years model, keeping that variable in? I can keep writing for now and fill in the details later, so no rush on that. Thanks!

JestonBlu commented 7 years ago

Not at all. That's a good point. I'll do that.

JestonBlu commented 7 years ago

Okay, I've added back the neighborhood id. I also added Tukey comparisons to each of the factors. I only updated what appears to be the best model so far, log(Years), and I saved the reports in html, ppt, and pdf. Tomorrow I will work on updating the SAS script to match what I did in JMP. Any other feedback?

nitroys commented 7 years ago

Thanks for getting that done so quickly! It looks good.

JestonBlu / Neighbor-Works

Modeling and Analysis #5