Closed JestonBlu closed 5 years ago
So far I'm pulling the project requirements from this section of the Background issue.
The agency would like to know if there is a statistical and significant difference
between the four geographical sub-neighborhoods (designated 1 through 4 on the map)
for the following aspects of community life:
a) Satisfaction Level
b) Participation in the Community
c) Willingness to Become Involved
d) Opinion on Police Response
e) Safety
f) Community Improvement Perceptions
g) Homeownership
h) Age, racial, and gender composition
(2) The second part to this project is what this STAT653 group (that's the four of us) will do.
Since we have been learning about the various analysis methods, I'm hoping that we can
provide the "statistically significant" part of the NWRoc request.
I think there is a little bit of difference between what we have learned in class so far and what this project is going to require. All of the response variables in class have been numeric, but what we have is categorical, so it really requires a different type of modeling. I took categorical data analysis last spring so I'm familiar with what we need to do, but I wonder if maybe we should check with Dr. Akleman again and make sure we are allowed to use other methods.
In terms of satisfying the interests of NW, Chi-squared GOF tests will help determine if there are differences in responses between the neighborhoods. We can also add a third variable (for instance testing satisfaction by neighborhood across the levels of police rating) and do a CMH test. One issue I have had so far is that there are very few responses using the 1 or 2 rating for satisfaction or recommendation so in order to get a meaningful test I needed to combine the responses of 1 and 2 into 3. I have created override columns for the variables in question and left the original data alone for now. I will also have to combine responses if there are not enough observations in each neighborhood for the other variables of interest.
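The recoding described above could be sketched roughly like this. It is only an illustration in Python, not the workflow actually used for the project; the column names (`Satisfaction`, `Satisfaction_Override`) and the sample rows are hypothetical.

```python
# Hypothetical survey rows; the real column names and values may differ.
rows = [
    {"Neighborhood": 1, "Satisfaction": 1},
    {"Neighborhood": 2, "Satisfaction": 2},
    {"Neighborhood": 3, "Satisfaction": 3},
    {"Neighborhood": 4, "Satisfaction": 5},
]


def collapse_low_ratings(rating, floor=3):
    """Fold the sparse low ratings (1 and 2) into the next level up (3)."""
    return max(rating, floor)


def add_override(rows, source="Satisfaction", target="Satisfaction_Override"):
    """Return new rows with an override column; the source column is left alone."""
    return [{**row, target: collapse_low_ratings(row[source])} for row in rows]


recoded = add_override(rows)
for row in recoded:
    print(row)
```

Keeping the override in a separate column means the original responses stay available if a different grouping turns out to be needed later.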
I did a simple Chi-Squared GOF between the neighborhoods' satisfaction and recommendation responses. The low p-value indicates, at the 5% significance level (95% confidence), that there are differences between the 4 neighborhoods.
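For anyone who wants to see the arithmetic behind this kind of test, here is a minimal sketch. It assumes the comparison is a Pearson chi-squared test of independence on a neighborhood-by-rating table of counts, which is the usual form for comparing groups and may not match exactly what was run. The counts are made up for illustration and are not the survey results.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic for an r x c table of counts.

    Returns (statistic, degrees_of_freedom, expected_counts).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Made-up counts: rows are neighborhoods 1-4, columns are ratings 3, 4, 5
# (ratings 1 and 2 already folded into 3).
counts = [
    [20, 30, 10],
    [10, 25, 25],
    [15, 20, 25],
    [25, 25, 10],
]
statistic, dof, expected = chi_squared_independence(counts)

# 12.592 is the standard 5% critical value for a chi-squared with 6 df.
CRITICAL_5PCT_DF6 = 12.592
print(f"chi2 = {statistic:.2f}, df = {dof}")
print("differences at the 5% level:", statistic > CRITICAL_5PCT_DF6)

# Rule of thumb: expected counts below 5 make the approximation unreliable,
# which is the motivation for combining sparse rating levels.
small_cells = sum(e < 5 for row in expected for e in row)
print("cells with expected count < 5:", small_cells)
```

The check on small expected counts is the reason sparse categories get combined before testing.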
If everyone thinks this is the right direction so far, then I will continue these tests for all of the variables in the list.
@nitroys @rmglazner @NancyDrew484
I think we should confirm that we can use a Likert variable as a response. She really didn't specify that it had to be numeric, but if it's not going to work I would rather bail now.
I had some other data related to Gainful Employment from various institutions of Higher Education, but now that I look at it, I'm not sure there's a good numerical response variable there either.
Here's the NYT article that piqued my interest: https://www.nytimes.com/2014/02/26/business/economy/the-bane-and-the-boon-of-for-profit-colleges.html?_r=0
And the files: GE-DMYR-2015-Final-Rates.xlsx GE_SSA_Earnings_2014.xlsx
Does anyone else have any @nitroys @rmglazner @JestonBlu
Could satisfaction level also be continuous? My guess is probably not, but I thought it might be worth consideration. Since the values of 1, 2, and 3 had to be combined, I don't know if this type of model would be the most useful reflection of the response. That said, I think what has been done so far would definitely work if that is the only response data we have to work with. Higher ed completion rates sound like a potentially good replacement for Satisfaction, and those values would definitely be continuous, right?
I agree that we should check to make sure our model falls within Dr. Akleman's expectations before we continue, but I also agree that @JestonBlu is on the right track with the chi-squared tests. I can send Dr. A an email, if we all agree.
Okay, sending an email sounds like a good idea!
I think Dr. Akleman strongly prefers that we post something in the discussion board rather than e-mail. She didn't seem to be watching the "Project Data" discussion too carefully - I suggest we post something in the "General Discussion" asking for clarification. Can you please do that, Shannon (@nitroys)? My computer is acting up (just had to reboot and restart the browser twice to get this posted).
@nitroys @rmglazner @JestonBlu
Dr. Akleman's answer is kinda frustrating, but it looks like we are going to need to find some new data to analyze. Frustrating in that we were told to just "find some data to analyze" (without restriction), and now that we see the assignment, the instructions should have been "find some data you can analyze with these specific types of experimental designs".
OK, so now what are we going to do? We need to find some new data. Let's start a new "issue" for each type of data so the threads don't get too long.
Joseph: Do I need to do this: @nitroys @rmglazner @JestonBlu on every issue, or do we all get notified when new replies are posted?
So I don't necessarily think we need to find a totally new dataset and start from scratch. We could do an ANCOVA-style analysis for this first part, sort of a "naive" first pass at estimation, and estimate the ages or number of years people have lived in the different regions. This is a more numeric response, and while it may not get at the program's goals, it wouldn't be an invalid analysis for us to do for the first part. What do you guys think?
Ohhhhh, I like it! Good idea! Let's go with Age, since that variable seems to be a little more accurate - most of the "Years" (how long have you lived here) answers seem to be estimates (10 years, 24 years).
I'll start working on a report format today.
I think you may need to tag people initially so they are aware of the new issue thread, but once they create a post or reply, I think you should get notifications automatically. You can see who is on the thread at the beginning of this issue, on the right-hand side under participants, so everyone should be getting notified of this post.
I think the ANCOVA idea is good using the years as the response variable. I agree that it's the most numeric of all the columns and there isn't any missing data. I think we can make that work. I would also like to produce the results that NW originally requested, but I'll do that as side work. We don't have to plan on incorporating that into our presentation or write-up.
I agree with everyone that years would serve as a good response variable! It is continuous and indirectly reflects satisfaction.
Okay, I'll start working on that then. Everyone, if you wouldn't mind, please list the variables that you are most interested in seeing in the model. I won't be able to include all of them for sure, but if you have one that you want to see in, be sure to let me know.
Here is my "short list" of variables:
Justification: If we are trying to explain AGE of the resident, I would hypothesize that older residents might be more involved (higher Participation Score) in the neighborhood because they have more free time and motivation to preserve the neighborhood.
What other hypotheses might we have (without looking at the data itself) that we can test in this data?
My preference would be to use years rather than age, but it will be simple to run both, so I will do that and post the results so we can decide on the direction. I think all of those variables are good. I thought it might be interesting to see whether safety perception differs between the night and day ratings.
I agree with Anne on the variables to include. Better to keep it simple than throw everything into the model at once.
I think it will be interesting to see both Age and Years represented by the same model (since I feel like they have some kind of relationship themselves-- older people are more likely to have lived there longer). Could you maybe also include a correlation plot between the two, just for reference?
Will do.
I also agree with Anne on this. I also agree that it would be interesting to compare age and years, and see if the two factors have any correlation.
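As a quick numeric companion to a correlation plot, the Pearson correlation between the two variables can be computed directly. This is a minimal sketch; the `ages` and `years` lists are made-up values, not the survey data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up values: resident age and years lived in the neighborhood.
ages = [25, 34, 41, 52, 60, 68, 73]
years = [2, 5, 10, 8, 24, 30, 35]

r = pearson_r(ages, years)
print(f"correlation between Age and Years: {r:.2f}")
```

A value near 1 would support the intuition that older residents have generally lived there longer, and would also be a warning against putting both variables in the same model as predictors.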
Okay, I have posted some preliminary results. I created 4 models in total: 2 using Years as a response and 2 using Age as a response. I tried one of each with a log transformation. They all have similar performance with a low R2. The log(Years) model has the best R2 at 0.43. I think it's also the most interesting of the models, and it has the highest number of significant variables as well.
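To make the raw versus log comparison concrete, here is a small sketch that fits a one-predictor least-squares line to Years and to log(Years) and reports R2 for each. It is a simplification: the actual models have several predictors and were fit in JMP, and the data below are made up.

```python
from math import log


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, 1 - ss_resid / ss_total


# Made-up values; Years must be positive for the log transform to be defined.
ages = [25, 34, 41, 52, 60, 68, 73]
years = [2, 5, 10, 8, 24, 30, 35]

raw_fit = fit_line(ages, years)
log_fit = fit_line(ages, [log(y) for y in years])

print(f"Years ~ Age:      R2 = {raw_fit[2]:.2f}")
print(f"log(Years) ~ Age: R2 = {log_fit[2]:.2f}")
```

One caution when picking between them: R2 for Years and R2 for log(Years) measure variation explained on different scales, so residual plots are worth checking alongside the R2 values.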
The models are all stored in the ReducedData.jmp file. You should be able to open it up and see them in the saved scripts window. I also ran a univariate report on Age and Years, and I ran a simple model of Years ~ Age. Age is by far the most significant variable for explaining Years, and the same is true when you do Age ~ Years. Intuitively this makes sense to me.
I initially tried using the data as is, but the fits were really bad, I think due to a major lack of variation in some of the variables. I ended up classifying most of the variables used so far as low, medium, or high. I also copied the data to a new tab of the spreadsheet so it's a little easier to look at.
I created a summary.md file in the analysis directory that shows all of the models I created plus the equation form, which I think we will have to show in the presentation and write-up. I haven't done any real interpretation yet, but I wanted to post what I have so you all can have time to look it over and give me some feedback.
I suggest we pick only one of the models to talk about and put in our report/presentation. Take a look at all of them and let's discuss which is most interesting to everyone.
Joseph, is it easy for you to upload your model output as a pdf, maybe? I've been using SAS for the class because I have SAS at work and I don't have a great computer to download JMP on. If it's too much trouble for you I can figure that out, but if you don't mind, it would be very helpful. Thanks!
Yeah, I'll see if I can do that. I believe I can also export the model scripts as a SAS script, so I'll do that as well. I probably won't be able to get to it until the morning, but I'll do it first thing.
Awesome, no rush. Thanks!!
Hi everyone, I have uploaded all of those reports in a few different formats (pdf, html, ppt). Hopefully that will make it easier to build the report and presentation. I also converted the data to a SAS data set, but I wasn't able to generate the code for the equivalent SAS reports. I imagine we will probably be migrating to SAS only soon though.
Please take a look and let me know if anything doesn't look right.
Hey everyone. I have now added a SAS script as well. This one is just the model with Years ~ Age + PoliceRating + SafeDay + OwnRent. I also tested for interactions with Age, and Age*OwnRent was significant, so I added that as well. I also uploaded a SAS pdf of the results with some basic regression plots. Let me know what you all think.
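To illustrate what an Age by OwnRent interaction means, here is a stripped-down sketch. It drops PoliceRating and SafeDay and keeps only Age, a two-level OwnRent factor, and their interaction; in that reduced model the least-squares estimates match separate straight-line fits for owners and renters, so the interaction is simply the difference between the two Age slopes. The records and the "Own"/"Rent" labels are made up, and the sketch gives no significance test.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fits_by_group(records):
    """Fit Years ~ Age separately within each OwnRent level."""
    groups = {}
    for age, years, tenure in records:
        xs, ys = groups.setdefault(tenure, ([], []))
        xs.append(age)
        ys.append(years)
    return {tenure: fit_line(xs, ys) for tenure, (xs, ys) in groups.items()}


# Made-up records: (Age, Years, OwnRent)
records = [
    (30, 3, "Own"), (45, 12, "Own"), (60, 25, "Own"), (70, 33, "Own"),
    (28, 2, "Rent"), (40, 4, "Rent"), (55, 6, "Rent"), (65, 9, "Rent"),
]

fits = fits_by_group(records)
for tenure, (intercept, slope) in fits.items():
    print(f"{tenure}: Years = {intercept:.2f} + {slope:.2f} * Age")

# The interaction estimate is the difference in Age slopes between groups.
interaction = fits["Own"][1] - fits["Rent"][1]
print(f"slope difference (Own - Rent): {interaction:.2f}")
```

A slope difference near zero would mean Years rises with Age at about the same rate for owners and renters; a clearly nonzero difference is what a significant interaction term is picking up.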
I'm looking over the model right now, and I agree with you, Joseph, that the log years model is the most interesting and significant. However, I think that the neighborhood id variable should stay in the model, despite it not being significant, because the differences between the neighborhoods (whether their slopes are different) are what we're really interested in. Do you mind rerunning just the log years model, keeping that variable in? I can keep writing for now and fill in the details later, so no rush on that. Thanks!
Not at all. That's a good point. I'll do that.
Okay, I've added back the neighborhood id. I also added Tukey comparisons for each of the factors. I only updated what appears to be the best model so far, log(Years), and I saved the reports in html, ppt, and pdf. Tomorrow I will work on updating the SAS script to match what I did in JMP. Any other feedback?
Thanks for getting that done so quickly! It looks good.
Okay, so here is an issue to post anything you want regarding modeling methods or analysis. I will update you all on this thread as I progress, so feel free to give feedback here at any point. @nitroys @rmglazner @NancyDrew484