JestonBlu / Neighbor-Works

Masters Project: Statistical Research

Ideas on NeighborWorks Models #15

Closed profgeraci closed 5 years ago

profgeraci commented 7 years ago

I am tempted to try to create a model to PREDICT the RecommendCat value (2, 3, or 4) using the same variables that we used in our first model:

RecommendCat ~ NeighborhoodID + Age + Gender + FeelSafeNightCat + SatLevelCat + Race + SnowRemovalCat + ParticipationCat + OwnRent + PoliceRating + FeelSafeDayCat + TrashRatingCat

... or we could simplify it to what I think are the most significant predictors:

RecommendCat ~ NeighborhoodID + Age + Gender + OwnRent + PoliceRating + FeelSafeDayCat

So, we have a covariate (Age), a factor of interest (NeighborhoodID) and some additional factors (Gender, OwnRent, PoliceRating, FeelSafe).

So, is this a logistic model or a binomial model? I think it's worth doing, even if everything ends up being NOT statistically significant.

I've created a document on our shared Google Drive:

https://docs.google.com/a/tamu.edu/document/d/11cNOBC_FxmT1P6Y6FBHWPJvZLJ_Z3K3aemGamxgGOcs/edit?usp=sharing

... everyone should be able to edit that document (you may have to login to TAMU). Any ideas are welcome.

@JestonBlu @nitroys @rmglazner

JestonBlu commented 7 years ago

It's similar, but it's technically a multinomial model since there are more than 2 levels in the response. I'm not sure if Prof. A will like us doing that since she hasn't gone over it in class. If you are playing around with it in R, you can use the nnet package's multinom function to fit it.
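If anyone wants to try it, here is a minimal sketch in R, assuming the survey is loaded as a data frame called nw with the columns from the simpler formula above:

```r
## A sketch, not a final model: multinomial logit with nnet.
## `nw` is an assumed name for the survey data frame.
library(nnet)

nw$RecommendCat <- factor(nw$RecommendCat)   # response must be a factor

fit.multi <- multinom(RecommendCat ~ NeighborhoodID + Age + Gender +
                        OwnRent + PoliceRating + FeelSafeDayCat,
                      data = nw, Hess = TRUE)

summary(fit.multi)                        # one coefficient set per non-baseline level
head(predict(fit.multi, type = "probs"))  # predicted probabilities for 2, 3, 4
```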


profgeraci commented 7 years ago

OK, I see, so that's why you suggested collapsing it down to 0 (not recommend) and 1 (recommend).
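For reference, a rough sketch of the collapsed version; the cutoff below (treating 4 as "recommend") is only an assumption to illustrate the glm call:

```r
## Collapse RecommendCat to 0/1 and fit an ordinary logistic regression.
## `nw` and the cutoff are assumptions for illustration only.
nw$Recommend <- ifelse(nw$RecommendCat == 4, 1, 0)

fit.logit <- glm(Recommend ~ NeighborhoodID + Age + Gender +
                   OwnRent + PoliceRating + FeelSafeDayCat,
                 family = binomial, data = nw)
summary(fit.logit)
```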

profgeraci commented 7 years ago

OK, so I was able to do a logistic regression using Excel and a special add-in called StatTools. Here's what I got:

[Screenshot: StatTools logistic regression output]

I'm not sure how to do this in SAS, but it seems like this model would meet the requirements of the project. NeighborhoodID is not significant, but it is a factor of interest. I can't really do interactions with this tool, but I would be interested in the interactions Age*FeelSafeDay or Age*ChangePast3Y.

Here's the Excel file if you want to look at some of the other models that I did: NWRoc 2016 Survey Data.xlsx

Are we going to get together on Hangout tonight or tomorrow night? I'll be away for the Easter weekend - leaving right after our Thursday exam.

JestonBlu commented 7 years ago

It might be better to meet after the test. I've pretty much been focusing on that lately. I think what you have is a good start. You do have significant predictors for both safety and past-3-year change, but the general model fit is pretty low: balanced accuracy would be around 63% given the low true-negative rate, and a rough R^2 estimate would be around .35, so the model still wouldn't explain much of what's going on. I tend to think that's just the nature of this data, where > 90% of people recommended their neighborhoods... Did you have to transform the categorical data into dummy variables for it to work in Excel?
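A rough sketch of how those fit statistics could be computed in R, assuming a fitted binomial glm like the fit.logit sketch above and a 0.5 probability cutoff:

```r
## Balanced accuracy from a confusion matrix at a 0.5 cutoff
prob <- predict(fit.logit, type = "response")
pred <- factor(as.numeric(prob > 0.5), levels = c(0, 1))
obs  <- factor(nw$Recommend, levels = c(0, 1))
tab  <- table(obs, pred)

tpr <- tab["1", "1"] / sum(tab["1", ])   # true positive rate
tnr <- tab["0", "0"] / sum(tab["0", ])   # true negative rate
(tpr + tnr) / 2                          # balanced accuracy

## McFadden's pseudo-R^2, one common "rough R^2" for logistic models
ll.full <- as.numeric(logLik(fit.logit))
ll.null <- as.numeric(logLik(update(fit.logit, . ~ 1)))
1 - ll.full / ll.null
```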


rmglazner commented 7 years ago

The model seems to work with the data, but I agree with Joseph that maybe it is not the best fit. A lot more people recommend than don't recommend, which could be an issue. Unfortunately there is not much data from my field that I can suggest either, as I am still in the early sample collection phase.

Should we talk to the professor about our options while we still have time?

I also agree that it may be better to meet after the exam, especially if we do decide to communicate with Dr. Akleman before then.

JestonBlu commented 7 years ago

I do have some data on pesticide treatments for horn flies on cattle herds that I have been using in the consulting course I'm currently in. I suppose we could see if we can use that. Dr. Akleman may not want us to use something I've used in another class, but if she doesn't mind, then that might be a good option.

Worth a shot?


nitroys commented 7 years ago

I'm also going to be away for Easter starting Thursday, so talking next week would be better for me too. I say asking about the consulting data is worth a shot. If Dr.A doesn't like that idea, I can poke around at what data I might have at work (that's publicly available).

nitroys commented 7 years ago

So I double-checked at work this morning, and we have a subscription here to WRDS that can be used for "academic and research purposes." This is data from the Wharton School at UPenn. Here's a page with what datasets are available, if you want to take a look:

I work with a few of these datasets on a regular basis, but I'm very aware that finance isn't everyone's favorite or strong suit (still not mine, really, even after 3 years). Just an idea, though!

profgeraci commented 7 years ago

Sounds good, Shannon. I don't see a link in your note. Could you download the data that you think might be suitable and upload it to this site?

If we are going to approach Dr. A, we should do that THIS WEEK. Can someone please compose an e-mail that states what our questions are?

JestonBlu commented 7 years ago

I can send her an email today about it. I have added the data set here. https://github.com/JestonBlu/Neighbor-Works/tree/master/data

It's called hornflydata.csv


rmglazner commented 7 years ago

Which data set are we emailing her about - Joseph's or Shannon's?

JestonBlu commented 7 years ago

Sorry, I missed Shannon's email this morning. That might be preferable to what I have.


rmglazner commented 7 years ago

Shannon, can you please resend the link to your data so we can determine which one might be best?

nitroys commented 7 years ago

Whoops, my bad. I tried to insert a hyperlink, but I guess it didn't take.

http://www.whartonwrds.com/our-datasets/

nitroys commented 7 years ago

The datasets I've used include the Fixed Income Securities Database (FISD), which contains info on corporate bonds; CRSP, which is stock price info; TRACE, which is data on bond trades (and messy, so I don't recommend it); and what they're calling Bank Regulatory, which is bank balance sheet information, so assets, liabilities, etc.

rmglazner commented 7 years ago

Thank you! I know very little about finance so the decision is up to you all! Is there a data set that you think would fit best with logistic regression?

Also, during last night's Q&A I asked Dr. Akleman if it was okay if I recorded a video for our presentation rather than presenting in person, which she said was fine since we all live outside of College Station. Once we get going and have a draft put together, perhaps I can record a video and send it to you all for any suggestions for changes before the day it is due?

profgeraci commented 7 years ago

Joseph: I can't seem to access the HornFly.csv file. Can you re-upload it?

JestonBlu commented 7 years ago

Yeah, here it is.


JestonBlu commented 7 years ago

I took a look at those data sets and I thought the American Hospital Association data looked interesting. I suppose we could look at their survey data of hospital demographics and perform some sort of logistic regression depending on what all is available. I thought the homework question about c-section count differences between private and public hospitals was pretty interesting. Maybe we could find something similar in that data?


nitroys commented 7 years ago

I'll take a look at work tomorrow and see what it looks like!

nitroys commented 7 years ago

Ugh, unfortunately we aren't subscribed to that dataset, so I can't access it. My bad, I should have checked on whether I had to be subscribed in order to get access. There are a lot of subscriptions we do have, though. Here's the list:

[Screenshot: list of WRDS datasets we are subscribed to]

nitroys commented 7 years ago

I think the DMEF Academic data could be interesting. Here's the description:

"Four individual data sets, each containing customer buying history for about 100,000 customers of nationally known catalog and non-profit database marketing businesses are available through DMEF to approved academic researchers for use within academic situations.

Corporate names are anonymous and customer names and addresses have been removed, but the business type is indicated. ZIP codes have been retained (if possible) to provide a potential link to Census ZIP level demographics."

Thoughts?

rmglazner commented 7 years ago

What are some of the predictors for that data? Would business type be the binary response for logistic regression?

nitroys commented 7 years ago

Variable Name / Data Type / Variable Description:
ACCNTNMB CHAR Donor ID
CHNGDATE NUM Change of Address Date
CNCOD1 NUM Latest
CNCOD10 NUM 10th
CNCOD2 NUM 2nd
CNCOD3 NUM 3rd
CNCOD4 NUM 4th
CNCOD5 NUM 5th
CNCOD6 NUM 6th
CNCOD7 NUM 7th
CNCOD8 NUM 8th
CNCOD9 NUM 9th
CNDAT1 NUM Latest
CNDAT10 NUM 10th
CNDAT2 NUM 2nd
CNDAT3 NUM 3rd
CNDAT4 NUM 4th
CNDAT5 NUM 5th
CNDAT6 NUM 6th
CNDAT7 NUM 7th
CNDAT8 NUM 8th
CNDAT9 NUM 9th
CNDOL1 NUM Latest
CNDOL10 NUM 10th
CNDOL2 NUM 2nd
CNDOL3 NUM 3rd
CNDOL4 NUM 4th
CNDOL5 NUM 5th
CNDOL6 NUM 6th
CNDOL7 NUM 7th
CNDOL8 NUM 8th
CNDOL9 NUM 9th
CNTMLIF NUM Times Contributed Lifetime
CNTRLIF NUM Dollars Contribution Lifetime
CONLARG NUM Largest Contribution
CONTRFST NUM First Contribution
DATEFST NUM First Contribution Date
DATELRG NUM Largest Contribution Date
FIRMCOD CHAR Firm/Head HH code
MEMBCODE CHAR Membership Code
NEW NUM Data Set Group
NEW1 NUM In First 25% (Based on Uniform Distribution)
NEW2 NUM In Second 25% (Based on Uniform Distribution)
NEW3 NUM In Third 25% (Based on Uniform Distribution)
NEW4 NUM In Last 25% (Based on Uniform Distribution)
NOCLBCOD CHAR No Club Contact Code
NONPRCOD CHAR No Premium Contact Code
NORETCOD CHAR No Return Postage Code
NOSUSCOD CHAR No Sustain Fund Code
PREFCODE CHAR Preferred Contributor Code
REINCODE CHAR Reinstatement Code
REINDATE NUM Reinstatement Date
RENTCODE CHAR Rental Exclusion Code
SECADRIN CHAR 2nd Address Indicator
SEX CHAR Gender
SLCOD1 NUM Latest
SLCOD10 NUM 10th
SLCOD11 NUM 11th
SLCOD2 NUM 2nd
SLCOD3 NUM 3rd
SLCOD4 NUM 4th
SLCOD5 NUM 5th
SLCOD6 NUM 6th
SLCOD7 NUM 7th
SLCOD8 NUM 8th
SLCOD9 NUM 9th
SLDAT1 NUM Latest
SLDAT10 NUM 10th
SLDAT11 NUM 11th
SLDAT2 NUM 2nd
SLDAT3 NUM 3rd
SLDAT4 NUM 4th
SLDAT5 NUM 5th
SLDAT6 NUM 6th
SLDAT7 NUM 7th
SLDAT8 NUM 8th
SLDAT9 NUM 9th
SLTMLIF NUM Times Solicitated Lifetime
STATCODE CHAR State
TARGDOL NUM Dollars of Fall 1995 Donations
TARGRESP NUM Number of Fall 1995 Donations
ZIPCODE CHAR Zip

These are all the variables for the non-profit database. It's information on donations made by individuals and companies to non-profits during the Fall of 1995 (which is honestly pretty old, so I can keep searching if that's unsettling). I was thinking a Poisson regression for the number of donations made would be interesting; we could examine the coefficients for individuals versus companies, male versus female, geographical differences, etc.

My only concern now is that it's so old that it probably doesn't allow us to say much about current donation tendencies.
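A quick sketch of that idea, assuming the non-profit file were loaded as a data frame called dmef; the predictor set is just an illustration pulled from the variable list above:

```r
## Poisson regression for the number of Fall 1995 donations (TARGRESP)
fit.pois <- glm(TARGRESP ~ SEX + FIRMCOD + STATCODE + CNTMLIF,
                family = poisson, data = dmef)
summary(fit.pois)
exp(coef(fit.pois))   # rate ratios, e.g. male vs. female donation counts
```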

nitroys commented 7 years ago

Oh, there's also a dataset about taxi trips in NYC in 2014 and 2015. Here are the variables for that:

DROPOFF_DATE DATE Dropoff Date
DROPOFF_LATITUDE NUM Dropoff Latitude
DROPOFF_LONGITUDE NUM Dropoff Longitude
DROPOFF_TIME NUM Dropoff Time
FARE_AMOUNT NUM Fare Amount
MTA_TAX NUM MTA Tax
PASSENGER_COUNT NUM Passenger Count
PAYMENT_TYPE CHAR Payment Type
PICKUP_DATE DATE Pickup Date
PICKUP_LATITUDE NUM Pickup Latitude
PICKUP_LONGITUDE NUM Pickup Longitude
PICKUP_TIME NUM Pickup Time
RATE_CODE NUM Rate Code
STORE_AND_FWD_FLAG CHAR Store and Forward Flag
SURCHARGE NUM Surcharge
TIP_AMOUNT NUM Tip Amount
TOLLS_AMOUNT NUM Tolls Amount
TOTAL_AMOUNT NUM Total Amount
TRIP_DISTANCE NUM Trip Distance (in miles)
VENDOR_ID CHAR Vendor ID

For this, I'm thinking about a random coefficient model to predict tip amount, where the random effect would be vendor ID.
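In R that could look something like the sketch below (lme4; taxi is an assumed data frame name and the covariates are just placeholders). SAS PROC MIXED can fit the same kind of model.

```r
## Random coefficient model: random intercept and random slope on
## trip distance for each vendor
library(lme4)

fit.rc <- lmer(TIP_AMOUNT ~ TRIP_DISTANCE + FARE_AMOUNT +
                 (1 + TRIP_DISTANCE | VENDOR_ID),
               data = taxi)
summary(fit.rc)
```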

rmglazner commented 7 years ago

Thank you for sharing that additional information!

My thought at this point would be to ask Dr. Akleman if we can use Joseph's fly data since we already have it and it seems to be suitable for our purposes. If we spend too long looking for more data sets, we lose time for writing our paper and preparing the presentation. What do you all think about that? I tried opening the fly data, but I am unable to do so. I am assuming it includes at least 4 predictors? Also, what would be the binary response for logistic regression?

profgeraci commented 7 years ago

I agree with Rachael that if Dr. A lets us use the fly data, that's the best path right now. If not, I say we just stay with the NWRoc data and explain that we didn't get any significant results, but we can explain them anyway (sometimes that happens in the real world).

Joseph - since you know the most about the fly data, could you compose an e-mail to Dr. A?

JestonBlu commented 7 years ago

I just sent her an email about it, but now that I think about it, there aren't really 4 predictors... it's count data, so it wouldn't be logistic regression; it would be closer to Poisson regression with an AR error term because the data was collected weekly. A general model form would be fly_count ~ week + county + treatment... so technically there are 3 predictors, and there are also control herds that didn't receive any treatment.
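If we went that route, one way to sketch it in R is a Poisson GEE with an AR(1) working correlation (geepack); the data frame and column names below are guesses at what is in hornflydata.csv:

```r
## Weekly fly counts within herds, AR(1) correlation across weeks
library(geepack)

flies <- flies[order(flies$herd, flies$week), ]   # geeglm expects clusters sorted

fit.fly <- geeglm(fly_count ~ week + county + treatment,
                  family = poisson, data = flies,
                  id = herd, waves = week, corstr = "ar1")
summary(fit.fly)
```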

I do like the idea of using the cab data... if everyone is set on logistic regression, we could predict whether the cab driver was tipped at all rather than predicting the tip amount. Maybe something like Tip ~ tolls + pickup_time + trip_distance + payment_type?
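Roughly, assuming the taxi data were loaded as a data frame called taxi with the variable names from Shannon's list (how PICKUP_TIME should enter depends on how the time value comes across):

```r
## Binary response: was the driver tipped at all?
taxi$tipped <- as.numeric(taxi$TIP_AMOUNT > 0)

fit.tip <- glm(tipped ~ TOLLS_AMOUNT + PICKUP_TIME + TRIP_DISTANCE +
                 PAYMENT_TYPE,
               family = binomial, data = taxi)
summary(fit.tip)
```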


profgeraci commented 7 years ago

Joseph: Are there no interaction terms that we might consider for the fly data?

I like your idea of the cab data as well.

JestonBlu commented 7 years ago

There is enough data to include interactions as well.


JestonBlu commented 7 years ago

Shannon, would you mind going ahead and pulling the taxi data and making that available?


nitroys commented 7 years ago

Yup. I've uploaded a single month of the 2014 data in a SAS dataset for now. The entire year is very large, so we can definitely go bigger than this, but I wouldn't recommend trying to do the whole year.

It also looks like there are enough 0 and non-zero tips that we could use your idea, Joseph. I was worried there might not be enough of one or the other.

JestonBlu commented 7 years ago

Where did you put the file?


nitroys commented 7 years ago

Sorry. I said that and planned to upload it immediately, but I didn't realize GitHub has a 25 MB limit. So I've actually only pulled two days of data to make it under that limit. It's in the data folder now.

We might not be able to use this dataset with that restriction, unless I run the models without you all being able to see the entire dataset...

JestonBlu commented 7 years ago

What if you pulled a sample of 100K records from the entire year? That way we could test things like whether people tip more in December, or whether they are more likely to tip during different seasons or times of the day... that might be interesting. I certainly don't think we need to ingest all of the available data.


nitroys commented 7 years ago

Oh that's a good idea. I'll do that and add it here.

nitroys commented 7 years ago

Okay, I've added the sample. I did a stratified random sample by month, so we can look at those effects if we decide to. The sample has 120,000 observations, since I picked 10,000 from each month.
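For reference, a sketch of that kind of stratified sample in R, assuming the full-year pull sits in a data frame taxi_full with PICKUP_DATE stored as a Date (the actual pull may have been done differently on the WRDS side):

```r
## Stratified random sample: 10,000 rides from each month
library(dplyr)

set.seed(2017)
taxi_sample <- taxi_full %>%
  mutate(month = format(PICKUP_DATE, "%m")) %>%
  group_by(month) %>%
  slice_sample(n = 10000) %>%
  ungroup()

nrow(taxi_sample)   # 120,000 when all 12 months are present
```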

JestonBlu commented 7 years ago

Nice, thanks a lot!


rmglazner commented 7 years ago

The taxi data looks perfect for what we need. Thank you very much, Shannon! Do we still need to email Dr. Akleman about anything at this point or should we move forward with the stratified random sample of the taxi data now?

Also, how should we go about dividing the tasks? It seems like the way we divided last time worked fairly well, but I am happy to contribute in any way you all see fit. I have also just shared a Google document where we can begin writing and editing each other's work. There is nothing on it but a title; I just wanted to have it available for all of us whenever we are ready.

JestonBlu commented 7 years ago

Yeah, I was just playing with it and it looks promising, so my vote is that we go with that. We don't need to contact Dr. Akleman about anything. She responded to my email and said that we could use the fly data as long as we didn't analyze it the same way I already have, but honestly I think the taxi data is much better anyway.

I'm also okay with the same tasks, but I don't want to hog the model building if someone else wants to lead it. I'm pretty good at putting together nice-looking plots, so I can contribute those to the paper and presentation as well.


rmglazner commented 7 years ago

Where can I find some background information about the taxi data? I am comfortable with helping create the presentation, recording the presentation (and making changes to the video based on feedback), and writing some of the report. If others would like me to help with model building, I am happy to try, but I will admit that it is probably my weakest area of the project. If possible, I would like to focus on the presentation and report writing!

nitroys commented 7 years ago

I did a quick Google search and found this site, which I think is talking about the same dataset, on yellow taxis: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

It's a publicly available dataset, so the source I pulled it from doesn't maintain the documentation. I included the variable list in a PDF in the data folder. I did check it against our dataset, and the only differences are that the vendor ID and payment type variables are character codes and not numeric ones. But it's pretty easy to figure out which codes mean what.

nitroys commented 7 years ago

I'm also willing to help write some models and/or part of the report!

profgeraci commented 7 years ago

Hi everyone. I'm following this convo while I'm on the road. Going to Boston to celebrate my mom's 80th birthday! Obviously I'll help with whatever I can. I'm fine with the same roles as last time. I won't be able to look at the taxi data until Monday, but I'm OK with this approach.


JestonBlu commented 7 years ago

I think this data is really interesting... here are some quick insights that I have found. It looks like there isn't necessarily a difference in tip percentages by month, but it certainly looks like there is one based on the time of day.


rmglazner commented 7 years ago

Thank you for sharing the background information, Shannon! I will try to start putting together an introduction to the report on the Google document later today.

Joseph, the images you tried posting here unfortunately did not appear, but the trend you have noticed seems interesting. Just so that I am understanding your modeling correctly, when you say "tip percentages," does that mean tips are binary responses (0 no tip, 1 tip) and the percent is the percent of rides where there is a tip?

Also, happy birthday to your mom, Anne!

JestonBlu commented 7 years ago

Sorry, I was doing it through email instead of the discussion thread. It must not have liked that. Here are the plots I was talking about.

I think the wording is going to be tricky, but yes, I created a binary response, so the percentages reflect the proportion of rides where the driver received a tip... not the tip as a percentage of the total fare.

[Plots: proportion of rides tipped, by month and by hour of the day]
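For anyone recreating plots like these, a sketch of how the hourly proportions can be computed, assuming the sample is loaded as taxi and PICKUP_TIME is seconds past midnight (the usual SAS convention):

```r
## Proportion of rides with any tip, by hour of day
library(dplyr)

taxi %>%
  mutate(tipped = TIP_AMOUNT > 0,
         hour   = floor(PICKUP_TIME / 3600)) %>%   # seconds -> hour of day
  group_by(hour) %>%
  summarise(pct_tipped = mean(tipped))
```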

rmglazner commented 7 years ago

The second graph does look very promising!

rmglazner commented 7 years ago

Also, I will be away tomorrow and Sunday, so I won't be able to respond to anything until Monday. I just wanted to let you all know!

JestonBlu commented 7 years ago

I should have some time to do some exploratory analysis this weekend. I'll start a new thread for that.