Brainstorming Topic #1

Open ejcer opened 8 years ago

ejcer commented 8 years ago

This place is where we'll brainstorm everything

ejcer commented 8 years ago

Practice/Warm up/(not for actual project): https://www.kaggle.com/c/titanic

I'm actually really interested in this one, because of this: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

This course also used the Titanic dataset in its decision tree lectures to predict who lives and who dies, and to visualize the result: https://www.coursera.org/course/datasci

I feel like it would be a good way to get a feel for how Kaggle works.

student-rahul commented 8 years ago

Great.. I'll check these out. My account is rahul@vt.edu

ejcer commented 8 years ago

This was posted on the project page: http://www.data.gov/

Inside you can find: http://www.data.gov/education/

With this we could look at which factors affect student performance. For example, how much sleep they get, whether their parents are divorced, or how far they live from school.

ejcer commented 8 years ago

We could use the presidential campaign data from here: https://www.quora.com/Jeff-Hammerbacher/Introduction-to-Data-Science-Data-Sets

to predict the presidential elections based on polling and how much money each candidate raises for their campaign

ejcer commented 8 years ago

We could use the census data from here: http://aws.amazon.com/public-data-sets/

as a training set and see if a model we create is any good on future values
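A minimal sketch of what "train on past data, check on future values" could look like. The (year, value) pairs below are made up for illustration and are not census figures, and the straight-line model is just a stand-in for whatever model we end up building. The point is only that the split is by time, so the model never sees the years it is scored on.

```python
# Hypothetical (year, value) pairs; made-up numbers, not real census data.
records = [
    (2000, 100.0), (2001, 103.0), (2002, 104.0), (2003, 107.0),
    (2004, 108.0), (2005, 111.0), (2006, 112.0), (2007, 115.0),
]

def temporal_split(rows, cutoff_year):
    """Train on years up to and including the cutoff, test on later years."""
    train = [r for r in rows if r[0] <= cutoff_year]
    test = [r for r in rows if r[0] > cutoff_year]
    return train, test

def fit_line(rows):
    """Ordinary least squares fit of value = slope * year + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_absolute_error(rows, slope, intercept):
    """Average absolute gap between the fitted line and the held-out values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)

train, test = temporal_split(records, cutoff_year=2004)
slope, intercept = fit_line(train)
mae = mean_absolute_error(test, slope, intercept)
print(f"slope={slope:.2f}, held-out MAE={mae:.2f}")
```

A random train/test split would leak future years into training, which is why the cutoff is a year rather than a shuffled index.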

ejcer commented 8 years ago

Lol, the memetracker dataset from here: http://snap.stanford.edu/data/

ejcer commented 8 years ago

I would be interested in doing something business related. For example, what if we had a data set that included data points like the capital a business has on hand, plus a bunch of other factors? We could then model the growth rate of a business on its other attributes as well as market factors, or possibly something funny like how many social media accounts post about it, or how much the business is talked about in the news.

student-rahul commented 8 years ago

Yeah, I would like to do something which is business related. Though I am not sure if we can get that kind of financial data. Let's search for that first.

ejcer commented 8 years ago

Going with the weather idea that you had, we could compare the effect weather has on in-store consumer spending with the effect it has on online consumer spending.

ejcer commented 8 years ago

Also, some notes on what we saw today: housing data, crime data, and agricultural production data

ejcer commented 8 years ago

Questions for Dr. Davanloo: what ratio of modeling to visualization is he looking for?

What external libraries is he okay with? I ask because if we're allowed to use an external library like http://spacy.io/ to do the heavy lifting of parsing language, then we might have an easier time using social media data. EX: http://qr.ae/RPhSZX. So what I'm really asking is whether a classification-heavy modeling project is okay. EX: a classifier that figures out which subreddit a reddit submission should go in: https://www.reddit.com/r/datasets/top?sort=top&t=all

ejcer commented 8 years ago

The NOAA website has a tool that can extract daily data points between two dates: http://www.ncdc.noaa.gov/cdo-web/search?datasetid=PRECIP_HLY#

Possibly use in combination with the yelp dataset to predict reviews based on weather conditions in that area

ejcer commented 8 years ago

job market data

ejcer commented 8 years ago

transportation airline dataset:


ejcer commented 8 years ago

I'd be interested in seeing a comparison of software job salaries with mechanical engineering salaries in the 1950s

ejcer commented 8 years ago

I like this kaggle competition: https://www.kaggle.com/c/2013-american-community-survey

student-rahul commented 8 years ago

Interesting Kaggle competitions (All of them are over):

  1. Wind Energy Forecasting - https://www.kaggle.com/c/GEF2012-wind-forecasting
  2. Air Quality prediction - https://www.kaggle.com/c/dsg-hackathon
  3. XBox game prediction - https://www.kaggle.com/c/acm-sf-chapter-hackathon-small
  4. Tourism - https://www.kaggle.com/c/tourism1
  5. Wikipedia hierarchical data classification - https://www.kaggle.com/c/lshtc

student-rahul commented 8 years ago

Interesting DataSets:

  1. NYC Taxi data 2013 (The file size is in gigs!!) https://archive.org/details/nycTaxiTripData2013 and http://www.andresmh.com/nyctaxitrips/
  2. Yelp Dataset challenge 2015 http://www.yelp.com/dataset_challenge
  3. Use Google BigQuery on GDELT database to create own datasets.

utility rates

personal income data with location

where is the README for this? These labels make no sense

possibly use in combination with Zillow data to map where real estate prices increased relative to the average income of the people living there

how the income of different states is changing over time and the dependence of different

covariance matrix of different states, and show whether this matrix is changing over time. The inverse of the covariance matrix is the precision matrix. Positive covariance means both tend to go up (or down) together

conditional independence between states.
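
A rough sketch of the covariance / precision idea from the notes above, using NumPy. The income table is made up (rows are years, columns are three arbitrarily chosen states) and the function names are just for illustration. Under a Gaussian assumption, a near-zero entry in the precision matrix suggests the two states are conditionally independent given the others, and the partial correlations are a rescaled view of the same thing.

```python
import numpy as np

# Hypothetical data: rows are years, columns are states.
# Made-up numbers, not taken from any real dataset.
states = ["VA", "MD", "NC"]
income = np.array([
    [50.0, 60.0, 45.0],
    [52.0, 61.0, 47.0],
    [51.0, 65.0, 46.0],
    [55.0, 66.0, 50.0],
    [58.0, 70.0, 52.0],
    [57.0, 72.0, 53.0],
])

def covariance_and_precision(data):
    """Sample covariance across states and its inverse (the precision matrix)."""
    cov = np.cov(data, rowvar=False)   # columns are the variables (states)
    precision = np.linalg.inv(cov)     # assumes cov is non-singular
    return cov, precision

def partial_correlations(precision):
    """Rescale a precision matrix into partial correlations."""
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

def rolling_covariances(data, window):
    """Covariance matrix for each run of `window` consecutive years."""
    return [
        np.cov(data[start:start + window], rowvar=False)
        for start in range(data.shape[0] - window + 1)
    ]

cov, precision = covariance_and_precision(income)
pcorr = partial_correlations(precision)
windows = rolling_covariances(income, window=4)

print("covariance:\n", np.round(cov, 2))
print("partial correlations:\n", np.round(pcorr, 2))
print("number of rolling windows:", len(windows))
```

Comparing the matrices in `windows` against each other is one way to see whether the dependence between states is drifting over time. With many states and few years the covariance matrix will be singular, so a regularized estimator would be needed before inverting it.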

time series median home price data by location

possibly use in combination with personal income data to map where real estate prices increased relative to the average income of the people living there

ejcer commented 8 years ago

How would we normalize something like the yelp vs weather idea?
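
One possible answer, as a sketch only: standardize each review's star rating against that business's own mean and spread, so a 4-star review at a usually-5-star place counts as "below normal" rather than "good". The review rows and the rainy/dry flag below are invented for illustration and do not reflect the real Yelp schema.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical rows: (business_id, stars, rained_that_day). Made-up data.
reviews = [
    ("cafe", 5, False), ("cafe", 4, True), ("cafe", 5, False), ("cafe", 4, True),
    ("diner", 2, False), ("diner", 1, True), ("diner", 3, False), ("diner", 2, True),
]

def standardize_by_business(rows):
    """Replace raw stars with a z-score relative to each business's own ratings."""
    stars_by_business = defaultdict(list)
    for business, stars, _ in rows:
        stars_by_business[business].append(stars)
    stats = {b: (mean(v), pstdev(v)) for b, v in stars_by_business.items()}

    standardized = []
    for business, stars, rained in rows:
        mu, sigma = stats[business]
        # A business whose ratings never vary carries no signal; use 0.
        z = 0.0 if sigma == 0 else (stars - mu) / sigma
        standardized.append((business, z, rained))
    return standardized

scored = standardize_by_business(reviews)
rainy_mean = mean(z for _, z, rained in scored if rained)
dry_mean = mean(z for _, z, rained in scored if not rained)
print(f"mean z on rainy days: {rainy_mean:.2f}, on dry days: {dry_mean:.2f}")
```

The same trick could be applied per region to the weather side, for example rainfall relative to what is normal for that city.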

ejcer commented 8 years ago

yelp dataset idea: Categorize reviewers based on how likely they are to show sudden signs of extreme enthusiasm to identify trend setters?

ejcer commented 8 years ago

to those interested in following along on this bus, here's our final proposal:


We chose this topic over the Yelp dataset for the following reasons:

  1. We actually have some creative ownership, compared with being shoved in with the Yelp contestant crowd.
  2. There's a ton of totally hidden information in this data set. A quick Google search for this data set reveals that not many people have investigated it. (I don't blame them, it is the IRS after all... Kinda dry stuff)
  3. The data fits really well with a d3.js visualization of the US map with a color scheme.
  4. We don't feel like dealing with a Mongo database, nor do we feel like making a schema for data that fits one.
  5. A fantastic deliverable of this project is a clean [insert your desired flavor of SQL] database, because of the highly fragmented nature of the IRS dataset.
  6. From a project management perspective, 6 weeks is a very tight budget for creating something worth submitting to Yelp's challenge. Choosing the IRS data set gives us time to make something that's worth sharing.