UBC-MDS / dsci-522_group-28

This is the repository for DSCI 522 Group 28
MIT License

Choosing dataset #3

Closed debanandasarkar closed 3 years ago

debanandasarkar commented 3 years ago

This issue is for choosing the project dataset and identifying the scope of the inferential or predictive analysis.

Please comment on the following datasets or feel free to propose new datasets:

JaredSplinter commented 3 years ago

My votes are for the Hotel Dataset, as I think a good predictive model could be made fairly easily, or the COVID-19 Johns Hopkins University Dataset, as there is a lot of information to formulate a question from. However, it is a massive dataset (I noticed 58,042 rows) and it is still being added to: there were entries corresponding to today, November 20th, so it is constantly changing.

JaredSplinter commented 3 years ago

Alternatively, I found this data set from the CDC here. There are datasets for the number of tests performed and the percent positive for each of the states as well as other interesting datasets that can be found.

There are also a lot of COVID-19 datasets at HealthData.gov here if we are interested in exploring others.

debanandasarkar commented 3 years ago

> My votes are for the Hotel Dataset, as I think a good predictive model could be made fairly easily, or the COVID-19 Johns Hopkins University Dataset, as there is a lot of information to formulate a question from. However, it is a massive dataset (I noticed 58,042 rows) and it is still being added to: there were entries corresponding to today, November 20th, so it is constantly changing.

I am also leaning towards these two datasets. If we end up using the JHU data, we can restrict our analysis to the first wave only. That way we can control the data and keep it static for our analysis.
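A minimal sketch of what restricting the data to a fixed window could look like, assuming each row carries an ISO-formatted `date` field. The field names, toy values, and cutoff dates below are hypothetical and not taken from the JHU data; the actual first-wave boundary would still need to be agreed on.

```python
from datetime import date


def restrict_to_window(rows, start, end, date_key="date"):
    """Keep only rows whose ISO-formatted date falls within [start, end]."""
    return [
        row for row in rows
        if start <= date.fromisoformat(row[date_key]) <= end
    ]


# Toy rows standing in for the real time series (values are made up).
rows = [
    {"date": "2020-03-15", "confirmed": 10},
    {"date": "2020-06-30", "confirmed": 250},
    {"date": "2020-11-20", "confirmed": 900},
]

# Hypothetical window for the "first wave"; both bounds are assumptions.
first_wave = restrict_to_window(rows, date(2020, 1, 22), date(2020, 6, 30))
print(len(first_wave))
```

Filtering on a fixed end date means later additions to the source no longer change the analysis input, which is the point of making the data static.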

debanandasarkar commented 3 years ago

Here is the GitHub link for the hotel data: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-02-11

Looks like it's okay to use this in terms of license.

@xudongyang2 : Can you please check as well?

xudong-Y commented 3 years ago

I did some research on the license and access terms of the journal article behind the hotel dataset. The "get rights and content" link under the article's title leads to a page stating that "This is an open access article distributed under the terms of the Creative Commons CC-BY license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. You are not required to obtain permission to reuse this article."

The article is labeled as open access on this website, which explains what it means by "open access": "All articles in open access journals which are published by Elsevier have undergone peer review and upon acceptance are immediately and permanently free for everyone to read and download."

https://www.elsevier.com/open-access/open-access-journals

Let me know what you think of this.

xudong-Y commented 3 years ago

I'm good with either of these 2 datasets. The hotel dataset is easier; however, COVID is a very hot topic right now and it would be cool to do some analysis around it. I am a little concerned about doing inferential analysis on the COVID data, since it is population data, or at least not a random sample. We need to pay attention to data independence as well.

chenzhao2020 commented 3 years ago

Inferential questions for dataset candidates

  1. Death rate between different provinces
  2. Time of first wave peak in different provinces
  3. Predict time of second wave peak

  1. Avg temp for diff countries
  2. Highest or avg temp for northern and southern hemispheres
  3. Rate of temp increase at the equator vs. the poles

  1. Stay nights for diff seasons
  2. Covid impact on booking rate
  3. Covid impact on cancellation rate
  4. Parking rates for diff seasons

Similar to the temp dataset from Berkeley Earth

  1. Death rate between diff countries
  2. Infection rate for diff countries
  3. Both rates for northern and southern hemispheres
  4. Both rates for people of diff ages

chenzhao2020 commented 3 years ago

> I'm good with either of these 2 datasets. The hotel dataset is easier; however, COVID is a very hot topic right now and it would be cool to do some analysis around it. I am a little concerned about doing inferential analysis on the COVID data, since it is population data, or at least not a random sample. We need to pay attention to data independence as well.

Good points on the independence
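To make the independence concern concrete, here is a rough sketch of how a question like "stay nights for diff seasons" could be framed as a two-sided permutation test on the difference in means. The function name, the toy numbers, and the choice of a permutation test are all assumptions for illustration, not something decided in this thread. The test treats observations as exchangeable under the null, so it is only meaningful if the rows are independent.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=522):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Made-up stay lengths (in nights); not drawn from the hotel dataset.
summer_nights = [4, 5, 3, 6, 4, 5]
winter_nights = [2, 3, 2, 4, 3, 2]

observed_diff, p_value = permutation_test(summer_nights, winter_nights)
print(observed_diff, p_value)
```

If bookings from the same guest or the same group appear as several rows, the shuffling step would overstate the evidence, which is the kind of dependence worth checking for first.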

JaredSplinter commented 3 years ago

Hotel Dataset chosen. link