Guiding questions
● What tools are you choosing and why?
● Have you ensured your data’s integrity?
● What steps have you taken to ensure that your data is clean?
● How can you verify that your data is clean and ready to analyze?
● Have you documented your cleaning process so you can review and share those results?
Key tasks
Check the data for errors.
Choose your tools.
Transform the data so you can work with it effectively.
Document the cleaning process.
Deliverable
Documentation of any cleaning or manipulation of data
Step 2:
Cleaning the data. First we need to find a way to simplify this information and to clean any possible errors. To do this, we first analyze which information is relevant in each data table and then look for duplicate data, outdated data, incomplete data, incorrect/inaccurate data, and inconsistent data.
Brief review:
-Duplicate data: any record that appears more than once, typically caused by manual data entry, batch data imports, or data migration.
-Outdated data: data that should be replaced with newer, more accurate information, often caused by people changing roles or companies, or by software or systems becoming obsolete.
-Incomplete data: data that is missing important fields, caused by improper data collection or incorrect data entry.
-Incorrect/inaccurate data: data that is complete but inaccurate, caused by human error during data input, fake information, or mock data.
-Inconsistent data: data that uses different formats to represent the same thing, caused by data being stored incorrectly or errors introduced during data transfer.
To clean data with SQL, everything is done through queries. One by one, I will analyze the different tables, eliminate any of the errors above, and identify which fields could be human input so they can be checked further for null, duplicate, outdated, incomplete, or incorrect values. A few example checks are sketched below.
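As a minimal sketch (assuming the merged daily table is named dailyActivity_merged with columns Id, ActivityDate, and Calories, as in the Kaggle export; the actual names may differ), queries like these surface duplicate and incomplete records:

-- Flag duplicate records: the same user logged more than once for the same day
SELECT Id, ActivityDate, COUNT(*) AS times_recorded
FROM dailyActivity_merged
GROUP BY Id, ActivityDate
HAVING COUNT(*) > 1;

-- Flag incomplete records: rows missing key fields
SELECT COUNT(*) AS rows_with_nulls
FROM dailyActivity_merged
WHERE Id IS NULL OR ActivityDate IS NULL OR Calories IS NULL;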
I started with average calories based on the daily activity data, which happens to have all of the different daily activity data merged into one file. (To verify this, I compared how many DISTINCT Id values are included in each of the daily tables that were provided.) Then I ran a query for the average calories per user, to see how many calories people burn daily. I also needed to understand the period over which the data was captured, so I ran another query and ordered it by date. This showed that users burn approximately 2,500 calories a day during the month this data covers, which is spring. It would be interesting to know whether this changes drastically in winter, or where the users are located. The queries below sketch this step.
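These are roughly the queries behind that step, again written against the assumed dailyActivity_merged table (column names taken from the Kaggle export, so treat them as placeholders):

-- How many distinct users appear in the merged daily table
SELECT COUNT(DISTINCT Id) AS distinct_users
FROM dailyActivity_merged;

-- Average calories burned per user per day
SELECT Id, AVG(Calories) AS avg_daily_calories
FROM dailyActivity_merged
GROUP BY Id;

-- Order by date to see when the capture window starts and ends
SELECT Id, ActivityDate, Calories
FROM dailyActivity_merged
ORDER BY ActivityDate;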
To further understand the information, I went back to Kaggle to read the description from the site that provided the data set, which includes the following important information:
This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.
From this we can understand that any data tracked before 03.12 or after 05.12 is outside the stated collection window and should be treated as incorrect, and that the Ids are inconsistent as well: the data contains 33 distinct users, in contrast to the 30 stated in the description. The sketch below shows how both can be checked.
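A rough way to verify both points (assuming ActivityDate is stored as a DATE; in the raw CSV it may be a string that needs to be cast first, and the table name is again an assumption):

-- Records outside the collection window stated in the description
SELECT COUNT(*) AS out_of_range_rows
FROM dailyActivity_merged
WHERE ActivityDate < DATE '2016-03-12' OR ActivityDate > DATE '2016-05-12';

-- Distinct users, compared against the 30 mentioned in the description
SELECT COUNT(DISTINCT Id) AS distinct_users
FROM dailyActivity_merged;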