Welcome to my data cleaning project! In this project, I worked with a dataset from the "NashvilleHousing" table in the "PortfolioProject" database using SQL queries. I'll walk you through what I did, how I did it, and the results I achieved.
Step 1: Standardize Date Format
To begin, I noticed that the date format in the "SaleDate" column was inconsistent. To address this, I used the CONVERT function to standardize the date format to the Date data type. After the conversion, I updated the "SaleDate" column in the "NashvilleHousing" table, ensuring a uniform date format. As a backup, I also added a new column called "SaleDateConverted" to store the converted dates.
Step 2: Populate Property Address Data
I identified records with missing "PropertyAddress" values. To fix this, I cleverly joined the "NashvilleHousing" table with itself based on the "ParcelID" and found records with missing "PropertyAddress" values. I then updated these records by filling them with the "PropertyAddress" from matching records, effectively completing the dataset.
Step 3: Breaking out Address into Individual Columns
Address information was initially stored in a single column, making it less usable. I used the SUBSTRING function to split the "PropertyAddress" into separate "Address" and "City" columns, making it more structured. Two new columns, "PropertySplitAddress" and "PropertySplitCity," were added to accommodate the split values.
Step 4: Splitting Owner Address
Similar to the "PropertyAddress," I split the "OwnerAddress" into separate columns for "Address," "City," and "State." This was achieved using the PARSENAME and REPLACE functions. New columns, "OwnerSplitAddress," "OwnerSplitCity," and "OwnerSplitState," were introduced to store these split values, making it easier to analyze.
Step 5: Change 'Y' and 'N' to 'Yes' and 'No' in "Sold as Vacant" Field
In the "SoldAsVacant" column, 'Y' and 'N' represented values that could be better understood as 'Yes' and 'No.' I transformed these values using a CASE statement, making the data more intuitive.
Step 6: Remove Duplicates
To ensure data integrity, I removed duplicate records. I used a Common Table Expression (CTE) with the ROW_NUMBER() function to assign unique row numbers based on specific columns. Only records with row numbers greater than 1 were selected, effectively eliminating duplicates.
Step 7: Delete Unused Columns
Lastly, I identified and removed columns that were no longer necessary in the dataset. Using the ALTER TABLE statement, I dropped columns such as "OwnerAddress," "TaxDistrict," "PropertyAddress," and "SaleDate," streamlining the dataset for better clarity.
Step 8: Importing Data using OPENROWSET and BULK INSERT (Optional)
As an optional advanced step, I provided information on how to import data from an external source using either the BULK INSERT or OPENROWSET method. This demonstrates an alternative way to add data to your database.
In conclusion, this data cleaning project involved a series of steps aimed at improving data quality, structure, and usability. The cleaned dataset is now ready for further analysis or reporting, and I'm excited to showcase this project in my portfolio!
/*
Cleaning Data in SQL Queries
*/
Select *
From PortfolioProject.dbo.NashvilleHousing
-- Standardize Date Format
Select saleDateConverted, CONVERT(Date,SaleDate)
From PortfolioProject.dbo.NashvilleHousing
Update NashvilleHousing
SET SaleDate = CONVERT(Date,SaleDate)
-- If it doesn't Update properly
ALTER TABLE NashvilleHousing
Add SaleDateConverted Date;
Update NashvilleHousing
SET SaleDateConverted = CONVERT(Date,SaleDate)
-- Populate Property Address data
Select *
From PortfolioProject.dbo.NashvilleHousing
--Where PropertyAddress is null
order by ParcelID
Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress)
From PortfolioProject.dbo.NashvilleHousing a
JOIN PortfolioProject.dbo.NashvilleHousing b
on a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null
Update a
SET PropertyAddress = ISNULL(a.PropertyAddress,b.PropertyAddress)
From PortfolioProject.dbo.NashvilleHousing a
JOIN PortfolioProject.dbo.NashvilleHousing b
on a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null
-- Breaking out Address into Individual Columns (Address, City, State)
Select PropertyAddress
From PortfolioProject.dbo.NashvilleHousing
--Where PropertyAddress is null
--order by ParcelID
SELECT
SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 ) as Address
, SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress)) as Address
From PortfolioProject.dbo.NashvilleHousing
ALTER TABLE NashvilleHousing
Add PropertySplitAddress Nvarchar(255);
ALTER TABLE NashvilleHousing
Add OwnerSplitAddress Nvarchar(255);
Update NashvilleHousing
SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 3)
ALTER TABLE NashvilleHousing
Add OwnerSplitCity Nvarchar(255);
Update NashvilleHousing
SET OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 2)
ALTER TABLE NashvilleHousing
Add OwnerSplitState Nvarchar(255);
Update NashvilleHousing
SET OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 1)
Select *
From PortfolioProject.dbo.NashvilleHousing
-- Change Y and N to Yes and No in "Sold as Vacant" field
Select Distinct(SoldAsVacant), Count(SoldAsVacant)
From PortfolioProject.dbo.NashvilleHousing
Group by SoldAsVacant
order by 2
Select SoldAsVacant
, CASE When SoldAsVacant = 'Y' THEN 'Yes'
When SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
END
From PortfolioProject.dbo.NashvilleHousing
Update NashvilleHousing
SET SoldAsVacant = CASE When SoldAsVacant = 'Y' THEN 'Yes'
When SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
END
-- Remove Duplicates
WITH RowNumCTE AS(
Select *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID,
PropertyAddress,
SalePrice,
SaleDate,
LegalReference
ORDER BY
UniqueID
) row_num
From PortfolioProject.dbo.NashvilleHousing
--order by ParcelID
)
Select *
From RowNumCTE
Where row_num > 1
Order by PropertyAddress
Select *
From PortfolioProject.dbo.NashvilleHousing
-- Delete Unused Columns
Select *
From PortfolioProject.dbo.NashvilleHousing
ALTER TABLE PortfolioProject.dbo.NashvilleHousing
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate
--- Importing Data using OPENROWSET and BULK INSERT
-- More advanced and looks cooler, but have to configure server appropriately to do correctly
-- Wanted to provide this in case you wanted to try it
--USE PortfolioProject;
--GO
--BULK INSERT nashvilleHousing FROM 'C:\Temp\SQL Server Management Studio\Nashville Housing Data for Data Cleaning Project.csv'
-- WITH (
-- FIELDTERMINATOR = ',',
-- ROWTERMINATOR = '\n'
--);
--GO
---- Using OPENROWSET
--USE PortfolioProject;
--GO
--SELECT * INTO nashvilleHousing
--FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
-- 'Excel 12.0; Database=C:\Users\alexf\OneDrive\Documents\SQL Server Management Studio\Nashville Housing Data for Data Cleaning Project.csv', [Sheet1$]);
--GO
Thanks so much for the hard work done by putting your work for some of us to learn from. It helped me solve a my cleaning work that i was struggling with. Thanks again.
Data Cleaning Project for My Portfolio
Introduction:
Welcome to my data cleaning project! In this project, I worked with a dataset from the "NashvilleHousing" table in the "PortfolioProject" database using SQL queries. I'll walk you through what I did, how I did it, and the results I achieved.
Step 1: Standardize Date Format
To begin, I noticed that the date format in the "SaleDate" column was inconsistent. To address this, I used the CONVERT function to standardize the date format to the Date data type. After the conversion, I updated the "SaleDate" column in the "NashvilleHousing" table, ensuring a uniform date format. As a backup, I also added a new column called "SaleDateConverted" to store the converted dates.
Step 2: Populate Property Address Data
I identified records with missing "PropertyAddress" values. To fix this, I cleverly joined the "NashvilleHousing" table with itself based on the "ParcelID" and found records with missing "PropertyAddress" values. I then updated these records by filling them with the "PropertyAddress" from matching records, effectively completing the dataset.
Step 3: Breaking out Address into Individual Columns
Address information was initially stored in a single column, making it less usable. I used the SUBSTRING function to split the "PropertyAddress" into separate "Address" and "City" columns, making it more structured. Two new columns, "PropertySplitAddress" and "PropertySplitCity," were added to accommodate the split values.
Step 4: Splitting Owner Address
Similar to the "PropertyAddress," I split the "OwnerAddress" into separate columns for "Address," "City," and "State." This was achieved using the PARSENAME and REPLACE functions. New columns, "OwnerSplitAddress," "OwnerSplitCity," and "OwnerSplitState," were introduced to store these split values, making it easier to analyze.
Step 5: Change 'Y' and 'N' to 'Yes' and 'No' in "Sold as Vacant" Field
In the "SoldAsVacant" column, 'Y' and 'N' represented values that could be better understood as 'Yes' and 'No.' I transformed these values using a CASE statement, making the data more intuitive.
Step 6: Remove Duplicates
To ensure data integrity, I removed duplicate records. I used a Common Table Expression (CTE) with the ROW_NUMBER() function to assign unique row numbers based on specific columns. Only records with row numbers greater than 1 were selected, effectively eliminating duplicates.
Step 7: Delete Unused Columns
Lastly, I identified and removed columns that were no longer necessary in the dataset. Using the ALTER TABLE statement, I dropped columns such as "OwnerAddress," "TaxDistrict," "PropertyAddress," and "SaleDate," streamlining the dataset for better clarity.
Step 8: Importing Data using OPENROWSET and BULK INSERT (Optional)
As an optional advanced step, I provided information on how to import data from an external source using either the BULK INSERT or OPENROWSET method. This demonstrates an alternative way to add data to your database.
In conclusion, this data cleaning project involved a series of steps aimed at improving data quality, structure, and usability. The cleaned dataset is now ready for further analysis or reporting, and I'm excited to showcase this project in my portfolio!
/*
Cleaning Data in SQL Queries
*/
Select * From PortfolioProject.dbo.NashvilleHousing
-- Standardize Date Format
Select saleDateConverted, CONVERT(Date,SaleDate) From PortfolioProject.dbo.NashvilleHousing
Update NashvilleHousing SET SaleDate = CONVERT(Date,SaleDate)
-- If it doesn't Update properly
ALTER TABLE NashvilleHousing Add SaleDateConverted Date;
Update NashvilleHousing SET SaleDateConverted = CONVERT(Date,SaleDate)
-- Populate Property Address data
Select * From PortfolioProject.dbo.NashvilleHousing --Where PropertyAddress is null order by ParcelID
Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress) From PortfolioProject.dbo.NashvilleHousing a JOIN PortfolioProject.dbo.NashvilleHousing b on a.ParcelID = b.ParcelID AND a.[UniqueID ] <> b.[UniqueID ] Where a.PropertyAddress is null
Update a SET PropertyAddress = ISNULL(a.PropertyAddress,b.PropertyAddress) From PortfolioProject.dbo.NashvilleHousing a JOIN PortfolioProject.dbo.NashvilleHousing b on a.ParcelID = b.ParcelID AND a.[UniqueID ] <> b.[UniqueID ] Where a.PropertyAddress is null
-- Breaking out Address into Individual Columns (Address, City, State)
Select PropertyAddress From PortfolioProject.dbo.NashvilleHousing --Where PropertyAddress is null --order by ParcelID
SELECT SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 ) as Address , SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress)) as Address
From PortfolioProject.dbo.NashvilleHousing
ALTER TABLE NashvilleHousing Add PropertySplitAddress Nvarchar(255);
Update NashvilleHousing SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 )
ALTER TABLE NashvilleHousing Add PropertySplitCity Nvarchar(255);
Update NashvilleHousing SET PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress))
Select * From PortfolioProject.dbo.NashvilleHousing
Select OwnerAddress From PortfolioProject.dbo.NashvilleHousing
Select PARSENAME(REPLACE(OwnerAddress, ',', '.') , 3) ,PARSENAME(REPLACE(OwnerAddress, ',', '.') , 2) ,PARSENAME(REPLACE(OwnerAddress, ',', '.') , 1) From PortfolioProject.dbo.NashvilleHousing
ALTER TABLE NashvilleHousing Add OwnerSplitAddress Nvarchar(255);
Update NashvilleHousing SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 3)
ALTER TABLE NashvilleHousing Add OwnerSplitCity Nvarchar(255);
Update NashvilleHousing SET OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 2)
ALTER TABLE NashvilleHousing Add OwnerSplitState Nvarchar(255);
Update NashvilleHousing SET OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 1)
Select * From PortfolioProject.dbo.NashvilleHousing
-- Change Y and N to Yes and No in "Sold as Vacant" field
Select Distinct(SoldAsVacant), Count(SoldAsVacant) From PortfolioProject.dbo.NashvilleHousing Group by SoldAsVacant order by 2
Select SoldAsVacant , CASE When SoldAsVacant = 'Y' THEN 'Yes' When SoldAsVacant = 'N' THEN 'No' ELSE SoldAsVacant END From PortfolioProject.dbo.NashvilleHousing
Update NashvilleHousing SET SoldAsVacant = CASE When SoldAsVacant = 'Y' THEN 'Yes' When SoldAsVacant = 'N' THEN 'No' ELSE SoldAsVacant END
-- Remove Duplicates
WITH RowNumCTE AS( Select *, ROW_NUMBER() OVER ( PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference ORDER BY UniqueID ) row_num
From PortfolioProject.dbo.NashvilleHousing --order by ParcelID ) Select * From RowNumCTE Where row_num > 1 Order by PropertyAddress
Select * From PortfolioProject.dbo.NashvilleHousing
-- Delete Unused Columns
Select * From PortfolioProject.dbo.NashvilleHousing
ALTER TABLE PortfolioProject.dbo.NashvilleHousing DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate
--- Importing Data using OPENROWSET and BULK INSERT
-- More advanced and looks cooler, but have to configure server appropriately to do correctly -- Wanted to provide this in case you wanted to try it
--sp_configure 'show advanced options', 1; --RECONFIGURE; --GO --sp_configure 'Ad Hoc Distributed Queries', 1; --RECONFIGURE; --GO
--USE PortfolioProject
--GO
--EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'AllowInProcess', 1
--GO
--EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'DynamicParameters', 1
--GO
---- Using BULK INSERT
--USE PortfolioProject; --GO --BULK INSERT nashvilleHousing FROM 'C:\Temp\SQL Server Management Studio\Nashville Housing Data for Data Cleaning Project.csv' -- WITH ( -- FIELDTERMINATOR = ',', -- ROWTERMINATOR = '\n' --); --GO
---- Using OPENROWSET --USE PortfolioProject; --GO --SELECT * INTO nashvilleHousing --FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0', -- 'Excel 12.0; Database=C:\Users\alexf\OneDrive\Documents\SQL Server Management Studio\Nashville Housing Data for Data Cleaning Project.csv', [Sheet1$]); --GO