bcompton1 / PortfolioProjects

1 stars 0 forks source link

Data Cleaning in SQL #1

Open bcompton1 opened 11 months ago

bcompton1 commented 11 months ago

Data Cleaning Project for My Portfolio

Introduction:

Welcome to my data cleaning project! In this project, I worked with a dataset from the "NashvilleHousing" table in the "PortfolioProject" database using SQL queries. I'll walk you through what I did, how I did it, and the results I achieved.

Step 1: Standardize Date Format

To begin, I noticed that the date format in the "SaleDate" column was inconsistent. To address this, I used the CONVERT function to standardize the date format to the Date data type. After the conversion, I updated the "SaleDate" column in the "NashvilleHousing" table, ensuring a uniform date format. As a backup, I also added a new column called "SaleDateConverted" to store the converted dates.

Step 2: Populate Property Address Data

I identified records with missing "PropertyAddress" values. To fix this, I cleverly joined the "NashvilleHousing" table with itself based on the "ParcelID" and found records with missing "PropertyAddress" values. I then updated these records by filling them with the "PropertyAddress" from matching records, effectively completing the dataset.

Step 3: Breaking out Address into Individual Columns

Address information was initially stored in a single column, making it less usable. I used the SUBSTRING function to split the "PropertyAddress" into separate "Address" and "City" columns, making it more structured. Two new columns, "PropertySplitAddress" and "PropertySplitCity," were added to accommodate the split values.

Step 4: Splitting Owner Address

Similar to the "PropertyAddress," I split the "OwnerAddress" into separate columns for "Address," "City," and "State." This was achieved using the PARSENAME and REPLACE functions. New columns, "OwnerSplitAddress," "OwnerSplitCity," and "OwnerSplitState," were introduced to store these split values, making it easier to analyze.

Step 5: Change 'Y' and 'N' to 'Yes' and 'No' in "Sold as Vacant" Field

In the "SoldAsVacant" column, 'Y' and 'N' represented values that could be better understood as 'Yes' and 'No.' I transformed these values using a CASE statement, making the data more intuitive.

Step 6: Remove Duplicates

To ensure data integrity, I removed duplicate records. I used a Common Table Expression (CTE) with the ROW_NUMBER() function to assign unique row numbers based on specific columns. Only records with row numbers greater than 1 were selected, effectively eliminating duplicates.

Step 7: Delete Unused Columns

Lastly, I identified and removed columns that were no longer necessary in the dataset. Using the ALTER TABLE statement, I dropped columns such as "OwnerAddress," "TaxDistrict," "PropertyAddress," and "SaleDate," streamlining the dataset for better clarity.

Step 8: Importing Data using OPENROWSET and BULK INSERT (Optional)

As an optional advanced step, I provided information on how to import data from an external source using either the BULK INSERT or OPENROWSET method. This demonstrates an alternative way to add data to your database.

In conclusion, this data cleaning project involved a series of steps aimed at improving data quality, structure, and usability. The cleaned dataset is now ready for further analysis or reporting, and I'm excited to showcase this project in my portfolio!

/*

Cleaning Data in SQL Queries

*/

Select * From PortfolioProject.dbo.NashvilleHousing


-- Standardize Date Format

Select saleDateConverted, CONVERT(Date,SaleDate) From PortfolioProject.dbo.NashvilleHousing

Update NashvilleHousing SET SaleDate = CONVERT(Date,SaleDate)

-- If it doesn't Update properly

ALTER TABLE NashvilleHousing Add SaleDateConverted Date;

Update NashvilleHousing SET SaleDateConverted = CONVERT(Date,SaleDate)


-- Populate Property Address data

Select * From PortfolioProject.dbo.NashvilleHousing --Where PropertyAddress is null order by ParcelID

Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress) From PortfolioProject.dbo.NashvilleHousing a JOIN PortfolioProject.dbo.NashvilleHousing b on a.ParcelID = b.ParcelID AND a.[UniqueID ] <> b.[UniqueID ] Where a.PropertyAddress is null

Update a SET PropertyAddress = ISNULL(a.PropertyAddress,b.PropertyAddress) From PortfolioProject.dbo.NashvilleHousing a JOIN PortfolioProject.dbo.NashvilleHousing b on a.ParcelID = b.ParcelID AND a.[UniqueID ] <> b.[UniqueID ] Where a.PropertyAddress is null


-- Breaking out Address into Individual Columns (Address, City, State)

Select PropertyAddress From PortfolioProject.dbo.NashvilleHousing --Where PropertyAddress is null --order by ParcelID

SELECT SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 ) as Address , SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress)) as Address

From PortfolioProject.dbo.NashvilleHousing

ALTER TABLE NashvilleHousing Add PropertySplitAddress Nvarchar(255);

Update NashvilleHousing SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 )

ALTER TABLE NashvilleHousing Add PropertySplitCity Nvarchar(255);

Update NashvilleHousing SET PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress))

Select * From PortfolioProject.dbo.NashvilleHousing

Select OwnerAddress From PortfolioProject.dbo.NashvilleHousing

Select PARSENAME(REPLACE(OwnerAddress, ',', '.') , 3) ,PARSENAME(REPLACE(OwnerAddress, ',', '.') , 2) ,PARSENAME(REPLACE(OwnerAddress, ',', '.') , 1) From PortfolioProject.dbo.NashvilleHousing

ALTER TABLE NashvilleHousing Add OwnerSplitAddress Nvarchar(255);

Update NashvilleHousing SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 3)

ALTER TABLE NashvilleHousing Add OwnerSplitCity Nvarchar(255);

Update NashvilleHousing SET OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 2)

ALTER TABLE NashvilleHousing Add OwnerSplitState Nvarchar(255);

Update NashvilleHousing SET OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 1)

Select * From PortfolioProject.dbo.NashvilleHousing


-- Change Y and N to Yes and No in "Sold as Vacant" field

Select Distinct(SoldAsVacant), Count(SoldAsVacant) From PortfolioProject.dbo.NashvilleHousing Group by SoldAsVacant order by 2

Select SoldAsVacant , CASE When SoldAsVacant = 'Y' THEN 'Yes' When SoldAsVacant = 'N' THEN 'No' ELSE SoldAsVacant END From PortfolioProject.dbo.NashvilleHousing

Update NashvilleHousing SET SoldAsVacant = CASE When SoldAsVacant = 'Y' THEN 'Yes' When SoldAsVacant = 'N' THEN 'No' ELSE SoldAsVacant END


-- Remove Duplicates

WITH RowNumCTE AS( Select *, ROW_NUMBER() OVER ( PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference ORDER BY UniqueID ) row_num

From PortfolioProject.dbo.NashvilleHousing --order by ParcelID ) Select * From RowNumCTE Where row_num > 1 Order by PropertyAddress

Select * From PortfolioProject.dbo.NashvilleHousing


-- Delete Unused Columns

Select * From PortfolioProject.dbo.NashvilleHousing

ALTER TABLE PortfolioProject.dbo.NashvilleHousing DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate



--- Importing Data using OPENROWSET and BULK INSERT

-- More advanced and looks cooler, but have to configure server appropriately to do correctly -- Wanted to provide this in case you wanted to try it

--sp_configure 'show advanced options', 1; --RECONFIGURE; --GO --sp_configure 'Ad Hoc Distributed Queries', 1; --RECONFIGURE; --GO

--USE PortfolioProject

--GO

--EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'AllowInProcess', 1

--GO

--EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'DynamicParameters', 1

--GO

---- Using BULK INSERT

--USE PortfolioProject; --GO --BULK INSERT nashvilleHousing FROM 'C:\Temp\SQL Server Management Studio\Nashville Housing Data for Data Cleaning Project.csv' -- WITH ( -- FIELDTERMINATOR = ',', -- ROWTERMINATOR = '\n' --); --GO

---- Using OPENROWSET --USE PortfolioProject; --GO --SELECT * INTO nashvilleHousing --FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0', -- 'Excel 12.0; Database=C:\Users\alexf\OneDrive\Documents\SQL Server Management Studio\Nashville Housing Data for Data Cleaning Project.csv', [Sheet1$]); --GO

niiobuobi commented 9 months ago

Thanks so much for the hard work done by putting your work for some of us to learn from. It helped me solve a my cleaning work that i was struggling with. Thanks again.