Phishing attacks have emerged as a significant and persistent threat in the digital landscape, targeting individuals, organizations, and even governments. These deceptive techniques employed by cybercriminals aim to trick unsuspecting users into divulging sensitive information, such as login credentials, financial details, or personal data.
Research shows that over 48% of emails sent in 2022 were spam, with an estimated 3.4 billion spam emails sent every day. Globally, 323,972 internet users fell victim to phishing attacks in 2021. With an average of $136 lost per phishing attack, this amounts to $44.2 million stolen by cybercriminals through phishing attacks in 2021.
Phishing attacks pose a significant threat to online users, compromising their privacy, financial security, and trust in online interactions. Detecting and mitigating phishing sites remains challenging, requiring effective techniques to identify and differentiate between legitimate and malicious websites accurately.
Existing phishing detection methods often struggle to keep pace with the evolving tactics employed by cybercriminals, necessitating the development of an enhanced approach for phishing site detection.
Therefore, there is a critical need for an improved system that combines advanced machine learning techniques, feature engineering, and behavioural analysis to detect phishing sites accurately and efficiently. By addressing these challenges, the proposed methodology aims to improve the security of online users, protect their sensitive information, and foster a safer digital environment.
The aim is to contribute to developing a more secure digital environment by offering an advanced approach to phishing site detection. By accurately identifying and mitigating phishing threats, the proposed model will enhance the safety and trustworthiness of online interactions, protecting users from falling victim to phishing attacks.
In the following sections, we will discuss the related literature, present the methodology, describe the experiments and results, and conclude with the implications and future directions of the research.
• Datasets containing phishing and legitimate websites are collected from open sources such as PhishTank.
• Write code to extract the required features from the URL database.
• Analyze and preprocess the dataset by using EDA techniques.
• Divide the dataset into training and testing sets.
• Run selected machine learning and deep neural network algorithms on the dataset, such as Decision Tree, Random Forest, Multilayer Perceptrons, XGBoost, Autoencoder Neural Networks, and Support Vector Machines.
• Write code to display the evaluation results using accuracy metrics.
• Compare the obtained results for trained models and specify which is better.
1) TensorFlow
2) NumPy
3) Pandas
4) scikit-learn
The phishing URLs are collected from the open-source service PhishTank.
This service provides a set of phishing URLs in multiple formats, such as CSV and JSON, that is updated hourly. From this dataset, 5000 random phishing URLs are collected to train the machine learning models.
The legitimate URLs are obtained from the open datasets of the University of New Brunswick. This dataset has a collection of benign, spam, phishing, malware, and defacement URLs. Of these types, only the benign URL dataset is considered for this project. From this dataset, 5000 random legitimate URLs are collected to train the ML models.
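The sampling and labelling step described above could look roughly like the sketch below. The column name `url`, the helper `sample_and_label`, and the small in-memory frames are assumptions made for illustration; in the project the inputs would be the PhishTank and University of New Brunswick files, with 5000 rows drawn from each.

```python
import pandas as pd


def sample_and_label(urls: pd.DataFrame, n: int, label: int, seed: int = 12) -> pd.DataFrame:
    """Draw n random rows and attach a class label (1 = phishing, 0 = legitimate)."""
    sampled = urls.sample(n=n, random_state=seed).reset_index(drop=True)
    sampled["label"] = label
    return sampled


# Toy stand-ins for the real source files (hypothetical data).
phishing = pd.DataFrame({"url": [f"http://phish{i}.example/login" for i in range(20)]})
legitimate = pd.DataFrame({"url": [f"https://site{i}.example/home" for i in range(20)]})

# The project would use n=5000 for each class; 5 keeps the example small.
dataset = pd.concat(
    [sample_and_label(phishing, 5, 1), sample_and_label(legitimate, 5, 0)],
    ignore_index=True,
)
print(dataset)
```

Fixing `random_state` keeps the sample reproducible between runs, which makes later model comparisons easier to interpret.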
The following categories of features are extracted from the URL data:
Address Bar-based Features
• In this category, 9 features are extracted.
Domain-based Features
• In this category, 4 features are extracted.
HTML & Javascript-based Features
• In this category, 4 features are extracted.
In total, 17 features are extracted from the 10,000-URL dataset and stored in the '5.urldata.csv' file in the Data Files folder.
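As an illustration of what address-bar feature extraction can look like, here is a minimal sketch using only the Python standard library. The specific features and the names `has_ip_address` and `extract_address_bar_features` are assumptions chosen for the example; they are common heuristics in phishing detection and are not necessarily the nine features used in this project.

```python
import ipaddress
from urllib.parse import urlparse


def has_ip_address(url: str) -> int:
    """Return 1 if the host part of the URL is a raw IP address, else 0."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


def extract_address_bar_features(url: str) -> dict:
    """Compute a few illustrative address-bar features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path_segments = [segment for segment in parsed.path.split("/") if segment]
    return {
        "have_ip": has_ip_address(url),         # IP address used instead of a domain name
        "have_at": int("@" in url),             # '@' symbol present in the URL
        "url_length": len(url),                 # total number of characters
        "url_depth": len(path_segments),        # number of non-empty path segments
        "https_in_domain": int("https" in host),  # 'https' token embedded in the host name
        "prefix_suffix": int("-" in host),      # hyphen in the host name
    }


features = extract_address_bar_features("http://192.168.0.1/a/b/login.php")
print(features)
```

Each feature is encoded as an integer so that the resulting rows can be written directly to a CSV file and fed to the classifiers.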
Before starting the ML model training, the data is split into 80-20, i.e., 8000 training samples & 2000 testing samples. From the dataset, it is clear that this is a supervised machine-learning task.
This is a classification problem, as each input URL is classified as phishing (1) or legitimate (0).
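The 80-20 split can be sketched with scikit-learn's `train_test_split`. The feature matrix below is random synthetic data with the same shape as the project's dataset (10,000 rows, 17 features), and the `stratify` argument and `random_state` value are assumptions for this example, not settings confirmed by the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 17 extracted features (hypothetical values).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10_000, 17))
# 5000 legitimate (0) and 5000 phishing (1) labels.
y = np.repeat([0, 1], 5_000)

# Hold out 20% of the rows for testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)

print(X_train.shape, X_test.shape)
```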
The supervised machine learning models (classification) considered to train the dataset in this project are:
• Decision Tree
• Random Forest
• Multilayer Perceptrons
• XGBoost
• Autoencoder Neural Network
• Support Vector Machines
Each model is trained and saved, and its training and testing accuracy are calculated.
After training for 50 epochs, the XGBoost model achieved a good accuracy: 86.7% on the training set and 85.8% on the testing set.
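A minimal sketch of the train-and-compare loop is shown below, assuming scikit-learn only. It uses two of the listed models on synthetic data with made-up labels, so the accuracies it prints are not the project's results. XGBoost and the neural network models are omitted to keep the example dependency-light, and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary features (hypothetical data, not the project's dataset).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1_000, 17))
# Made-up rule: label is 1 when at least two of the first three features are set.
y = (X[:, :3].sum(axis=1) >= 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12
)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
}

# Fit each model and record accuracy on both the training and testing sets.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
    }

for name, scores in results.items():
    print(f"{name}: train={scores['train_accuracy']:.3f}, test={scores['test_accuracy']:.3f}")
```

Reporting training and testing accuracy side by side makes overfitting visible: a large gap between the two suggests the model has memorised the training data.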
1) **Browser Extension:** This project can be taken further by building a browser extension with a GUI.
2) The machine learning models shown here can be served as REST API endpoints, which can then be used with add-ons to detect phishing websites in real time.
3) As this is a software solution, it can be integrated into various platforms with minimal issues and effort. Furthermore, as new links are encountered, accuracy can be continually improved using real-time feedback from users.