Open anthonyjromann opened 1 year ago
This is the project I would most want to work on. For starters, I'm just getting into AI/ML, so this would be a great way to learn and get experience implementing an ML model. Also, I think the concept is great. The day before this presentation, I actually got a text from my mom asking if an email was a phishing scam or not, and it was! I know people like my mom would benefit greatly from a product like this, so I think it's a great idea.
I am currently in a Machine Learning course at Temple, so I'm learning about a lot of the concepts you talk about right now. While I don't have a huge background in ML, I think I'd be able to contribute wherever needed. I think the next step of the project would be to implement this as a Chrome extension for Gmail accounts, or something similar. This might entail using an email's score to sort it into either the inbox or the spam folder. I think it would also be a good idea to notify the user when an email has been flagged as phishing, to warn them not to visit the website.
This seems like an interesting project and I can definitely see its use to prevent phishing attacks on people who accidentally click on links often. I'm interested in learning more about machine learning and this seems like a great project for learning ML. I have experience with JavaScript, Python, and HTML so I would be able to help with development. If possible, maybe it could be made as an extension for your phone browser too.
I love your project idea. Phishing scams are a common and persistent threat to internet users. By providing a reliable and accurate tool for assessing the likelihood that a website is a phishing scam, this project can help reduce the risk of internet users falling victim to these attacks. Although I lack prior knowledge of ML, I'm eager to learn more about it through this project. I have some experience with Python, JavaScript, and HTML, which could be useful in contributing to the development.
Project Abstract
This project is designed to be a program and/or integrated web extension that takes a URL as input and outputs the probability that the URL is a phishing scam. From the user's point of view, all that is required is entering a URL or, in the case of a browser extension, clicking a button. The output is a reliability score for the website in question. The project uses machine learning to predict whether or not a website is a phishing scam.
Conceptual Design
This project will mainly use Python, as it has many packages and libraries for data science and machine learning. Models such as k-nearest neighbors, decision trees, naïve Bayes, and support vector machines will be used for classification. XGBoost, polyfit, and linregress will be used for regression, and regplot from seaborn will be used to visualize data. These tools are available in the NumPy, SciPy, pandas, matplotlib, seaborn, XGBoost, and scikit-learn packages, all freely installable with pip. Python 3 will be used for data preprocessing, model training, model testing, and model evaluation. To implement a web extension, HTML, JavaScript, and CSS will be used. This will create a simple interface that detects the current web URL, passes it through the ML model, and outputs a classification and a regression value indicating the reliability of the website in question. HTML and CSS will be used to build and style the extension, while JavaScript will handle user interaction.
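As a rough illustration of the classification step described above, here is a minimal sketch that trains the four named classifiers with scikit-learn and compares their accuracy. It assumes scikit-learn is installed, and it uses synthetic data from `make_classification` as a stand-in for real URL features, so the numbers it prints say nothing about phishing detection itself. The feature count, split size, and hyperparameters are arbitrary choices for the example, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data (assumption): 8 numeric "URL features",
# label 1 = phishing, label 0 = legitimate.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "naive Bayes": GaussianNB(),
    # probability=True enables predict_proba for the SVM.
    "support vector machine": SVC(probability=True, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    # Column 1 of predict_proba is the estimated probability of class 1.
    phishing_prob = model.predict_proba(X_test[:1])[0, 1]
    results[name] = (accuracy, phishing_prob)
    print(f"{name}: accuracy={accuracy:.3f}, "
          f"P(phishing) for first test row={phishing_prob:.3f}")
```

Using `predict_proba` rather than only `predict` is one way to get the probability-style score the abstract describes out of a classifier; the regression models named above would be an alternative route to a continuous score.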
Proof of Concept
https://github.com/shreyagopal/Phishing-Website-Detection-by-Machine-Learning-Techniques/
This project on GitHub employs similar techniques to build an ML model that determines whether a website is likely a phishing scam. Similar to what I would like to do, it uses known phishing URLs and known legitimate URLs to train an ML model (in this case, it trains multiple models and compares their accuracy) and ultimately predict (classify) a URL's reliability. The features extracted from the URLs are also similar to what I would like to use: features of the URL itself, features of the domain, and HTML/JavaScript features. I would like to create my own project rather than further develop this one for three reasons: it is trained on data that is 2+ years old, it may use out-of-date learning models that could be improved with state-of-the-art techniques, and it cannot perform tasks live in a browser. This project also uses classification, a subset of supervised learning, while I would like to employ regression techniques.
Background
This tool will check URLs for domain squatting, which is when a common website name is registered under a different domain (e.g., registering apple.io vs. apple.com), and URL hijacking, which registers common typos of a website's domain to make a site appear legitimate (e.g., goggle.com vs. google.com). It will also check other details of the URL, such as its length, the number of subdomains, and the Top-Level Domain (TLD). The project will also check the domain name and its IP address to see if it is blacklisted in any commonly known phishing databases (e.g., https://openphish.com/phishing_database.html). Page-based features will be checked to determine how reliable the website seems; websites (e.g., PageRank and AWS) can be used for reference data. Finally, content-based features can be extracted by parsing the code used to build the website to assess its reliability. All of these features combined will feed a decision-tree ML algorithm that produces a score assessing the likelihood that a given URL is a phishing scam. While other projects exist that use similar approaches, this project will employ updated models and train them on newly updated data. With all of the different features used to create a predicted output, this project has the potential to be more accurate than similar projects.
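The URL-based checks described above could be sketched roughly as follows, using only the Python standard library. Everything named here is an assumption made for illustration: `KNOWN_DOMAINS` and `BLOCKLIST` are tiny hypothetical stand-ins for a real list of popular domains and a real phishing feed, `extract_features` and `edit_distance` are hypothetical helper names, and the registered domain is naively taken to be the last two labels of the hostname (which is wrong for suffixes like `.co.uk` and for bare IP addresses). URLs are assumed to include a scheme such as `https://`.

```python
from urllib.parse import urlsplit

# Hypothetical reference data for illustration only.
KNOWN_DOMAINS = {"google.com", "apple.com", "paypal.com"}
BLOCKLIST = {"bad-login.example"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def extract_features(url: str) -> dict:
    """Return simple lexical features for a URL that includes a scheme."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    # Naive assumption: the registered domain is the last two labels.
    registered = ".".join(labels[-2:])
    name = labels[-2] if len(labels) >= 2 else ""
    tld = labels[-1] if labels else ""
    known_names = {domain.split(".")[0] for domain in KNOWN_DOMAINS}
    return {
        "url_length": len(url),
        "num_subdomains": max(len(labels) - 2, 0),
        "tld": tld,
        # Same name as a known site, but under a different TLD.
        "domain_squat": name in known_names and registered not in KNOWN_DOMAINS,
        # Name is one edit away from a known site's name.
        "typosquat": name not in known_names
        and any(edit_distance(name, known) == 1 for known in known_names),
        "blocklisted": host in BLOCKLIST or registered in BLOCKLIST,
    }


features = extract_features("https://login.secure.goggle.com/account")
print(features)
```

In a full pipeline, a dictionary like this would be converted to numeric columns (for example, by one-hot encoding the TLD) before being passed to the model. A threshold of one edit is a deliberately strict choice for the sketch; a real tool would likely need tuning to balance missed typos against false alarms.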
Required Resources
The resources required for this project can be obtained with an internet connection and access to open-source Python libraries. Hardware requirements are modest: the model will be built on a well-equipped machine, and further testing will not require much computing power. Software requirements are the ability to run Python files, the Chrome browser, and macOS, Windows, or Linux.
https://docs.google.com/presentation/d/1_KUlJ17vL0WEtdgDoJxCinlk7ZtBJoOTKugZX-KsiPs/edit?usp=sharing