EstAlvB / Phishing-Detection-with-BERT

This project consists of advanced phishing detection using the BERT masked language model.
Apache License 2.0
10 stars, 5 forks

Request for Assistance with "WEBSITES" Dataset Section in Phishing Detection with BERT Repository #3

Open MehrajIbneHalimBadhon opened 6 days ago

MehrajIbneHalimBadhon commented 6 days ago

Dear EstAlvB,

I hope this message finds you well. My name is Mehraj Ibne Halim, and I am a student at the International Islamic University Chittagong (IIUC), currently working on my thesis, “Detecting Phishing Attacks Using Machine Learning.” I’ve been exploring your “Phishing Detection with BERT” repository on GitHub and found it incredibly relevant to my research.

I am currently focused on working through the "WEBSITES" dataset section in the preprocessing.ipynb file. However, I’m encountering some difficulties in fully understanding and implementing this part of the code. Despite trying various approaches, I am still struggling to proceed effectively. Would it be possible for you to provide some guidance or additional context on this section?

Your insights would be immensely valuable to my thesis, and I would be deeply grateful for any support or advice you could offer. I would be happy to acknowledge your assistance in my research work.

Thank you very much for your time and consideration. I look forward to your response.

Best regards,
Mehraj Ibne Halim
International Islamic University Chittagong (IIUC)
Department: Computer and Communication Engineering (CCE)
CGPA: 3.685
Email: e211018@ugrad.iiuc.ac.bd

EstAlvB commented 14 hours ago

Hi @MehrajIbneHalimBadhon,

Sure, with pleasure. What part are you having trouble with specifically?

It's true that processing the websites dataset is a bit complicated, especially since the dataset is very large and split across several folders.

In the dataset, there's a SQL file that creates a table with information about the websites. The table records whether each website is malicious, its link, and the name of the HTML file it refers to. The code turns this into a database (.db), which is then loaded into a pandas DataFrame. I then searched the folders for the HTML files using the names from the table and filtered them by size. Processing large HTML files was slow for me, but you can use more samples if you have more processing power.
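As a rough illustration of those steps (not the notebook's actual code), here is a minimal sketch. The table name `websites`, the column name `html_file`, the 100 KB size cutoff, and all function names are hypothetical placeholders, so adjust them to whatever the SQL file really defines. It also assumes the SQL file can be run by SQLite as-is; a dump written for another database engine may need small edits first.

```python
import os
import sqlite3
from pathlib import Path

import pandas as pd


def load_index(sql_path, db_path, table="websites"):
    """Run the SQL file into a SQLite .db and return one table as a DataFrame.

    `table` is a hypothetical name; use the one created by the SQL file.
    Re-running against an existing .db may fail on CREATE TABLE, so delete
    the old .db first if you need to rebuild it.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        conn.executescript(Path(sql_path).read_text(encoding="utf-8"))
        return pd.read_sql_query(f"SELECT * FROM {table}", conn)
    finally:
        conn.close()


def attach_html_paths(df, root, name_col="html_file", max_bytes=100_000):
    """Find each row's HTML file under `root` and keep only the small ones.

    `name_col` (hypothetical) is the column holding the HTML file name.
    """
    # Walk the folders once and index files by name, so each lookup is cheap.
    by_name = {p.name: str(p) for p in Path(root).rglob("*") if p.is_file()}
    # Rows whose file is not found get a missing path and are dropped.
    out = df.assign(path=df[name_col].map(by_name.get)).dropna(subset=["path"])
    # Filter by file size on disk, without reading the file contents.
    sizes = out["path"].map(os.path.getsize)
    return out[sizes <= max_bytes].reset_index(drop=True)


def read_html(path):
    """Read one HTML file as text, skipping bytes that are not valid UTF-8."""
    return Path(path).read_text(encoding="utf-8", errors="ignore")
```

On a machine with limited RAM, one option is to take a small random subset of the filtered rows (for example with `DataFrame.sample`) and read only those files one at a time with `read_html`, instead of loading every page into memory at once.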

I've summarized my process, but please let me know if you have specific questions or need further help.

Thank you for considering my project for part of your thesis; it is an honor.

Best regards, Esteban Alvarado

MehrajIbneHalimBadhon commented 13 hours ago

Hi Esteban Alvarado,

Thank you very much for your response! I appreciate the clear explanation about the "WEBSITES" dataset – it’s been very helpful for my project.

With your guidance, I’ve successfully finished the URL, SMS, and Email datasets using your code, and they are well balanced. But I’m having trouble with the "WEBSITES" dataset. I’ve been working on this for over 10 days, trying different things, but I still can’t get it to work. Since the dataset is very large, I’m testing with smaller files from my Google Drive, but I haven’t had any luck. My laptop has an 11th-generation processor, 8GB RAM, and a 512GB SSD, so handling the full dataset has been challenging.

I was able to create a website.db file and uploaded it to Google Drive for easier access, but I’m still stuck on the HTML extraction part. I’ve tried using your code directly, but I keep getting errors. I’ve even used ChatGPT to help solve these issues, but I still can’t make it work.

Could you help me with this, or share any advice on how to handle the dataset with my limited resources? Also, if possible, could I have your email for easier communication? It would also be great if you could provide a step-by-step guide for working on this project, like starting with data preprocessing, then the phishing classification, and then fine-tuning.

If you could share a smaller sample of the "WEBSITES" dataset or any other dataset you might have, that would help me a lot in testing and working through the process.

Your help would mean a lot to me, and I’d be honored to acknowledge your support in my thesis.

Thank you again for your time and help!