Is the page rank logic correct?

Oliveirakun commented 3 years ago

Hi,

I've read your article and I'm studying the code in this repo, and I doubt whether the logic is correct in some places. Specifically, I think the logic in this line is inverted: if a page is well ranked it should return 1, is that right?

fafal-abnir commented 3 years ago

Hi, it doesn't affect the results of the models, because in the exception part we set this feature value to 1. I agree that this is not logically or conceptually correct, but it won't affect the final results. Thanks for your feedback.

Oliveirakun commented 3 years ago

Ok thanks! I have another question: was the whole dataset generated using the feature_extraction in this repo?

fafal-abnir commented 3 years ago

Not all of it. We used a dataset (mentioned in the paper) and added some samples with the feature extractor.

Oliveirakun commented 3 years ago

Nice, thanks!

Oliveirakun commented 3 years ago

Is the logic in request url right too? From what I've read in the original paper by the creators of the dataset, the rule is that the more links come from the same domain as the page, the less suspicious it is. The code implemented in the feature extractor does the opposite.

fafal-abnir commented 3 years ago

I think you are making a simple mistake here. I looked at the description of the dataset again (url), and I think both the implementation and the logic are correct.

Oliveirakun commented 3 years ago

That link says: "In legitimate webpages, the webpage address and most of objects embedded within the webpage are sharing the same domain"

Following the logic in the code, let's imagine that I have a page with 10 links and 9 are from the same domain:

percentage = 9 / 10 * 100  # percentage == 90.0

if percentage < 22.0:
    data_set.append(1)
elif 22.0 <= percentage < 61.0:
    data_set.append(0)
else:
    data_set.append(-1)

According to this logic, the result will be -1. But according to the original paper, if the majority of the links are from the same domain (in this case 90% are from the same domain and 10% are from other domains), the site is legitimate. So it should fall into the first case, because only 10% of the links are from other domains and 10 < 22.0. Does that make sense? Is there something that I missed?
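A minimal sketch of the rule as read above, with the thresholds applied to the share of links that point to other domains. The function name `request_url_feature` and its arguments are hypothetical and not taken from the repo. The 22% and 61% cut-offs and the 1 / 0 / -1 labels come from the snippet quoted in this comment, and the handling of a page with no links is an assumption.

```python
def request_url_feature(same_domain_links: int, total_links: int) -> int:
    """Return 1 (legitimate), 0 (suspicious) or -1 (phishing).

    Hypothetical helper: the thresholds are applied to the percentage of
    links that point to OTHER domains, not to the same domain.
    """
    if total_links == 0:
        # Assumption: a page with no links shows no external loading at all.
        return 1

    external_percentage = (total_links - same_domain_links) / total_links * 100

    if external_percentage < 22.0:
        return 1
    elif external_percentage < 61.0:
        return 0
    return -1


# The example from this comment: 10 links, 9 of them on the same domain.
print(request_url_feature(9, 10))
```

With 9 of 10 links on the same domain, only 10% are external, so the page lands in the first branch.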

fafal-abnir commented 3 years ago

Oh! You are right, man. I should correct it!

Oliveirakun commented 3 years ago

Good! The Links in tags feature has the same problem.

Oliveirakun commented 3 years ago

The implementation of Number of Links Pointing to Page also seems incorrect. The original paper describes it as the number of external websites linking to the site, like this service does, and not the total number of links on the site.

Oliveirakun commented 3 years ago

I couldn't find the original dataset that you used and added some examples to. Could you give me the link to this dataset?

fafal-abnir commented 3 years ago

The implementation of Number of Links Pointing to Page also seems incorrect. The original paper describes it as the number of external websites linking to the site, like this service does, and not the total number of links on the site.

For this feature, I forgot to push the latest version of the code. I will correct this problem as soon as I can.

fafal-abnir commented 3 years ago

This is the final dataset we used

On Wed, Feb 10, 2021 at 9:33 PM Francis Oliveira notifications@github.com wrote:

I didn't find the original dataset that you used and added some examples, could you give me the link for this dataset?


Oliveirakun commented 3 years ago

Ok, do you plan to generate another dataset after fixing these issues?

fafal-abnir commented 3 years ago

I am busy defending my master's thesis. I will start correcting these issues in about one or two months and generate a new dataset. Thank you for your tips and support.

Oliveirakun commented 3 years ago

Nice @fafal-abnir! Thanks for your attention and good luck on your master's thesis defense

fafal-abnir commented 3 years ago

Some of the URLs were gathered from available datasets, but for some of the phishing samples (from the PhishTank site) the features were extracted with this feature_extraction. It takes a long time to gather features because it works sequentially and some of the features are gathered from other services (e.g. PageRank, traffic). You could improve the performance by making feature_extraction multi-threaded or multi-process, or by ignoring some of the features.
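A rough sketch of the multi-threaded idea, assuming the per-URL work is mostly waiting on network calls. `extract_all` and `extract_one` are hypothetical names; `extract_one` stands in for whatever function in feature_extraction handles a single URL, and the worker count is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def extract_all(
    urls: List[str],
    extract_one: Callable[[str], List[int]],
    max_workers: int = 16,
) -> List[Optional[List[int]]]:
    """Run extract_one over urls concurrently, keeping the input order."""

    def safe_extract(url: str) -> Optional[List[int]]:
        try:
            return extract_one(url)
        except Exception:
            # One failing URL should not abort the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as urls.
        return list(pool.map(safe_extract, urls))


if __name__ == "__main__":
    # Stand-in extractor for illustration only; swap in the real one.
    def fake_extract(url: str) -> List[int]:
        return [1 if url.startswith("https://") else -1]

    print(extract_all(["https://example.com", "http://example.org"], fake_extract))
```

Threads suit this case because the slow part is waiting on external services. If the third-party services rate-limit requests, a lower `max_workers` may be needed.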

On Tue, Mar 9, 2021 at 4:56 PM shoviknandy notifications@github.com wrote:

Ok thanks! I have another question: The whole dataset was generated using the feature_extraction in this repo?

Did you try it? It's taking me too long for 100 URLs. Is that normal?


fafal-abnir commented 3 years ago

You can get the top 10,000 websites from Alexa (e.g. google, amazon, ...).

On Wed, Mar 10, 2021 at 9:38 AM shoviknandy notifications@github.com wrote:

Ok thanks! Also, do you know where I can get a dataset of safe URLs? PhishTank seems to have phishing emails, but I can't find one for safe ones.


alexx8bits commented 3 years ago

Hello. I understand that phishing.csv is a dataset generated by you, @fafal-abnir, using feature_extraction.py. However, I have tried training different algorithms with this dataset and I am getting much poorer results than with the same algorithms applied to the original Kaggle dataset.csv. I wonder if this is related to the issue that @Oliveirakun mentioned above. That would make sense: if feature_extraction.py is not working well, phishing.csv will not be good for training models to predict phishing websites. @fafal-abnir, are the results shown in the paper from dataset.csv, phishing.csv, or both of them combined?

fafal-abnir / phishing_detection

Is the page rank logic correct? #2