Open ghost opened 3 years ago
Consider an entry from an Nginx access log (e.g. one configured with "access_log /var/log/nginx/access.log custom;") that looks like this:
153.78.107.192 - - [21/Nov/2017:08:45:45 +0000] "POST /ngx_pagespeed_beacon?url=https%3A%2F%2Fwww.example.com%2Fads%2Ffresh-oranges-1509260795 HTTP/2.0" 204 0 "https://www.suasell.com/ads/fresh-oranges-1509260795" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0" "-" a02b2dea9cf06344a25611c1d7ad72db Uganda UG Kampala Kampala
(source https://www.tecmint.com/configure-custom-access-and-error-log-formats-in-nginx/)
You can pull "https://www.suasell.com/ads/fresh-oranges-1509260795" out of the log entry and then proceed with the method. Some tweaks will be needed depending on your dataset and use case.
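As a rough illustration of that extraction step, here is a minimal sketch using only the standard library. It assumes the custom format quotes the request line and the referrer as the first two double-quoted fields, as in the sample entry above, and that no field contains a literal double quote. The function name `extract_urls` is made up for this example.

```python
import re

# Matches every double-quoted field on the line, in order. In the sample
# entry these are: request line, referrer, user agent, and a trailing "-".
QUOTED = re.compile(r'"([^"]*)"')


def extract_urls(line):
    """Return (request_target, referrer) from one access-log line.

    Assumes the first quoted field is "METHOD target PROTOCOL" and the
    second is the referrer; adjust the indexes for other log formats.
    """
    fields = QUOTED.findall(line)
    if len(fields) < 2:
        raise ValueError("unexpected log format: fewer than two quoted fields")
    request_line, referrer = fields[0], fields[1]
    parts = request_line.split()
    # Fall back to an empty target if the request line is malformed.
    target = parts[1] if len(parts) >= 2 else ""
    return target, referrer
```

Either value (the request target or the referrer) can then be handed to the URL classifier, depending on which one you want to score.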
This model is only suitable when the classification is based on the URL alone. If you want a more accurate model, I would suggest a 'content-based URL classifier'. For that, you will need a 'spam phrase (or sentence) classifier'. To classify a URL, you parse the HTML content behind the URL, pass that content to the 'spam phrase classifier', and mark the URL as spam if the content is classified as spam.
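The content-based pipeline described above could be sketched roughly as follows, using only the standard library. Everything named here (`TextExtractor`, `html_to_text`, `fetch_html`, `toy_phrase_classifier`, `classify_url`) is hypothetical, and the keyword matcher is only a stand-in for a real trained spam phrase classifier.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Step 1: parse the HTML and keep only the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_html(url, timeout=5):
    """Download a page; decoding as UTF-8 is an assumption of this sketch."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def toy_phrase_classifier(text):
    """Placeholder for the spam phrase classifier (NOT a trained model)."""
    phrases = ("free money", "click here to win")
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


def classify_url(url, is_spam_text=toy_phrase_classifier, fetch=fetch_html):
    """Steps 2 and 3: classify the page text, then label the URL."""
    text = html_to_text(fetch(url))
    return "spam" if is_spam_text(text) else "ham"
```

Passing `fetch` and `is_spam_text` as parameters keeps the network call and the model swappable, so a real classifier can replace the placeholder without touching the parsing code.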
Serving both models is not difficult in Flask, since parsing is very easy in Python.
I hope this helps. If you need any further help or a contribution, feel free to contact me.
Have a good day, Vivek Sahani
Hi,
Thanks for the quick reply :-)
I only want to classify the URL, not the page content, because I would like to build a real-time web service, so fetching the content of each page would be overkill (at least for a first draft).
Here is a preview of my project's interface:
I am not really proficient in Python, I am much more of a gopher, so if you could help me draft a skeleton, that would be amazing.
Cheers, Luc Michalski
Hi,
Hope you are all well!
I am posting this issue because I have a question related to your repo and my current project. :-)
I am working on a project that aims to classify legitimate/illegitimate URLs from apache2/nginx access logs; in a nutshell, it is about telling XSS and SQLi requests apart from legitimate ones. I have a dataset of 1M log entries with some additional data per row (ASN, country, city...) and a flag indicating whether each line is legitimate or not.
So my questions would be the following:
Thanks for any insights or inputs on these points.
Take care, Luc Michalski