Closed: thecoderxman closed this issue 4 years ago
Hey!
Since the project code became a little bit messy and dense, I decided to split it into three parts:
construct_features.py - Downloads each URL's HTML code and tokenizes all the words that will be used to generate the most frequent word list for every available category.
construct_models.py - After the HTML of every website has been parsed, the next step is text normalization. In this script the extracted word tokens of each website's text are normalized by removing stop words and translating non-English words with Google Translator. The main goal of this step is to make the word tokens more English friendly, because the most frequent word list for each category should consist of English words only (it is possible to do this in other languages, but for this project I decided to use English).
train_models.py - Trains and tests the ML models.
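Roughly, the three scripts form one pipeline: download and tokenize, normalize, then train. The sketch below is only a minimal illustration of that flow, not the repository's actual code; the library choices (requests, BeautifulSoup, NLTK, googletrans, scikit-learn) and function names here are illustrative assumptions.

```python
# Hypothetical sketch of the three-stage pipeline; library choices and names
# are assumptions, not necessarily what the repository uses.
import nltk
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

# Stage 1 (construct_features.py): download a page and tokenize its text.
def fetch_tokens(url):
    response = requests.get(url, timeout=10)
    text = BeautifulSoup(response.text, "html.parser").get_text(separator=" ")
    return [token.lower() for token in word_tokenize(text) if token.isalpha()]

# Stage 2 (construct_models.py): drop stop words and translate the rest to English.
def normalize_tokens(tokens):
    # googletrans' API differs between versions; this follows the classic
    # synchronous interface and is only an assumed translation backend.
    from googletrans import Translator
    stop_words = set(stopwords.words("english"))
    translator = Translator()
    normalized = []
    for token in tokens:
        if token in stop_words:
            continue
        try:
            normalized.append(translator.translate(token, dest="en").text.lower())
        except Exception:
            normalized.append(token)  # keep the original token if translation fails
    return normalized

# Stage 3 (train_models.py): train and evaluate a simple classifier.
def train_and_evaluate(documents, labels):
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    features = TfidfVectorizer().fit_transform([" ".join(doc) for doc in documents])
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))
```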
I know that it could be a little bit confusing since the README file is not up to date anymore. I'm going to update it soon.
There are some tasks that I'm going to implement for this project when I get more free time:
1) Update the README file with up-to-date information.
2) Create a script that can predict websites manually, with the URL passed as an argument (that functionality was already implemented in previous commits: https://github.com/domantasm96/URL-categorization-using-machine-learning/blob/bc2a61daeab69458a6d4158120100692a0c272e1/Scripts/predict_url.py). A rough sketch of that idea follows below.
3) Improve the ML models by using more advanced machine learning frameworks. The prediction accuracy should also improve.
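For context on task 2, here is a minimal sketch of what manual prediction from a command-line argument could look like, assuming a previously trained scikit-learn model and vectorizer saved with joblib. The file names and persistence format are placeholders, not taken from the linked predict_url.py.

```python
# Hypothetical sketch of predicting the category of a single URL passed as a
# command-line argument. Artifact names (vectorizer.joblib, model.joblib) are
# placeholders and not the project's actual files.
import argparse

import joblib
import requests
from bs4 import BeautifulSoup

def main():
    parser = argparse.ArgumentParser(description="Predict the category of a URL")
    parser.add_argument("url", help="website URL to classify")
    args = parser.parse_args()

    # Download the page and extract its visible text.
    response = requests.get(args.url, timeout=10)
    text = BeautifulSoup(response.text, "html.parser").get_text(separator=" ")

    # Load a previously saved vectorizer and classifier (placeholder file names).
    vectorizer = joblib.load("vectorizer.joblib")
    model = joblib.load("model.joblib")

    features = vectorizer.transform([text.lower()])
    print("Predicted category:", model.predict(features)[0])

if __name__ == "__main__":
    main()
```

Usage would be something like `python predict_url.py https://example.com`.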
Thanks for asking and if you have any more questions - feel free to ask! :)
Thanks for that, and if possible I will try to improve the accuracy and update you.
That would be great!
Can you please upload the main Python file? It would be very helpful.