NikitaMartynov / Spam_Analysis

Automation of Phishing emails analysis

Bug: url parsing unbalanced brackets #1

Open NikitaMartynov opened 8 years ago

NikitaMartynov commented 8 years ago

The bug is in the eml_parser module and has not been fixed here so far.

If a URL is composed in a very complex way, eml_parser gets lost in the brackets; so far this has been observed with [ ]. The observed result is that it produces a redundant URL.
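
The failing input is not included in this report, so the following is only a rough sketch of one possible workaround, not the eml_parser code or its fix. It assumes the extra URL comes from a candidate that swallowed an unmatched closing bracket; `trim_unbalanced`, `extract_urls`, and `URL_RE` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical helpers for illustration; none of this is part of eml_parser.
PAIRS = {")": "(", "]": "[", "}": "{"}

# Deliberately loose pattern: grab everything up to whitespace, quotes or <>.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def trim_unbalanced(url):
    """Drop trailing closing brackets that have no matching opener in url."""
    while url and url[-1] in PAIRS:
        closer = url[-1]
        if url.count(closer) > url.count(PAIRS[closer]):
            url = url[:-1]
        else:
            break
    return url


def extract_urls(text):
    """Return URLs in order of appearance, trimmed and without duplicates."""
    found = []
    for match in URL_RE.finditer(text):
        # Strip sentence punctuation first, then unbalanced closers.
        url = trim_unbalanced(match.group(0).rstrip(".,;:!?"))
        if url and url not in found:
            found.append(url)
    return found
```

Brackets that are balanced inside the URL are kept, so `http://example.com/a[1]]` is reduced to `http://example.com/a[1]` rather than being cut at the first bracket.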

geoli commented 8 years ago

Beautiful Soup seems to fit only cases where one has valid HTML pages. Its success depends heavily on HTML tags, which are not present in eml files, so locating URLs in eml files with Beautiful Soup does not seem to be a good approach. If the current eml_parser is not good enough, we might need to try https://github.com/imranghory/urlextractor as an alternative.
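
As a point of comparison, here is a minimal sketch of tag-independent extraction using only the Python standard library. It does not use urlextractor (whose API is not shown in this thread) and is not the current eml_parser behaviour; `urls_from_eml` and `URL_RE` are hypothetical names, and the regular expression is a deliberately simple assumption.

```python
import re
from email import message_from_string, policy

# Simple assumed pattern; real-world URL matching needs more care.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def urls_from_eml(raw):
    """Collect URLs from every text part of a raw eml string."""
    msg = message_from_string(raw, policy=policy.default)
    found = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        body = part.get_content()  # decoded str for text/* parts
        for url in URL_RE.findall(body):
            url = url.rstrip(".,;:!?")  # drop trailing sentence punctuation
            if url not in found:
                found.append(url)
    return found
```

Because the pattern runs over the decoded text of each part, it works the same way on text/plain and text/html bodies and does not rely on well-formed markup.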