MikeMeliz / TorCrawl.py

Crawl and extract (regular or onion) webpages through TOR network
GNU General Public License v3.0

Improved Keyword Searching #14

Closed the-siegfried closed 2 years ago

the-siegfried commented 2 years ago

Problem Statement When scanning sites, whether on the clearnet or across Tor, the user is often searching for keywords that can be used to categorise the page/site, or simply to identify and extract the content where their keyword(s) matched.

Currently, the solution can perform basic keyword searches by piping the extracted output to grep. However, this is strictly limited to execution flows that exclude crawling, so it is currently not possible to crawl, extract, and use pipe commands in the same run.

python torcrawl.py -v -w -u https://google.com -c -d 2 -p 5 -e | grep 'Lucky'

In the example above, we pipe the output to grep to perform the keyword search for 'Lucky' in the extracted content of any links discovered during the crawl. For this to work, you would have to run the application twice, performing the crawl and extract operations separately.

python torcrawl.py -v -w -u https://google.com -c -d 2 -p 5
python torcrawl.py -v -w -u https://google.com -i "path/to/input_file.txt"

That works, but it does not let the user conveniently record the links that contain the keywords, together with their content, the way 'cinex' does. It also means the user has to get creative if they are searching for multiple keywords or want to categorise and index the results.

Describe the solution you'd like Accept a new argument to act as a flag for keyword searching, and refactor the implementation of the 'cinex' extractor method to support Yara rule-based keyword searching and categorisation.

```python
if yara:
    # Run the Yara rules against the lowercased page text;
    # skip this page if no rule matched.
    full_match_keywords = check_yara(raw=text(response=content).lower())

    if len(full_match_keywords) == 0:
        print('No matches found.')
        continue
```
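To illustrate the categorisation idea, here is a hedged sketch of what a `check_yara`-style helper could return. The rule set and patterns below are hypothetical stand-ins using plain regular expressions; the actual branch uses compiled Yara rules rather than this fallback:

```python
import re

# Hypothetical stand-in for a Yara rule set: category -> keyword patterns.
# A real implementation would compile .yar files via yara-python instead.
RULES = {
    "search-engine": [r"\bsearch\b", r"\blucky\b"],
    "commerce": [r"\bcheckout\b", r"\bcart\b"],
}

def check_yara(raw: str) -> dict:
    """Return the categories whose patterns match the lowercased page text."""
    matches = {}
    for category, patterns in RULES.items():
        hits = [p for p in patterns if re.search(p, raw)]
        if hits:
            matches[category] = hits
    return matches

page = "<html><body>I'm Feeling Lucky - Google Search</body></html>"
full_match_keywords = check_yara(raw=page.lower())
print(full_match_keywords)  # both 'search-engine' patterns match this page
```

This shape (match results keyed by category) is what lets the crawler both record which links matched and index the results by rule.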

_Additionally, resolve the unhandled ValueError(s) raise ValueError("unknown url type: %r" % self.full_url) returned from intermex._
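That ValueError is what urllib raises when asked to open a link with no scheme (e.g. a bare host path scraped from a page). One defensive fix, sketched here with a hypothetical `normalize_url` helper rather than the branch's actual code, is to prepend a scheme before building the request:

```python
from urllib.parse import urlparse

def normalize_url(link: str, default_scheme: str = "http") -> str:
    """Prepend a scheme to scheme-less links so urllib.request
    doesn't raise ValueError('unknown url type: ...')."""
    parsed = urlparse(link)
    if not parsed.scheme:
        return f"{default_scheme}://{link}"
    return link

print(normalize_url("example.onion/page"))  # http://example.onion/page
print(normalize_url("https://google.com"))  # https://google.com
```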

the-siegfried commented 2 years ago

Hi @MikeMeliz, Feel free to take a look at the branch and comment/share ideas here. The branch is still WIP. I have to clean it up, update docstrings and test it yet...

MikeMeliz commented 2 years ago

Alright, just saw the code, that's awesome! I didn't know about yara, but it totally makes sense to have it as an option in the crawler. You're giving this script a great direction, and I'm sure the folks who're using it will love it!!

the-siegfried commented 2 years ago

I am going to add support for a new argument, or accept a value for the 'y' argument, to let users choose whether to perform the keyword search on the text of the document or on the entire HTML response.
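The difference matters because keywords can live in markup (e.g. a meta tag) without appearing in the visible text. A minimal sketch with the stdlib `html.parser`, where `visible_text` is an illustrative helper and not the branch's actual code:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the visible text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def visible_text(document: str) -> str:
    parser = TextExtractor()
    parser.feed(document)
    return " ".join(parser.chunks)

page = '<html><meta name="keywords" content="marketplace"><body>Welcome</body></html>'
print("marketplace" in visible_text(page))  # False: keyword only in the markup
print("marketplace" in page)                # True: full-response search finds it
```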

MikeMeliz commented 2 years ago

Awesome work @the-siegfried ! Sadly, I didn't have much time to sit down and help with this feature, but I'll try this weekend to start writing some tests for the script!

the-siegfried commented 2 years ago

That's awesome - thanks @MikeMeliz! I think I'm happy with this branch now. So do you want to raise a new issue and branch and we can start writing the tests there? Or did you have something else in mind?

MikeMeliz commented 2 years ago

Great, I just saw the PR too! I'll merge it tomorrow so I can play around with it a bit :) I'm planning to transfer the project back to my account, in order to also incorporate Travis CI in the merges. So, I'll raise an issue and make a branch on this repository for writing the tests. I was thinking pytest would be the easiest approach (with just a tests folder inside modules that'll have all the tests). Also, I should remove the guidev branch, as it doesn't really make sense anymore.
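For the proposed layout, a pytest module could look something like the sketch below; the helper being tested is a placeholder, not the repository's real API:

```python
# tests/test_extractor.py -- illustrative layout only; folder_name is
# a placeholder for whatever torcrawl function ends up under test.

def folder_name(website: str) -> str:
    """Placeholder: derive an output folder name from a URL."""
    return website.replace("http://", "").replace("https://", "").rstrip("/")

def test_folder_name_strips_scheme_and_slash():
    assert folder_name("https://google.com/") == "google.com"

def test_folder_name_plain_host():
    assert folder_name("example.onion") == "example.onion"
```

Dropping such files under tests/ means a plain `python -m pytest` from the repository root discovers them automatically.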

the-siegfried commented 2 years ago

Hey, that's great! I look forward to it! Let me know if I can help in any way.