DarkWeb Crawler and Indexer

A basic scraper written in Python with BeautifulSoup and Tor support, based on the open-source TorSpider. It crawls the onion links listed in onions.txt and saves the fetched pages as HTML files, which are then indexed with a search engine built on Apache Solr.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

[sudo] apt-get install python3 python3-dev
[sudo] pip3 install -r requirements.txt

TL;DR: We recommend installing our system inside a virtual environment on all platforms.

Python packages can be installed either globally (a.k.a. system-wide) or in user-space. We do not recommend installing TorScrapper system-wide.

Instead, we recommend that you install our system within a so-called “virtual environment” (virtualenv). Virtualenvs allow you to avoid conflicts with already-installed Python system packages (which could break some of your system tools and scripts) while still installing packages normally with pip (without sudo and the like).

To get started with virtual environments, see virtualenv installation instructions. To install it globally (having it globally installed actually helps here), it should be a matter of running:

[sudo] pip install virtualenv
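
For example, a minimal virtualenv-based setup might look like the following (the environment name torscraper-env is only an illustrative choice, not something the project mandates):

virtualenv -p python3 torscraper-env
source torscraper-env/bin/activate
pip3 install -r requirements.txt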

Basic setup

Before you run TorScrapper, make sure the Tor service is running and your torrc file is configured properly.

Read more about torrc here: Torrc
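
As a rough illustration, assuming the scraper reaches Tor through the default SOCKS port (9050) on a Debian/Ubuntu system, the relevant torrc line and service commands would look something like this:

# in /etc/tor/torrc - enable Tor's default SOCKS listener
SOCKSPort 9050

[sudo] service tor start
[sudo] service tor status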

Deployment

A step-by-step series of examples that show what you have to do to get this project running:

[nano]/[vim]/[gedit]/[Your choice of editor] onions.txt
[sudo] python3 TorScrapper.py
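
The exact layout of onions.txt is not shown here, but a reasonable assumption is one seed URL per line, for example (the addresses below are placeholders, not real services):

http://exampleonionaddressaaaa.onion
http://exampleonionaddressbbbb.onion

Running TorScrapper.py then crawls these links and saves the fetched pages, which are the HTML files later posted to Solr.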

Built With

Python 3 and BeautifulSoup, with Tor support, for the crawler; Apache Solr 8.0.0 for indexing and search.

Indexing with Apache Solr-8.0.0

Pre-requirements

Apache Solr 8.0.0 requires Java 8 or higher, so make sure a suitable JRE/JDK is installed before starting Solr.

Basic Installation Setup and Commands
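
The steps below assume the Solr 8.0.0 archive (solr-8.0.0.tgz) is already present in your home directory; if it isn't, it can be downloaded from the Apache archive first, for example:

wget -P ~/ https://archive.apache.org/dist/lucene/solr/8.0.0/solr-8.0.0.tgz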

cd ~/
tar zxf solr-8.0.0.tgz

Once extracted, Apache Solr is ready to run. From the solr-8.0.0 directory, the server can be started, checked on and stopped with:

bin/solr start
bin/solr status
bin/solr stop

Indexing the crawled files

Following step-by-step instructions will help you to index the crawled files.

solr-8.0.0:$ ./bin/solr start -e cloud

To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]:[Enter]
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:[Press Enter]
Please enter the port for node2 [7574]:[Press Enter]

Now let's create a new collection for indexing documents in your 2-node cluster.
Please provide a name for your new collection: [gettingstarted]
[enter your desired collection name. The name we have used here is **crawled**]
How many shards would you like to split crawled into? [2]
How many replicas per shard would you like to create? [2]
Please choose a configuration for the crawled collection, available options are:
_default or sample_techproducts_configs [_default]
[Press Enter]
solr-8.0.0:$ bin/post -c crawled [path_where_the_crawled_files_are_stored]/*
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/crawled/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file [filename_1].html (text/html) to [base]
POSTing file [filename_2].html (text/html) to [base]
POSTing file [filename_3].html (text/html) to [base]
POSTing file [filename_4].html (text/html) to [base]
.....

[Total_number_of_files] files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/crawled/update...
Time spent:
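
Once the files are indexed, a quick way to confirm that the crawled collection is populated is to run a match-all query against Solr's select endpoint (the query below is only a sanity check, not part of the project's scripts):

curl "http://localhost:8983/solr/crawled/select?q=*:*&rows=5"

The Solr Admin UI at http://localhost:8983/solr can also be used to browse and query the collection.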

Post Indexing - Start the Web page

NOTE: The search page queries Solr (at http://localhost:8983) directly from the browser, so if a CORS error occurs while searching in Chrome, add a CORS extension to the browser.

Authors