
Stand With Ukraine

WARNING! This application is under development. It is neither complete nor adequately tested.

Strikethrough means a feature is not implemented yet.

crawlserv++

crawlserv++ is an application for crawling websites and analyzing textual content on these websites.

For setting up crawlserv++ on Ubuntu 20+, follow the extensive step-by-step installation guide.

The architecture of crawlserv++ consists of three distinct components:

* the command-and-control server (in the crawlserv directory),
* a web frontend (in the crawlserv_frontend directory), and
* a MySQL database in which all permanent data is stored.

Legal Notice

License: GPL v3

Before using crawlserv++ for crawling websites and other data, please make sure you are legally allowed to do so.

You may not use this software for any purpose that violates applicable law or the rights of others.

Building crawlserv++ on Linux

See .travis.yml for example build environments.

You can clone the complete source code into the current folder using git:

git clone https://github.com/crawlserv/crawlservpp .
git submodule init
git submodule update
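
Alternatively, the submodules can be fetched in a single step while cloning:

git clone --recurse-submodules https://github.com/crawlserv/crawlservpp .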

The following additional components are required to build crawlserv++ on your system:

* Older Linux distributions may only have libtidy-dev v0.9 and liburiparser-dev v0.8.4 available. Install the current versions manually, or add a newer repository, e.g. on Ubuntu via:

echo "deb http://cz.archive.ubuntu.com/ubuntu eoan main universe" | sudo tee -a  /etc/apt/sources.list

After installing these components and cloning or downloading the source code, use the terminal to go to the crawlserv directory inside the downloaded files (where CMakeLists.txt is located) and run the following commands:

mkdir build
cd build
cmake ..
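
Optionally, an optimized release build can be requested explicitly via the standard CMake variable; this is only a suggestion, not a requirement of the project:

cmake .. -DCMAKE_BUILD_TYPE=Release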

In case of missing source files, make sure that you initialized and updated all submodules (see above).

If cmake was successful and shows Build files have been written to: [...], proceed with:

make

You can safely ignore warnings from external libraries as long as make finishes with [100%] Built target crawlserv.

The program should have been built inside the newly created build directory.

Leave this directory with cd .. before running it.

Note that you need to set up a MySQL server and a frontend (e.g. the one in crawlserv_frontend on a web server with PHP support) and personalize your configuration before finally starting the server with ./build/crawlserv config (or any other configuration file as argument). The given default configuration file needs the TOR service running at its default ports, 9050 (SOCKS5 proxy) and 9051 (control port). Also note that, if you want to change the location of the program, you should take the sql folder with you, as it provides the basic commands to initialize the database (creating all the global tables on first connection).
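
A minimal start-up sequence might look like the following sketch; it assumes that TOR has been installed as a systemd service and that the default configuration file is used from inside the crawlserv directory:

# make sure the TOR service is listening on its default ports (9050/9051)
sudo systemctl start tor

# start the command-and-control server with the default configuration file
./build/crawlserv config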

The program will ask you for the password of the chosen MySQL user before it proceeds. When the message Server is up and running. is displayed, switch to the frontend to take control of the command-and-control server.

Even without access to the frontend, you can shut down the server from the terminal by sending a SIGINT signal (CTRL+C). It will wait for all running threads to finish in order to avoid any loss of data.
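
If the server runs in the background or in another terminal, the signal can also be sent explicitly; the following sketch assumes the binary is still named crawlserv:

# send SIGINT to the running server from another terminal
kill -INT $(pidof crawlserv)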

NB! When compiling the sources manually, the following definitions need to be set in advance:

If you use gcc, add the following arguments to set all of these definitions:

-DPCRE2_CODE_UNIT_WIDTH=8 -DRAPIDJSON_NO_SIZETYPEDEFINE -DRAPIDJSON_HAS_STDSTRING -DZLIB_CONST -DJSONCONS_NO_DEPRECATED -DMG_ENABLE_LOG=0 -DMG_MAX_RECV_BUF_SIZE=10000000000 -DNDEBUG
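
For illustration only, a manual compilation of a single translation unit could pass these definitions as follows; the C++ standard and file names are placeholders, not taken from the project's build files:

# hypothetical example: compile one translation unit with all required definitions
g++ -std=c++17 -DPCRE2_CODE_UNIT_WIDTH=8 -DRAPIDJSON_NO_SIZETYPEDEFINE -DRAPIDJSON_HAS_STDSTRING -DZLIB_CONST -DJSONCONS_NO_DEPRECATED -DMG_ENABLE_LOG=0 -DMG_MAX_RECV_BUF_SIZE=10000000000 -DNDEBUG -c example.cpp -o example.o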

Command-and-Control Server

The command-and-control server contains an embedded web server (implemented using the mongoose library) for interaction with the frontend via cross-origin resource sharing (CORS) of JSON data.

In the configuration file, access can (and should) be restricted to specific IPs only.

Source Code Documentation

Documentation

To build the source code documentation you will need doxygen installed. Use the following command inside the root directory of the repository:

doxygen Doxyfile

The documentation will be written to crawlserv/docs.

Server Commands

The server performs commands and sends back their results. Some commands need to be confirmed before actually being performed, and some commands can be restricted by the configuration file loaded when starting the server. The following commands are implemented (as of December 2020):

The commands and their replies use the JSON format (implemented using the RapidJSON library).

File Cache

Apart from these commands, the server automatically handles HTTP file uploads sent as multipart/form-data. The part containing the content of the file needs to be named fileToUpload (case-sensitive). Uploaded files will be saved to the file cache of the server, using random strings of a specific length (defined by Main::randFileNameLength in crawlserv/src/Main/WebServer.hpp) as file names.
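
Such an upload can be tested with curl, for example; the host, port, and path below are placeholders for whatever address the embedded web server is reachable at:

# hypothetical example: upload a file as multipart/form-data
# (the part name fileToUpload is required; the URL is a placeholder)
curl -F "fileToUpload=@example.txt" "http://localhost:8080/upload"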

The cache is also used to store files generated on data export. Files in the cache can be downloaded using the download server command. Note that these files are temporary as the file cache will be cleared and all uploaded and/or generated files deleted as soon as the server is restarted. Permanent data will be written to the database instead.

Example Exchange

Command from frontend to server: Delete the URL list with the ID #1.

{
 "cmd": "deleteurllist",
 "id": 1
}

Response from the server: Command needs to be confirmed.

{
 "confirm": true,
 "text": "Do you really want to delete this URL list?\n!!! All associated data will be lost !!!"
}

Response from the frontend: Confirm command.

{
 "cmd": "deleteurllist",
 "id": 1,
 "confirmed": true
}

Response from the server: Success (otherwise "failed":true would be included in the response).

{
 "text": "URL list deleted."
}

For more information on the server commands, see the documentation of the Main::Server class.
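
For manual testing, such a command exchange could also be performed with curl; the port is the one from the test configuration described below, and the root path is an assumption:

# hypothetical example: send a server command as JSON via HTTP
curl -H "Content-Type: application/json" -d '{"cmd":"deleteurllist","id":1}' "http://localhost:8080/"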

Threads

As can be seen from the commands, the server also manages threads for performing specific tasks. In theory, an indefinite number of parallel threads can be run, limited only by the hardware provided for the server. There are four different modules (i.e. types of threads) that are implemented by inheritance from the abstract Module::Thread class: crawlers, parsers, extractors, and analyzers.

Configurations for these modules are saved as JSON arrays in the shared configs table.

Analyzers are implemented by their own set of subclasses — algorithm classes. The following algorithms are implemented at the moment (as of December 2020):

The server and each thread have their own connections to the database. These connections are handled by inheritance from the Main::Database class. Additionally, thread connections to the database (instances of Module::Database as child class of Main::Database) are wrapped through the Wrapper::Database class to protect the threads (i.e. their programmers) from accidentally using the server connection to the database and thus compromising thread-safety. See the source code documentation of the command-and-control server for further details.

The parser, extractor, and analyzer threads may pre-cache (and therefore temporarily multiply) data in memory, while the crawler threads work directly on the database, which minimizes memory usage. Because the usual bottleneck for crawling and extracting is the requests sent to the crawled/extracted websites, multiple threads are encouraged for these tasks. Multiple threads for parsing can be reasonable when using multiple CPU cores, although some additional memory usage due to the in-memory multiplication of data should be expected, as well as some blocking because of simultaneous database access. At the same time, a slow database connection or server can have a significant impact on performance in any case.

Algorithms need to be specifically optimized for multi-threading. Otherwise, multiple analyzer threads will not improve performance and might even conflict with each other.

Third-party Libraries

The following third-party libraries are used by the command-and-control server:

While Asio, date.h, jsoncons, Mongoose, porter2_stemmer, RapidJSON, UTF8-CPP, and Wapiti are included in the source code and compiled together with the server, all other libraries need to be externally present.

Frontend

NB! The current frontend is a quick-and-dirty solution to test the full functionality of the server. Feel free to implement your own frontend in your favorite programming language: all you need is a read-only connection to the MySQL database and an HTTP connection for exchanging JSON with the command-and-control server. You may also want to use the JSON files provided in crawlserv_frontend/crawlserv/json, as keeping them up to date will inform you about module-specific configuration changes and newly implemented algorithms.

This frontend is a simple HTML/PHP and JavaScript application that has its own read-only access to the database and can (under certain conditions) interact with the command-and-control server using the commands listed above when the user wants to perform actions that could change the content of the database.

It provides the following menu structure:

Third-party Code

The frontend uses the following third-party JavaScript code (to be found in crawlserv_frontend/crawlserv/js/external):

Configuration

The server needs a configuration file as its argument; the test configuration can be found at crawlserv/config. Replace the values in this file with those for your own database and server settings. The password for granting the server (full) access to the database will be prompted when starting the server.

The frontend uses config.php to gain read-only access to the database. For security reasons, the database account used by the frontend should only have the SELECT privilege! See this file for details about the test configuration (including the database schema and the user name and password for read-only access to the test database). Replace these values with those for your own database.
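
As a sketch, a suitable read-only account could be created as follows; the schema name, user name, and password are placeholders only:

# hypothetical example: create a read-only MySQL account for the frontend
mysql -u root -p -e "CREATE USER 'frontend'@'localhost' IDENTIFIED BY 'choose-a-password'; GRANT SELECT ON crawlserv.* TO 'frontend'@'localhost';"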

The testing environment consists of one PC that runs all three components of the application, which can only be accessed locally (using localhost). Therefore, the (randomly created) password in config.php is irrelevant for usage outside the original test environment and needs to be replaced! In this (test) case, the command-and-control server uses port 8080 for interaction with the frontend, while the web server running the frontend uses port 80 for interaction with the user (i.e. their web browser). The MySQL database server uses the default port 3306.

Please note that the MySQL server used by crawlserv++ might need some adjustments. First of all, the default character set should be set to standard UTF-8 (utf8mb4). Second, when processing large data, max_allowed_packet should be adjusted, and maybe even set to the maximum value of 1 GiB. See this example mysql.cnf:

[mysqld]
character-set-server = utf8mb4
max_allowed_packet = 1G
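
To check that the server actually uses these values, they can be queried from the command line, assuming the mysql client is installed:

# query the effective server settings
mysql -u root -p -e "SHOW VARIABLES LIKE 'character_set_server'; SHOW VARIABLES LIKE 'max_allowed_packet';"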

On the client side, crawlserv++ will set these values automatically.

Using some algorithms on large corpora may require a large amount of memory. Consider increasing the size of your swap if memory usage reaches its limit, to prevent the server from being killed by the operating system.
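
On Ubuntu, additional swap space could be added along the following lines; the size of 16 GiB is just an example:

# hypothetical example: add a 16 GiB swap file
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile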

Database

The application uses exactly one database schema and all tables are prefixed with crawlserv_.

The following main tables are created and used:

If they do not exist already, these tables will be created on startup of the command-and-control server by executing the SQL commands in crawlserv/sql/init.sql. See this file for details about the structure of these tables. The result tables specified in crawlserv_parsedtables, crawlserv_extractedtables and crawlserv_analyzedtables will be created by the different modules as needed (with the structure required for the specified tasks).

For each website and each URL list, a namespace of at least four allowed characters (a-z, A-Z, 0-9, $, _) is used. These namespaces determine the names of the tables used for each URL list (also prefixed by crawlserv_):

See the source code of the addUrlList(...) function in Main::Database for details about the structure of the non-result tables. Most of the columns of the result tables are specified by the respective parsing, extracting and analyzing configurations. See the code of the initTargetTable(...) functions in Module::Parser::Database, Module::Extractor::Database and Module::Analyzer::Database accordingly.

Platform

At the moment, this software has been developed for and tested on Linux only.

Developed with Eclipse 2020-03 (4.15.0), Eclipse CDT 9.11.0, Eclipse PDT 7.1.0 and Eclipse Web Tools Platform 0.6.100. Compiled and linked with GNU Make 4.2.1, cmake/ccmake 3.16.3, gcc 9.3.0. Tested with Apache/2.4.41 and MySQL 8.0.21 on Ubuntu 20.04.1 LTS [focal] and Ubuntu 21.10 [impish] (both 64-bit).

The frontend is optimized for current versions of Mozilla Firefox (e.g. v79.0), but should also run on Chromium (e.g. v84.0), and other modern browsers.